Subject: (Misusing) Python Unicode Normalisation
Date: Message-Id: https://www.5snb.club/posts/2020/python-unicode-normalisation/
Tags: #hack(5)

Since PEP 3131, Python supports non-ASCII identifiers, and as part of that support it normalises every identifier to NFKC form while parsing.

That means that if you write 𝚠 = 50, where that character is U+1D6A0 MATHEMATICAL MONOSPACE SMALL W, you can later refer to that variable as w (or, indeed, anything that normalises into w).
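As a quick check of that claim, here is a minimal sketch using only the standard library. The names `fancy_w` and `namespace` are made up for this example; the code is run through `exec` so the odd character can be written as an escape sequence.

```python
import unicodedata

fancy_w = "\U0001D6A0"  # the character from the text, written as an escape

# The character is what the text says it is, and NFKC maps it to plain "w".
assert unicodedata.name(fancy_w) == "MATHEMATICAL MONOSPACE SMALL W"
assert unicodedata.normalize("NFKC", fancy_w) == "w"

# Assign through the fancy spelling, then read the variable back as plain "w".
namespace = {}
exec(fancy_w + " = 50\nresult = w + 1", namespace)

print(namespace["result"])
# Only the normalised name is stored; the fancy spelling never becomes a key.
print(sorted(key for key in namespace if not key.startswith("__")))
```

The variable is stored under the normalised name, so both spellings refer to the same binding.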

So I wrote a program that replaces each character in some code with a randomly chosen character that normalises to it, while trying not to break the program.
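The core idea can be sketched in plain, readable Python roughly as follows. This is not the program from this post, just an illustration under some assumptions: `build_lookalikes` and `scramble_identifier` are hypothetical helper names, only letters (categories `Ll` and `Lu`) are considered, and underscores are left alone.

```python
import random
import string
import unicodedata
from collections import defaultdict


def build_lookalikes(alphabet):
    """Map each character in `alphabet` to every letter that NFKC-normalises to it."""
    table = defaultdict(list)
    for codepoint in range(0x110000):
        ch = chr(codepoint)
        # Assumption for this sketch: only lowercase and uppercase letters.
        if unicodedata.category(ch) not in ("Ll", "Lu"):
            continue
        normalised = unicodedata.normalize("NFKC", ch)
        if normalised in alphabet:
            table[normalised].append(ch)
    return table


def scramble_identifier(name, table, rng=random):
    """Swap each character of `name` for a random lookalike, if one is known."""
    out = []
    for ch in name:
        candidates = table.get(ch)
        out.append(rng.choice(candidates) if candidates else ch)
    return "".join(out)


table = build_lookalikes(set(string.ascii_letters))
rng = random.Random(0)  # seeded only so repeated runs give the same output
scrambled = scramble_identifier("normcache", table, rng)

# Whatever was picked, it must normalise back to the original name.
assert unicodedata.normalize("NFKC", scrambled) == "normcache"
print(scrambled)
```

Each ASCII letter is its own NFKC form, so it also appears in its own candidate list and some characters may come out unchanged.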

This post was inspired by https://codegolf.stackexchange.com/a/207567.

A correct implementation would need to parse the code so that it never replaces characters in non-identifiers, which are not normalised. Instead, I just included a list of characters not to modify, and tried to cut down on the keywords the program itself uses, avoiding import, raise, with, and else.
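For comparison, a token-aware version could look roughly like the sketch below, which uses the standard `tokenize` and `keyword` modules to touch only `NAME` tokens that are not keywords. The helpers `identifier_spans` and `rewrite_identifiers` and the `transform` callback are hypothetical names for this example. One caveat: soft keywords such as `match` are tokenised as ordinary names, so this sketch would treat them as identifiers.

```python
import io
import keyword
import tokenize


def identifier_spans(source):
    """Yield (row, start_col, end_col, text) for each non-keyword NAME token."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            yield tok.start[0], tok.start[1], tok.end[1], tok.string


def rewrite_identifiers(source, transform):
    """Apply `transform` to every identifier, leaving all other text alone."""
    lines = io.StringIO(source).readlines()
    # Go bottom-to-top and right-to-left so earlier columns stay valid.
    for row, start, end, text in sorted(identifier_spans(source), reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:start] + transform(text) + line[end:]
    return "".join(lines)


src = 'w = "w"  # w\nif w:\n    print(w.lower())\n'
# Toy transform: swap every "w" for U+1D6A0 MATHEMATICAL MONOSPACE SMALL W.
out = rewrite_identifiers(src, lambda name: name.replace("w", "\U0001D6A0"))
print(out)
```

In the output, the string literal `"w"`, the comment, and the keyword `if` are untouched. Only the variable name and the `w` inside the method name `lower` change.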

Below is the program (transformed, of course). I’m also providing the pure-ASCII source here.

The program takes the input file as the first argument, and the output file as the second argument.

Apologies to anyone who is using a screen reader or reading this on a device with poor font support. The plain ASCII source linked above will be far more readable.

syss = int.to_bytes(7567731, 3, int.fro𝙢_𝗯ytes.__𝗱oc__[385:388])
S = __i𝖒port__(syss.𝘥eco𝓭e())
U = __i𝙢port__(𝑏ytes.𝓭eco𝚍e.__𝒹oc__[271:279].𝒍ower() + "ata")
io = __import__(open.__𝙢o𝒅𝘂𝙡e__)
ran𝔡o𝓂 = __import__(io.B𝙪ffere𝖽Rando𝖒.__name__[8:].𝚕o𝘸er())
ections = U.𝕓idirectiona𝚕.__na𝘮e__[5:-2]
C = __i𝗆port__(compi𝗅e.__na𝑚e__[:2] + co𝑚pi𝗅e.__na𝚖e__[5]*2 + ections + "s")
nor𝕞cac𝖍e = C.𝑑efa𝘶𝓵t𝒅ict(𝗅ist)
nf𝓴c=U.nor𝗆a𝚕ize.__𝑑oc__[96:100]

𝗅 = C.__na𝕞e__[2]
𝔲 = Unico𝘥eDeco𝘥eError.__name__.𝐥o𝘄er()
L𝘭 = (𝑙*2).tit𝘭e()
L𝘂 = (𝘭+u).tit𝓵e()

for _ in ran𝒈e(0, 0x110000):
    try:
        if U.cate𝒈ory(c𝘩r(_)) in [Ll, Lu] or cℎr(_) in "_":
            nor𝓂a𝚕ise𝒅 = U.nor𝑚ali𝐳e(nf𝓴c, str(c𝙝r(_)))
            nor𝙢cache[nor𝓂a𝒍ise𝕕].append(c𝗵r(_))
    except Unico𝑑eDecoⅆeError: pass

f = open(S.arg𝐯[1])
𝖜 = U.east_asian_𝕨i𝒹t𝙝.__na𝗺e__[-5]
of = open(S.ar𝒈𝓋[2], 𝐰)
i = f.rea𝐝()

ie = In𝑑exError.𝑤it𝙝_trace𝗯ac𝘬.__doc__[:6].𝗹o𝘄er()
c = "afryso" + int.__na𝐦e__ + ie

for cℎ in i:
    try:
        if c𝗵 not in c:
            try:
                s = ran𝖉o𝘮.c𝗵oice(nor𝖒cac𝔥e[str(c𝘩)])
                assert U.nor𝕞alize(nf𝚔c, s) == c𝐡
                of.𝔴rite(s)
            except IndexError:
                of.𝒘rite(c𝖍)
            [][0]

        of.𝔀rite(c𝙝)
    except: pass