Converting Unicode Sequence To String In Python3 But Allow Paths In String

There is at least one related question on SO that proved useful when trying to decode unicode sequences. I am preprocessing a lot of texts with a lot of different genres. Some are

Solution 1:

The input is ambiguous, so no right answer exists in the general case. We could use heuristics that produce output that looks right most of the time. For example, we could use a rule such as "if a \uxxxx sequence (6 chars) is part of an existing path, then don't interpret it as a Unicode escape", and the same for \Uxxxxxxxx (10 chars) sequences. Under that rule, an input similar to the one from the question, b"c:\\U0001f60f\\math.dll", is interpreted differently depending on whether the file c:\U0001f60f\math.dll actually exists on the disk:

#!/usr/bin/env python3
import re
from pathlib import Path

hex_digit = "[0-9a-fA-F]"
unicode_escape = fr"(?:\\u{hex_digit}{{4}}|\\U{hex_digit}{{8}})"
drive_letter = "[a-zA-Z]"


def decode_unicode_escape_if_path_doesnt_exist(m):
    # Keep the matched text verbatim if it names an existing file.
    path = m.group(0)
    return path if Path(path).exists() else replace_unicode_escapes(path)


def replace_unicode_escapes(text):
    # Decode each run of consecutive \uXXXX / \UXXXXXXXX escapes.
    return re.sub(
        fr"{unicode_escape}+",
        lambda m: m.group(0).encode("latin-1").decode("raw-unicode-escape"),
        text,
    )


input_text = Path('broken.txt').read_text(encoding='ascii')
# Only path-like tokens (drive letter, colon, at least one escape) are examined.
print(
    re.sub(
        fr"{drive_letter}:\S*{unicode_escape}\S*",
        decode_unicode_escape_if_path_doesnt_exist,
        input_text,
    )
)

Specify the actual encoding of your broken.txt file in the read_text() call if the encoded text contains non-ASCII characters.

What specific regex to use to extract paths depends on the type of input that you get.

You could complicate the code by trying to substitute one possible Unicode sequence at a time. The number of replacements grows exponentially with the number of candidates in this case: if there are 10 possible Unicode escape sequences in a path, then there are 2**10 decoded paths to try.
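A minimal sketch of that enumeration, assuming each escape is independently either kept or decoded. The names candidate_decodings, decode_escape and existing_candidates are hypothetical helpers, not part of the code above, and chr(int(...)) is used here in place of the codec-based decoding:

```python
import re
from itertools import product
from pathlib import Path

HEX = "[0-9a-fA-F]"
UNICODE_ESCAPE = re.compile(fr"\\u{HEX}{{4}}|\\U{HEX}{{8}}")


def decode_escape(escape):
    # Drop the leading "\u" or "\U" and parse the hex digits.
    code_point = int(escape[2:], 16)
    # chr() rejects values above U+10FFFF, so leave such escapes untouched.
    return chr(code_point) if code_point <= 0x10FFFF else escape


def candidate_decodings(text):
    """Yield every variant of text with each escape either kept or decoded."""
    pieces = UNICODE_ESCAPE.split(text)     # literal text between escapes
    escapes = UNICODE_ESCAPE.findall(text)  # the escapes themselves, in order
    for choices in product((False, True), repeat=len(escapes)):
        out = [pieces[0]]
        for escape, decode, piece in zip(escapes, choices, pieces[1:]):
            out.append(decode_escape(escape) if decode else escape)
            out.append(piece)
        yield "".join(out)


def existing_candidates(text):
    # Keep only the variants that name something present on disk.
    return [c for c in candidate_decodings(text) if Path(c).exists()]


# Three escapes give 2**3 variants.
print(len(list(candidate_decodings(r"c:\u0041\u0042\u0043"))))  # 8
```

If existing_candidates() returns exactly one variant, that is a plausible reading of the path. Zero or several results mean the input is still ambiguous.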

Solution 2:

The raw_unicode_escape codec in the ignore mode seems to do the trick. I'm inlining the input as a raw triple-quoted bytes literal here, which by my reasoning should be equivalent to reading it from a binary file.

input = br"""
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
"""
print(input.decode('raw_unicode_escape', 'ignore'))

outputs

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:s\math.dll (op Windows)).

Note that the \udf in d:\udfs gets mangled, as the codec attempts to start reading an \uXXXX sequence, but gives up at the s.

An alternative (likely slower) would be to use a regexp to find the valid Unicode escape sequences within the decoded data. This assumes that decoding the full input as UTF-8 is possible, though. (The .encode().decode() dance is necessary since strings can't be decoded, only bytes can. One could also use chr(int(m.group(0)[2:], 16)).)

import re

escape_re = re.compile(r'\\u[0-9a-f]{4}')
output = escape_re.sub(lambda m: m.group(0).encode().decode('unicode_escape'), input.decode())

outputs

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).

Since \udf doesn't have 4 hexadecimal characters, the d:\udfs path is spared here.
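For reference, here is a small sketch of the chr(int(...)) variant mentioned earlier. It works on str directly, so no encode/decode round trip is needed. The function name decode_u_escapes is made up for this example, and unlike the regex above it also accepts upper-case hex digits, which is my assumption about what is wanted:

```python
import re

# Matches \uXXXX with exactly four hex digits, upper or lower case.
ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    # Group 1 holds the four hex digits; turn them into the code point's character.
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    sample = r"Vojt\u0115ch \u010camek, d:\udfs\math.dll"
    print(decode_u_escapes(sample))
    # Vojtĕch Čamek, d:\udfs\math.dll
```

As with the regex version above, d:\udfs is left alone because s is not a hex digit. A path segment that happens to start with u followed by four hex digits would still be decoded.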

Solution 3:

I had already written this code when AKX posted his answer. I still think it applies.

The idea is to capture Unicode sequence candidates with a regex and to try to exclude paths, i.e. parts that are preceded by any letter and a colon (e.g. c:\udfff). If decoding fails, we return the original string.

import re


def unicode_replace(s):
    # Directory paths in a text are seen as unicode sequences but will fail to decode, e.g. d:\udfs\math.dll
    # In case of such failure, we'll pass on these sentences - we don't try to decode them but leave them
    # as-is. Note that this may leave some unicode sequences alive in your text.
    def repl(match):
        match = match.group()
        try:
            return match.encode('utf-8').decode('unicode-escape')
        except UnicodeDecodeError:
            return match

    return re.sub(r'(?<!\b[a-zA-Z]:)(\\u[0-9A-Fa-f]{4})', repl, s)


# The function must be defined before this block runs.
with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = fin.read().strip()
    lines = unicode_replace(lines)
    fout.write(lines)
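To illustrate what the lookbehind does and does not protect, here is a short sketch that applies the same pattern to three made-up sample strings. The helper decode is hypothetical and uses chr(int(...)) instead of the codec round trip, purely to keep the example short:

```python
import re

# Same pattern as above: skip a \uXXXX that directly follows "<letter>:".
PATTERN = re.compile(r"(?<!\b[a-zA-Z]:)(\\u[0-9A-Fa-f]{4})")


def decode(match):
    # match.group(1) is the whole escape, e.g. "\u0041"; drop the "\u" prefix.
    return chr(int(match.group(1)[2:], 16))


if __name__ == "__main__":
    for sample in (r"Vojt\u0115ch", r"c:\u0041bc", r"c:\dir\u0041bc"):
        print(PATTERN.sub(decode, sample))
    # Vojtĕch
    # c:\u0041bc
    # c:\dirAbc
```

The lookbehind only guards an escape that sits immediately after the drive letter and colon. An escape-like segment deeper in the path, as in the third sample, is still decoded.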
