Well... you could do it manually with a regex, like:
re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr((ord(m.group(1)) - 0xD800 << 10) + ord(m.group(2)) - 0xDC00 + 0x10000),
    s
)
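For readers on Python 3, where unichr no longer exists, the same arithmetic can be sketched with chr. This is a minimal illustration rather than the answer's own code; the names SURROGATE_PAIR and merge_surrogates are made up for the example. As a worked case, the pair D83D DC4D gives (0x3D << 10) + 0x4D + 0x10000 = 0x1F44D.

```python
import re

# Matches a high surrogate immediately followed by a low surrogate.
# (SURROGATE_PAIR and merge_surrogates are hypothetical names for this sketch.)
SURROGATE_PAIR = re.compile('([\ud800-\udbff])([\udc00-\udfff])')


def merge_surrogates(text):
    """Replace each surrogate pair in text with the single code point it encodes."""
    def combine(match):
        high = ord(match.group(1)) - 0xD800   # top 10 bits
        low = ord(match.group(2)) - 0xDC00    # bottom 10 bits
        return chr((high << 10) + low + 0x10000)
    return SURROGATE_PAIR.sub(combine, text)


merged = merge_surrogates('hello \ud83d\udc4d')
print(len(merged))             # length after merging the pair
print(hex(ord(merged[-1])))    # code point of the merged character
```

Lone surrogates that are not part of a pair are left untouched by this sketch.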
The @invoke trick used below is a way to avoid repeat computation without adding anything to the module's __dict__.
invoke = lambda f: f()  # trick taken from AJAX frameworks

@invoke
def codepoint_count():
    testlength = len(u'\U00010000')  # pre-compute once
    assert (testlength == 1) or (testlength == 2)
    if testlength == 1:
        def closure(data):  # count function for "wide" interpreter
            u'returns the number of Unicode code points in a unicode string'
            return len(data.encode('UTF-16BE').decode('UTF-16BE'))
    else:
        def is_surrogate(c):
            # True only for high (leading) surrogates, so each pair is counted once
            ordc = ord(c)
            return (ordc >= 55296) and (ordc < 56320)
        def closure(data):  # count function for "narrow" interpreter
            u'returns the number of Unicode code points in a unicode string'
            return len(data) - len(filter(is_surrogate, data))
    return closure

assert codepoint_count(u'hello \U0001f44d') == 7
assert codepoint_count(u'hello \ud83d\udc4d') == 7
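The snippet above targets Python 2, where the interpreter may be a "narrow" or "wide" build. As a rough Python 3 counterpart (an assumption-laden sketch, not part of the original answer), a plain loop can count code points while treating a high surrogate followed by a low surrogate as a single character. The function name count_code_points is made up for illustration.

```python
def count_code_points(text):
    """Count code points, treating a high+low surrogate pair as one character.

    count_code_points is a hypothetical name used only for this sketch.
    """
    count = 0
    i = 0
    while i < len(text):
        is_pair = (
            '\ud800' <= text[i] <= '\udbff'
            and i + 1 < len(text)
            and '\udc00' <= text[i + 1] <= '\udfff'
        )
        i += 2 if is_pair else 1   # skip both halves of a pair
        count += 1
    return count


print(count_code_points('hello \U0001f44d'))    # astral character written directly
print(count_code_points('hello \ud83d\udc4d'))  # same character as a surrogate pair
```

Both calls give the same count, mirroring the two assertions in the Python 2 version.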
The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values; the actual number assigned is less than that). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).

The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.

The unicodedata module can display information about characters, such as their names, and report the numeric value of a particular character.

The Unicode standard contains a lot of tables listing characters and their corresponding code points:
0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ'; ROMAN NUMERAL EIGHT
2168    'Ⅸ'; ROMAN NUMERAL NINE
...
265E    '♞'; BLACK CHESS KNIGHT
265F    '♟'; BLACK CHESS PAWN
...
1F600   '😀'; GRINNING FACE
1F609   '😉'; WINKING FACE
...
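The rows of the table above can be reproduced from Python 3 with ord() and the standard unicodedata module. The following is a small sketch; the helper name describe is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return a 'U+XXXX NAME' description of a single character.

    describe is a hypothetical helper used only for this sketch.
    """
    return 'U+{:04X} {}'.format(ord(ch), unicodedata.name(ch))


for ch in 'a{♞😀':
    print(describe(ch))

print(ord('♞'))  # decimal value of U+265E
```

Each printed line pairs the hexadecimal code point with the official character name, in the same spirit as the table.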
   P           y           t           h           o           n
0x 50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
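The byte layout above looks like a 32-bit, little-endian encoding of the string "Python", four bytes per code point. Assuming that reading is right, Python 3's built-in utf-32-le codec reproduces it; the helper name hex_bytes is made up for this sketch.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated two-digit hex values.

    hex_bytes is a hypothetical helper used only for this sketch.
    """
    return ' '.join('{:02x}'.format(b) for b in data)


encoded = 'Python'.encode('utf-32-le')  # no byte-order mark with the -le variant
print(len(encoded))        # 6 characters * 4 bytes each
print(hex_bytes(encoded))
```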
try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")
répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")
>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
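The same error-handler argument applies in the other direction, to str.encode(). As a short Python 3 sketch (the helper name encode_ascii is hypothetical), this tries several standard handlers when encoding 'café' to ASCII, which cannot represent 'é'.

```python
def encode_ascii(text, errors):
    """Encode text to ASCII with the given error handler.

    encode_ascii is a hypothetical helper used only for this sketch.
    """
    return text.encode('ascii', errors)


for handler in ('replace', 'ignore', 'backslashreplace', 'xmlcharrefreplace'):
    print(handler, encode_ascii('café', handler))

# The default "strict" handler raises instead of substituting.
try:
    encode_ascii('café', 'strict')
except UnicodeEncodeError as exc:
    print('strict failed at position', exc.start)
```

Which handler is appropriate depends on whether losing or altering characters is acceptable for the output in question.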
If you get some piece of text from a library, read from a file, etc., turn it into a unicode string immediately. Since Python is moving in the direction of unicode strings everywhere, it's going to be easier to work with unicode strings within your code.

If your code is heavily involved with using things that are bytes, you can do the opposite: convert all text into byte str at the border and only convert to unicode when you need it for passing to another library or performing string operations on it.

Anytime you call a function, you need to evaluate whether that function will do the right thing with str or unicode values. Sending the wrong value here will lead to a UnicodeError being raised when the string contains non-ASCII characters.

The kitchen library provides a wide array of functions to help you deal with byte str and unicode strings in your program. The sessions below show the failure that occurs when a unicode string containing non-ASCII characters is written out without being encoded first, and the fix of encoding it explicitly:
>>> string = unicode(raw_input(), 'utf8')
café
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
>>> string = unicode(raw_input(), 'utf8')
café
>>> string_for_output = string.encode('utf8', 'replace')
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string_for_output)
>>>
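The session above follows the "convert at the border" advice: decode on input, work with text, encode on output. A minimal Python 3 sketch of that pattern is given here, assuming UTF-8 at both borders; the function name shout_bytes is made up for illustration and does not come from kitchen.

```python
def shout_bytes(raw, encoding='utf-8'):
    """Upper-case encoded text, converting only at the borders.

    shout_bytes is a hypothetical function used only for this sketch.
    """
    text = raw.decode(encoding)      # input border: bytes -> text
    result = text.upper()            # all string operations happen on text
    return result.encode(encoding)   # output border: text -> bytes


incoming = 'café'.encode('utf-8')    # stands in for bytes read from a file or socket
outgoing = shout_bytes(incoming)
print(outgoing.decode('utf-8'))
```

Keeping the decode and encode calls at the edges means the code in between never has to guess which kind of string it is holding.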
$ python
>>> print u'café'
café
$ LC_ALL=C python
>>> # Note: if you're using a good terminal program when running in the C locale
>>> # The terminal program will prevent you from entering non-ASCII characters
>>> # python will still recognize them if you use the codepoint instead:
>>> print u'caf\xe9'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
print u'café'
$ ./test.py >t
Traceback (most recent call last):
File "./test.py", line 4, in <module>
print u'café'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'café'
$ ./test.py >t
$ cat t
café
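The codecs.getwriter approach in the script above also works in Python 3, provided the wrapped stream accepts bytes. The sketch below uses an in-memory io.BytesIO buffer as a stand-in for a redirected output file so that it is self-contained; the function name write_utf8 is hypothetical.

```python
import codecs
import io


def write_utf8(stream, text):
    """Write text to a binary stream, encoding it as UTF-8 on the way out.

    write_utf8 is a hypothetical helper used only for this sketch.
    """
    writer = codecs.getwriter('utf8')(stream)  # wraps the byte stream
    writer.write(text)


buffer = io.BytesIO()             # stands in for a file opened in binary mode
write_utf8(buffer, 'café\n')
print(buffer.getvalue())          # the UTF-8 bytes that reached the stream
```

In Python 3 the binary stream underneath standard output is available as sys.stdout.buffer, so the same wrapper could be pointed there instead of at the in-memory buffer.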