The behaviour depends on the Python version and the environment. On Python 3 the character encoding error handler for sys.stderr is always 'backslashreplace':
from __future__ import unicode_literals, print_function
import sys

s = 'unicode "\u2323" smile'
print(s)
print(s, file=sys.stderr)
try:
    raise RuntimeError(s)
except Exception as e:
    print(e.args[0])
    print(e.args[0], file=sys.stderr)
    raise
Output with PYTHONIOENCODING=ascii:ignore, first on Python 3 and then on Python 2:
$ PYTHONIOENCODING=ascii:ignore python3 raise_unicode.py
unicode "" smile
unicode "\u2323" smile
unicode "" smile
unicode "\u2323" smile
Traceback (most recent call last):
File "raise_unicode.py", line 8, in <module>
raise RuntimeError(s)
RuntimeError: unicode "\u2323" smile
$ PYTHONIOENCODING=ascii:ignore python2 raise_unicode.py
unicode "" smile
unicode "" smile
unicode "" smile
unicode "" smile
Traceback (most recent call last):
File "raise_unicode.py", line 8, in <module>
raise RuntimeError(s)
RuntimeError
For comparison:
$ python3 raise_unicode.py
unicode "⌣" smile
unicode "⌣" smile
unicode "⌣" smile
unicode "⌣" smile
Traceback (most recent call last):
File "raise_unicode.py", line 8, in <module>
raise RuntimeError(s)
RuntimeError: unicode "⌣" smile
Example of the problem, with its output (Python 2.7, Linux):
# -*- coding: utf-8 -*-
desc = u'something bad with field ¾'
raise SyntaxError(desc)
It will print only a truncated or garbled message:
~/.../sources/C_patch$ python SO.py
Traceback (most recent call last):
File "SO.py", line 25, in <module>
raise SyntaxError(desc)
SyntaxError
To actually see the unaltered unicode, you can encode it to raw bytes and feed those into the exception object:
# -*- coding: utf-8 -*-
desc = u'something bad with field ¾'
raise SyntaxError(desc.encode('utf-8', 'replace'))
User code can raise built-in exceptions. This can be used to test an exception handler or to report an error condition “just like” the situation in which the interpreter raises the same exception; but beware that there is nothing to prevent user code from raising an inappropriate error.

ValueError is raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError.

GeneratorExit is raised when a generator or coroutine is closed; see generator.close() and coroutine.close(). It directly inherits from BaseException instead of Exception since it is technically not an error.

When a generator or coroutine function returns, a new StopIteration instance is raised, and the value returned by the function is used as the value parameter to the constructor of the exception.
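As a small illustration of that last point, the value a generator function returns becomes the value attribute of the StopIteration it raises (a minimal sketch; the generator here is invented for the example):

def gen():
    yield 1
    return 'done'              # becomes StopIteration.value

g = gen()
next(g)                        # -> 1
try:
    next(g)
except StopIteration as exc:
    print(exc.value)           # prints: done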
raise new_exc from original_exc
try:
    ...
except SomeException:
    tb = sys.exc_info()[2]
    raise OtherException(...).with_traceback(tb)
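A minimal, runnable sketch of explicit chaining (the function and message here are invented for the example); raise ... from records the original exception as __cause__, which the default traceback printer then reports as the direct cause:

def parse(value):
    try:
        return int(value)
    except ValueError as original_exc:
        # The new exception keeps a reference to the one that triggered it.
        raise RuntimeError('could not parse %r' % value) from original_exc

try:
    parse('not a number')
except RuntimeError as e:
    print(repr(e.__cause__))   # the original ValueError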
BaseException
 +-- SystemExit
 +-- KeyboardInterrupt
 +-- GeneratorExit
 +-- Exception
      +-- StopIteration
      +-- StopAsyncIteration
      +-- ArithmeticError
      |    +-- FloatingPointError
      |    +-- OverflowError
      |    +-- ZeroDivisionError
      +-- AssertionError
      +-- AttributeError
      +-- BufferError
      +-- EOFError
      +-- ImportError
      |    +-- ModuleNotFoundError
      +-- LookupError
      |    +-- IndexError
      |    +-- KeyError
      +-- MemoryError
      +-- NameError
      |    +-- UnboundLocalError
      +-- OSError
      |    +-- BlockingIOError
      |    +-- ChildProcessError
      |    +-- ConnectionError
      |    |    +-- BrokenPipeError
      |    |    +-- ConnectionAbortedError
      |    |    +-- ConnectionRefusedError
      |    |    +-- ConnectionResetError
      |    +-- FileExistsError
      |    +-- FileNotFoundError
      |    +-- InterruptedError
      |    +-- IsADirectoryError
      |    +-- NotADirectoryError
      |    +-- PermissionError
      |    +-- ProcessLookupError
      |    +-- TimeoutError
      +-- ReferenceError
      +-- RuntimeError
      |    +-- NotImplementedError
      |    +-- RecursionError
      +-- SyntaxError
      |    +-- IndentationError
      |         +-- TabError
      +-- SystemError
      +-- TypeError
      +-- ValueError
      |    +-- UnicodeError
      |         +-- UnicodeDecodeError
      |         +-- UnicodeEncodeError
      |         +-- UnicodeTranslateError
      +-- Warning
           +-- DeprecationWarning
           +-- PendingDeprecationWarning
           +-- RuntimeWarning
           +-- SyntaxWarning
           +-- UserWarning
           +-- FutureWarning
           +-- ImportWarning
           +-- UnicodeWarning
           +-- BytesWarning
           +-- EncodingWarning
           +-- ResourceWarning
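One practical consequence of this hierarchy is that an except clause also matches subclasses, so catching a base class such as LookupError handles both KeyError and IndexError. A quick sketch:

data = {'a': 1}
try:
    value = data['missing']
except LookupError as e:       # also matches KeyError and IndexError
    print('lookup failed:', e)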
No, I didn’t truncate that last line; raising exceptions really cannot handle non-ASCII characters in a unicode string and will output an exception without the message if the message contains them. What happens if we try to use the handy dandy getwriter() trick to work around this?

Anytime you call a function you need to evaluate whether that function will do the right thing with str or unicode values. Sending the wrong value here will lead to a UnicodeError being thrown when the string contains non-ASCII characters.

If your code is heavily involved with using things that are bytes, you can do the opposite and convert all text into byte str at the border and only convert to unicode when you need it for passing to another library or performing string operations on it.

If you use codecs.getwriter() on sys.stderr, you’ll find that raising an exception with a byte str is broken by the default StreamWriter as well. Don’t do that or you’ll have no way to output non-ASCII characters. If you want to use a StreamWriter to encode other things on stderr while still having working exceptions, use kitchen.text.converters.getwriter().
>>> string = unicode(raw_input(), 'utf8')
café
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
>>> string = unicode(raw_input(), 'utf8')
café
>>> string_for_output = string.encode('utf8', 'replace')
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string_for_output)
>>>
$ python
>>> print u'café'
café
$ LC_ALL=C python
>>> # Note: if you're using a good terminal program when running in the C locale
>>> # The terminal program will prevent you from entering non-ASCII characters
>>> # python will still recognize them if you use the codepoint instead:
>>> print u'caf\xe9'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
print u'café'
$ ./test.py >t
Traceback (most recent call last):
File "./test.py", line 4, in <module>
print u'café'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'café'
$ ./test.py > t
$ cat t
café
We've now covered two of the three things you need to get a Unicode character to display properly:
1. The number (or codepoint) of the character has to be correct.
2. The character encoding you're using has to be correct.
3. The font you're using has to have a glyph for the character.
This means that all the numbers from 0 to 127 can be encoded using seven bits (0s or 1s). For instance, the space character, number 32, is:
010 0000
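You can check this from Python itself (a small sketch):

>>> ord(' ')                   # the codepoint of the space character
32
>>> format(ord(' '), '07b')    # the same number written as seven bits
'0100000'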
Below is a portion of the Unicode chart for Cyrillic. Each character has a number, a glyph (what it looks like onscreen), and a name (what it is). It's the name and number that define a unique Unicode character, not the glyph: a Cyrillic user writes Cyrillic capital A and an English user writes English A, even though these are in some sense the same character.
Basic Russian alphabet
0410 А CYRILLIC CAPITAL LETTER A
0411 Б CYRILLIC CAPITAL LETTER BE
        → 0183 ƃ latin small letter b with topbar
0412 В CYRILLIC CAPITAL LETTER VE
0413 Г CYRILLIC CAPITAL LETTER GHE
0414 Д CYRILLIC CAPITAL LETTER DE
0415 Е CYRILLIC CAPITAL LETTER IE
0416 Ж CYRILLIC CAPITAL LETTER ZHE
0417 З CYRILLIC CAPITAL LETTER ZE
0418 И CYRILLIC CAPITAL LETTER I
0419 Й CYRILLIC CAPITAL LETTER SHORT I
        ≡ 0418 И  0306 ◌̆
041A К CYRILLIC CAPITAL LETTER KA
041B Л CYRILLIC CAPITAL LETTER EL
041C М CYRILLIC CAPITAL LETTER EM
041D Н CYRILLIC CAPITAL LETTER EN
041E О CYRILLIC CAPITAL LETTER O
041F П CYRILLIC CAPITAL LETTER PE
0420 Р CYRILLIC CAPITAL LETTER ER
0421 С CYRILLIC CAPITAL LETTER ES
0422 Т CYRILLIC CAPITAL LETTER TE
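In Python, the unicodedata module exposes exactly this number/name mapping, so you can move between a character, its codepoint, and its name (a small sketch):

>>> import unicodedata
>>> ch = '\u0416'
>>> print(ch)
Ж
>>> hex(ord(ch))
'0x416'
>>> unicodedata.name(ch)
'CYRILLIC CAPITAL LETTER ZHE'
>>> '\N{CYRILLIC CAPITAL LETTER ZHE}'
'Ж'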
Here are some alchemical symbols:
Symbols for antimony, antimony ore and derivatives
1F72B ALCHEMICAL SYMBOL FOR ANTIMONY ORE
        = stibnite
        → 2641 ♁ earth
1F72C ALCHEMICAL SYMBOL FOR SUBLIMATE OF ANTIMONY
        → 1F739 alchemical symbol for sal-ammoniac
1F72D ALCHEMICAL SYMBOL FOR SALT OF ANTIMONY
        = cinnabar
        → 1F714 alchemical symbol for salt
1F72E ALCHEMICAL SYMBOL FOR SUBLIMATE OF SALT OF ANTIMONY
1F72F ALCHEMICAL SYMBOL FOR VINEGAR OF ANTIMONY
1F730 ALCHEMICAL SYMBOL FOR REGULUS OF ANTIMONY
        = antimony metal
1F731 ALCHEMICAL SYMBOL FOR REGULUS OF ANTIMONY-2
So what's a character encoding? In ASCII, we could write each character as a seven (or eight) bit number. Unicode has numbers up to 1114111, which would require 21 bits each. (For the same reason as we rounded up 7 to 8, in practice this would be rounded up to 32.) But this is an awful waste of space... here's the tutorial's example:
   P           y           t           h           o           n
0x50 00 00 00  79 00 00 00  74 00 00 00  68 00 00 00  6f 00 00 00  6e 00 00 00
   0  1  2  3   4  5  6  7   8  9 10 11  12 13 14 15  16 17 18 19  20 21 22 23
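You can see the cost directly by encoding the same string with different codecs; this is a small sketch comparing UTF-32 (four bytes per character) with UTF-8 (one byte per ASCII character):

>>> s = 'Python'
>>> len(s.encode('utf-32-le'))   # four bytes per character, no byte-order mark
24
>>> len(s.encode('utf-8'))       # one byte per ASCII character
6
>>> s.encode('utf-32-le')[:8]
b'P\x00\x00\x00y\x00\x00\x00'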
For instance, here's some Japanese text, as rendered by my terminal program:
24-1 <A4>
<CA>
<A4>
<C0> Ƚ<C4>
<EA>
<BB>
<EC> * Ƚ<C4>
<EA>
<BB>
<EC>
<A5>
<C0>
<CE>
<F3>
<B4>
<F0>
<CB>
<DC>Ϣ<C2>η<C1>
25-2 <A4>Τ<C7>
<A4>Τ<C0>
<BD>
<F5>ư<BB>
<EC> * <A5>ʷ<C1>
<CD>ƻ<EC>
<A5>
<C0>
<CE>
<U+E5FF7>
<CF>Ϣ<CD>ѥƷ<C1>
27-1 <A1>
<A2> * <C6>ü<EC>
<C6>
<C9>
<C5>
<C0> * *
* 6 8D
+ 6 8D
<rel type="<A2><E2>" target="ǯƬ" sid="950101008-001" tag="2" />
And here's some French:
D<E9>bats du S<E9>nat (hansard)
2e Session, 36e L<E9>gislature,
Volume 138, Num<E9>ro 53
Le mardi 9 mai 2000
L'honorable Gildas L. Molgat, Pr<E9>sident
Table des mati<E8>res
D<C9>CLARATIONS DE S<C9>NATEURS
Since most Unix tools default to UTF-8 these days, mojibake is generally a signal that your text is using one of these alternate encodings. Web browsers are somewhat more relaxed and can mostly deal with the ISO-Latin standard as well. Mine displays the French correctly, but for the Japanese text, it merely produces some slightly different-looking mojibake (created by interpreting the Japanese as if it were ISO-Latin1):
# S-ID:950101003-001 KNP:96/10/27 MOD:2005/03/08
* 0 26D
+ 0 1D
0-2 ¤à¤é¤ä¤Þ * ̾»ì ¿Í̾ * *
2-2 ¤È¤ß¤¤¤Á * ̾»ì ¿Í̾ * *
+ 1 37D
<rel type="=" target="¼»³ÉÙ»Ô" sid="950101003-001" tag="0" />
We'll start off with a very simple snippet of Python code, although one which is at the core of all your programs so far.
import sys

for line in open(sys.argv[1]):
    print(line.strip())
To try out the Latin1 encoding, modify the encoding argument to open:
for line in open(sys.argv[1], encoding="Latin1"):
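Putting it together, the whole modified program is just this (a sketch, assuming the input file really is Latin-1 encoded):

import sys

# Read the file as Latin-1 instead of the locale's default encoding.
for line in open(sys.argv[1], encoding="Latin1"):
    print(line.strip())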
Try out your new program. This program should produce correct output for Canadian Hansards:
débats de le Sénat (hansard)
1 ère Session, 36 e Législature,
volume 137, Numéro 100
le jeudi 3 décembre 1998
le honorable Gildas L. Molgat, Président
table de les matières
visiteur de marque
DÉCLARATIONS DE SÉNATEURS
son Honneur M. Pierre Bourque
maire de Montréal
There's one final issue between you and correctly processing international corpora. What about when you need to type a string directly into your own code? For instance, we have code like this:
if speakerGender == "m":
    # m for 'male'
    utterancesMen += 1
elif speakerGender == "f":
    # f for 'female'
    utterancesWomen += 1
If you couldn't, or didn't want to type the alchemical symbols in directly, you could also refer to them by name using the \N escape sequence. This is particularly useful when you don't have a font that renders the character correctly. In this case, typing it into your code would be very unclear, since you wouldn't be able to tell which character you were looking at:
>>> "\N{male sign}"
'♂' >>>
"\N{female sign}"
'♀' >>>
"\N{ALCHEMICAL SYMBOL FOR SUBLIMATE OF ANTIMONY}"
''