Changing the case of letters in a Unicode string containing accented and locale-specific letters

Python 2, which the example below uses, has no support for locale-specific case mapping or the other rules in Unicode's SpecialCasing.txt. Python 3 applies the language-independent rules but still has no locale-specific ones. If you need those, PyICU provides them.

>>> import icu
>>> unicode(icu.UnicodeString(u'IK').toLower(icu.Locale('TR')))
u'ık'
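
If installing PyICU is not an option, the Turkish dotted and dotless i can be handled by hand in Python 3. The following is only a minimal sketch: `turkish_lower` and `turkish_upper` are hypothetical helper names, not part of Python or PyICU, and they cover just the two I/ı and İ/i pairs rather than full locale-aware casing.

```python
# Hypothetical helpers (not part of Python or PyICU): Turkish-aware case
# mapping that remaps only the dotted/dotless i pairs before the default
# locale-independent mapping runs.
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})
_TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish I/ı and İ/i pairings."""
    return text.translate(_TR_LOWER).lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish i/İ and ı/I pairings."""
    return text.translate(_TR_UPPER).upper()


print(turkish_lower("IK"))        # ık
print(turkish_upper("istanbul"))  # İSTANBUL
print("IK".lower())               # ik (the locale-independent default)
```

Remapping before calling `lower()` or `upper()` keeps the default algorithm in charge of every other character.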

Suggestion : 2

Q: Does uppercasing of a string eliminate all of the lowercase letters in it?

No. Some letters (notably those in the IPA block) have no matching case equivalent. As a result, uppercasing a string may not eliminate all of the lowercase letters in it.

Some characters have multiple characters that map to them. For example, in the Greek script, capital sigma (U+03A3) is the uppercase form of both the regular (U+03C3) and final (U+03C2) lowercase sigma.
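
The claim about letters without a case equivalent can be checked from Python 3. This is a rough sketch rather than a definitive test: it assumes the IPA Extensions block (U+0250 to U+02AF) is the block meant, and the exact count depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# IPA Extensions block, U+0250..U+02AF (assumed to be the block meant above).
ipa_block = [chr(cp) for cp in range(0x0250, 0x02B0)]

# Lowercase letters (general category Ll) that str.upper() leaves unchanged.
unchanged = [
    ch for ch in ipa_block
    if unicodedata.category(ch) == "Ll" and ch.upper() == ch
]

print(len(unchanged), "lowercase IPA letters have no uppercase mapping")
for ch in unchanged[:5]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```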

Near the end of SpecialCasing.txt, there are two commented-out lines pertaining to the Greek letter sigma. At first glance, they may look a bit odd:

# 03C3; 03C2; 03A3; 03A3; FINAL; # GREEK SMALL LETTER SIGMA
# 03C2; 03C3; 03A3; 03A3; NON_FINAL; # GREEK SMALL LETTER FINAL SIGMA
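
The effect of the two sigma forms can be observed in Python 3, whose `str.lower()` picks the final form at the end of a word. The sketch below assumes Python 3.3 or later; the sample words are only illustrations and are written with escapes to avoid look-alike Latin letters.

```python
# "ΟΔΟΣ" (Greek for "street") in capitals, spelled with escapes.
word_upper = "\u039f\u0394\u039f\u03a3"
word_lower = word_upper.lower()

# The trailing capital sigma becomes final sigma, U+03C2.
print([f"U+{ord(c):04X}" for c in word_lower])

# Both lowercase sigmas share one uppercase form, so the mapping is many-to-one.
assert "\u03c3".upper() == "\u03c2".upper() == "\u03a3"
assert word_lower.upper() == word_upper

# A sigma that does not end the word lowercases to the regular form, U+03C3.
sos_lower = "\u03a3\u039f\u03a3".lower()
print([f"U+{ord(c):04X}" for c in sos_lower])
```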

Suggestion : 3

You are free to choose a string encoding for internal use in your program. The choice pretty much boils down to either UTF-8, wide (4-byte) characters, or multibyte. Each has its advantages and disadvantages:

  • Most C code that deals with strings on a byte-by-byte basis still works, since UTF-8 is fully compatible with 7-bit ASCII.
  • With UTF-8, a missing or corrupt byte in transmission can only affect a single character; you can always find the start of the sequence for the next character just by scanning a couple of bytes.
  • Pro/Con: although you can use the syntax L"Hello, world." to easily include wide-character strings in C programs, the size of wide characters is not consistent across platforms (some incorrectly use 2-byte wide characters).

L"Hello, world."

int valid_identifier_start(char ch) {
   return ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'));
}
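
/* UTF-8-aware variant: also accept any byte >= 0xC0, which in valid
   UTF-8 is always the lead byte of a multi-byte sequence. */
int valid_identifier_start(char ch) {
   return ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') ||
      ((unsigned char) ch >= 0xC0));
}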
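
The resynchronization property mentioned above (finding the start of the next character by scanning a couple of bytes) can be sketched in Python. The function names here are hypothetical; the only fact relied on is that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_char_start(data: bytes, pos: int) -> int:
    """Return the first index >= pos where a character starts (or len(data))."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "aé€b".encode("utf-8")    # 61 | C3 A9 | E2 82 AC | 62
print(next_char_start(data, 2))  # 3: index 2 is the tail of 'é'
print(next_char_start(data, 4))  # 6: indexes 4-5 are the tail of '€'
```

Because no ASCII byte and no lead byte matches that pattern, a reader dropped into the middle of a stream skips at most three bytes before reaching a character boundary.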

Suggestion : 4

This HOWTO discusses Python’s support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.

The Unicode Consortium site has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. A chronology of the origin and development of Unicode is also available on the site. To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables.

The Unicode standard contains a lot of tables listing characters and their corresponding code points:

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ'; ROMAN NUMERAL EIGHT
2168    'Ⅸ'; ROMAN NUMERAL NINE
...
265E    '♞'; BLACK CHESS KNIGHT
265F    '♟'; BLACK CHESS PAWN
...
1F600   '😀'; GRINNING FACE
1F609   '😉'; WINKING FACE
...
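
Rows like these can be regenerated with the standard `unicodedata` module. The sketch below uses a hypothetical helper, `table_row`, and assumes the interpreter's Unicode database includes the emoji (true of any recent Python 3).

```python
import unicodedata


def table_row(ch: str) -> str:
    """Format one character as 'CODEPOINT 'char'; NAME' (hypothetical helper)."""
    return f"{ord(ch):04X} {ch!r}; {unicodedata.name(ch)}"


for ch in "ab{\u2167\u265e\U0001F600":
    print(table_row(ch))
```
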

   P           y           t           h           o           n
0x 50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
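
The byte layout above, four bytes per character with the low byte first, matches what Python produces for the little-endian UTF-32 codec. A small sketch, assuming `utf-32-le` is the encoding the diagram illustrates:

```python
text = "Python"

# Explicit little-endian codec, so no byte order mark is prepended.
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")

print(utf32.hex())            # 48 hex digits: four bytes per character
print(len(utf32), len(utf8))  # 24 6
```

The same six characters take 24 bytes in UTF-32 and 6 in UTF-8, and most of the UTF-32 bytes are zero.
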

try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")

répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'

>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
  invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

Suggestion : 5

If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.

The StringConverter program starts by creating a String containing Unicode characters. It then prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: the length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes.

The program and the output of its printBytes method follow. Note that only the first and last bytes, the A and C characters, are the same in both arrays:

String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
// When printed, original appears as: AêñüC
try {
   byte[] utf8Bytes = original.getBytes("UTF8");
   byte[] defaultBytes = original.getBytes();

   String roundTrip = new String(utf8Bytes, "UTF8");
   System.out.println("roundTrip = " + roundTrip);
   System.out.println();
   printBytes(utf8Bytes, "utf8Bytes");
   System.out.println();
   printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
   e.printStackTrace();
}
public static void printBytes(byte[] array, String name) {
   for (int k = 0; k < array.length; k++) {
      System.out.println(name + "[" + k + "] = " + "0x" +
         UnicodeFormatter.byteToHex(array[k]));
   }
}
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
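
For comparison with the Python material earlier on this page, the same conversion can be sketched in Python 3. One assumption to note: the Java program's default encoding is platform-dependent, and `latin-1` is used here only because it reproduces the defaultBytes values shown above. `hex_bytes` is a hypothetical helper.

```python
original = "A\u00ea\u00f1\u00fcC"   # AêñüC

utf8_bytes = original.encode("utf-8")
# Assumed stand-in for the platform default encoding in the Java output.
latin1_bytes = original.encode("latin-1")


def hex_bytes(data: bytes) -> list:
    """Render each byte as a 0x-prefixed, two-digit hex string."""
    return [f"0x{b:02x}" for b in data]


print(hex_bytes(utf8_bytes))    # 8 bytes: each accented letter takes two
print(hex_bytes(latin1_bytes))  # 5 bytes: one per character

round_trip = utf8_bytes.decode("utf-8")
print(round_trip == original)   # True
```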