Unicode text processing (3.3)
In my case, I will be handling double-byte languages like Japanese and Chinese, so Unicode handling is mandatory. Section 3.3 of the whale book covers Unicode handling.
>>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
>>> import codecs
>>> f = codecs.open(path, encoding='latin2')
>>>
unicode_escape is a Python-specific "dummy" encoding. Non-ASCII characters are displayed as \uXXXX escapes, and characters whose code points are between 128 and 255 appear as \xXX.
>>> for line in f:
...     line = line.strip()
...     print line.encode('unicode_escape')
...
Pruska Biblioteka Pa\u0144stwowa. Jej dawne zbiory znane pod nazw\u0105
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk,
zosta\u0142y odnalezione po 1945 r. na terytorium Polski. Trafi\u0142y do
Biblioteki Jagiello\u0144skiej w Krakowie, obejmuj\u0105 ponad 500 tys.
zabytkowych archiwali\xf3w, m.in. manuskrypty Goethego, Mozarta,
Beethovena, Bacha.
>>>
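A quick check of the same behavior in Python 3 (a sketch with an assumed sample word, not the book's code):

```python
# Python 3 sketch: 'unicode_escape' renders non-ASCII characters as
# \uXXXX escapes, and code points 128-255 as \xXX.
word = 'Pa\u0144stwowa'                       # 'Państwowa'
escaped = word.encode('unicode_escape').decode('ascii')
print(escaped)                                # Pa\u0144stwowa
print('Niemc\xf3w'.encode('unicode_escape').decode('ascii'))  # Niemc\xf3w
```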
If a character is outside the ASCII range, it has to be encoded before it can be displayed.
>>> ord('a')
97
>>> a = u'\u0061'
>>> a
u'a'
>>> print a
a
>>> nacute = u'\u0144'
>>> nacute
u'\u0144'
>>> nacute_utf = nacute.encode('utf8')
>>> print repr(nacute_utf)
'\xc5\x84'
>>> print nacute_utf
ń
>>>
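The same round trip in Python 3, where the distinction is between str (always Unicode) and bytes (a sketch, not the book's code):

```python
# Python 3 sketch: str holds Unicode code points; encode() produces
# bytes in a given encoding (here UTF-8).
nacute = '\u0144'                 # 'ń'
nacute_utf = nacute.encode('utf8')
print(repr(nacute_utf))           # b'\xc5\x84'
print(nacute_utf.decode('utf8'))  # ń
```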
I understand this process, but it is not clear to me where "%r" and "%s" come from...
>>> import unicodedata
>>> lines = codecs.open(path, encoding='latin2').readlines()
>>> line = lines[2]
>>> print line.encode('unicode_escape')
Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n
>>> for c in line:
...     if ord(c) > 127:
...         print '%r U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
...
'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE
>>> for c in line:
...     if ord(c) > 127:
...         print '%s U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
...
ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą U+0105 LATIN SMALL LETTER A WITH OGONEK
ł U+0142 LATIN SMALL LETTER L WITH STROKE
>>>
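The same inspection works in Python 3 without the explicit encode, since str is already Unicode (a sketch using an assumed sample string):

```python
import unicodedata

# Python 3 sketch: name each non-ASCII character in a string.
line = 'niemc\xf3w pod koniec ii wojny \u015bwiatowej'
for c in line:
    if ord(c) > 127:
        print('%s U+%04x %s' % (c, ord(c), unicodedata.name(c)))
```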
Oops, shame on me. I searched again and got the following results.
%s formats the value as a string (via str())
%r formats the value with repr()
%04x formats the value as a four-digit, zero-padded hexadecimal number
Then this statement means:
print '%s U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
%s --> c.encode('utf8')
%04x --> ord(c)
%s --> unicodedata.name(c)
In the first example, %r was used in place of the first %s, which is why the output was the repr form, e.g. '\xc3\xb3'.
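The difference can be seen in a small Python 3 sketch (assumed values, not the book's data):

```python
# Python 3 sketch of the format specifiers discussed above:
# %s -> str(), %r -> repr(), %04x -> zero-padded hex.
c = '\u00f3'                       # 'ó'
print('%s U+%04x' % (c, ord(c)))   # ó U+00f3
print('%r U+%04x' % (c, ord(c)))   # 'ó' U+00f3
```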
Now it's clear. Let's continue.
>>> line.find(u'zosta\u0142y')
54
>>> line = line.lower()
>>> print line.encode('unicode_escape')
niemc\xf3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk, zosta\u0142y\n
>>> import re
>>> m = re.search(u'\u015b\w*', line)
>>> m.group()
u'\u015bwiatowej'
>>> nltk.word_tokenize(line)
[u'niemc\xf3w', u'pod', u'koniec', u'ii', u'wojny', u'\u015bwiatowej',
 u'na', u'dolny', u'\u015bl\u0105sk', u',', u'zosta\u0142y']
>>>
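The regex part carries over to Python 3, where \w matches accented letters by default (a sketch with an assumed sample line; the nltk call is omitted to keep it self-contained):

```python
import re

# Python 3 sketch: patterns and strings are Unicode by default, so
# \w matches letters such as '\u015b' ('ś') without extra flags.
line = 'niemc\xf3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk'
m = re.search('\u015b\\w*', line)   # first word starting with 'ś'
print(m.group())                    # światowej
```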
If it is clear that a specific encoding will be used throughout, this line can be added at the top of the source file:
# -*- coding: utf-8 -*-
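For example, a source file might look like this (a sketch; note that in Python 3 the declaration is only needed for encodings other than the default UTF-8):

```python
# -*- coding: utf-8 -*-
# Sketch: with this declaration, Python 2 source files may contain
# UTF-8 string literals directly; Python 3 assumes UTF-8 by default.
word = 'Pa\u0144stwowa'   # equivalently, the literal 'Państwowa'
print(word)
```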