Unicode text processing (3.3)
In my case, I will be handling double-byte languages like Japanese and Chinese, so Unicode handling is mandatory. Section 3.3 of the whale book covers Unicode handling.
>>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
>>> import codecs
>>> f = codecs.open(path, encoding='latin2')
>>>
unicode_escape is a Python-specific "dummy" encoding. Non-ASCII characters are displayed as \uXXXX escapes, and characters whose code points are between 128 and 255 appear as \xXX.
>>> for line in f:
...     line = line.strip()
...     print line.encode('unicode_escape')
...
Pruska Biblioteka Pa\u0144stwowa. Jej dawne zbiory znane pod nazw\u0105
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk,
zosta\u0142y odnalezione po 1945 r. na terytorium Polski. Trafi\u0142y do
Biblioteki Jagiello\u0144skiej w Krakowie, obejmuj\u0105 ponad 500 tys.
zabytkowych archiwali\xf3w, m.in. manuskrypty Goethego, Mozarta,
Beethovena, Bacha.
>>>
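A quick check of the same behavior in Python 3 (a sketch with an assumed sample word, not the book's code):

```python
# Python 3 sketch: 'unicode_escape' renders non-ASCII characters as
# \uXXXX escapes, and code points 128-255 as \xXX.
word = 'Pa\u0144stwowa'                       # 'Państwowa'
escaped = word.encode('unicode_escape').decode('ascii')
print(escaped)                                # Pa\u0144stwowa
print('Niemc\xf3w'.encode('unicode_escape').decode('ascii'))  # Niemc\xf3w
```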
If a character is outside the ASCII range, it has to be encoded before it can be displayed.
>>> ord('a')
97
>>> a = u'\u0061'
>>> a
u'a'
>>> print a
a
>>> nacute = u'\u0144'
>>> nacute
u'\u0144'
>>> nacute_utf = nacute.encode('utf8')
>>> print repr(nacute_utf)
'\xc5\x84'
>>> print nacute_utf
ń
>>>
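The same round trip in Python 3, where the distinction is between str (always Unicode) and bytes (a sketch, not the book's code):

```python
# Python 3 sketch: str holds Unicode code points; encode() produces
# bytes in a given encoding (here UTF-8).
nacute = '\u0144'                 # 'ń'
nacute_utf = nacute.encode('utf8')
print(repr(nacute_utf))           # b'\xc5\x84'
print(nacute_utf.decode('utf8'))  # ń
```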
I understand this process, but it is not clear to me where "%r" and "%s" come from...
>>> import unicodedata
>>> lines = codecs.open(path, encoding='latin2').readlines()
>>> line = lines[2]
>>> print line.encode('unicode_escape')
Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n
>>> for c in line:
...     if ord(c) > 127:
...         print '%r U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
...
'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE
>>> for c in line:
...     if ord(c) > 127:
...         print '%s U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
...
ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą U+0105 LATIN SMALL LETTER A WITH OGONEK
ł U+0142 LATIN SMALL LETTER L WITH STROKE
>>>
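The same inspection works in Python 3 without the explicit encode, since str is already Unicode (a sketch using an assumed sample string):

```python
import unicodedata

# Python 3 sketch: name each non-ASCII character in a string.
line = 'niemc\xf3w pod koniec ii wojny \u015bwiatowej'
for c in line:
    if ord(c) > 127:
        print('%s U+%04x %s' % (c, ord(c), unicodedata.name(c)))
```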
Oops, shame on me. I searched again and got the following results.
%s formats the value as a string (via str())
%r formats the value with repr()
%04x formats the value as a four-digit, zero-padded hexadecimal number
Then this statement means:
print '%s U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
%s --> c.encode('utf8')
%04x --> ord(c)
%s --> unicodedata.name(c)
In the first example, %r was used in place of the first %s, which is why the output was the repr form, e.g. '\xc3\xb3'.
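The difference can be seen in a small Python 3 sketch (assumed values, not the book's data):

```python
# Python 3 sketch of the format specifiers discussed above:
# %s -> str(), %r -> repr(), %04x -> zero-padded hex.
c = '\u00f3'                       # 'ó'
print('%s U+%04x' % (c, ord(c)))   # ó U+00f3
print('%r U+%04x' % (c, ord(c)))   # 'ó' U+00f3
```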
Now it's clear. Let's continue.
>>> line.find(u'zosta\u0142y')
54
>>> line = line.lower()
>>> print line.encode('unicode_escape')
niemc\xf3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk, zosta\u0142y\n
>>> import re
>>> m = re.search(u'\u015b\w*', line)
>>> m.group()
u'\u015bwiatowej'
>>> nltk.word_tokenize(line)
[u'niemc\xf3w', u'pod', u'koniec', u'ii', u'wojny', u'\u015bwiatowej',
 u'na', u'dolny', u'\u015bl\u0105sk', u',', u'zosta\u0142y']
>>>
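The regex part carries over to Python 3, where \w matches accented letters by default (a sketch with an assumed sample line; the nltk call is omitted to keep it self-contained):

```python
import re

# Python 3 sketch: patterns and strings are Unicode by default, so
# \w matches letters such as '\u015b' ('ś') without extra flags.
line = 'niemc\xf3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk'
m = re.search('\u015b\\w*', line)   # first word starting with 'ś'
print(m.group())                    # światowej
```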
If it is clear that a specific encoding will be used throughout, this line can be added at the top of the source file:
# -*- coding: utf-8 -*-
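For example, a source file might look like this (a sketch; note that in Python 3 the declaration is only needed for encodings other than the default UTF-8):

```python
# -*- coding: utf-8 -*-
# Sketch: with this declaration, Python 2 source files may contain
# UTF-8 string literals directly; Python 3 assumes UTF-8 by default.
word = 'Pa\u0144stwowa'   # equivalently, the literal 'Państwowa'
print(word)
```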