How to handle Japanese with Python (12)
This post covers chapter 12 of the whale book. This chapter is only available in the Japanese edition. I am still writing in English because this might be helpful for other double-byte character languages.
Of course, I will continue with the other chapters (currently at chapter 4) in parallel.
Prepare to use Unicode.
>>> import nltk, re, sys
>>> import codecs
>>> from __future__ import division
>>> sys.stdout = codecs.getwriter('utf_8')(sys.stdout)
>>> sys.stdin = codecs.getreader('utf_8')(sys.stdin)
>>> f = codecs.open('something.txt', 'r', 'utf-8')
>>> print f
<open file 'something.txt', mode 'rb' at 0x0000000004183C90>
>>>
>>> print "%s で %s" % (u"パイソン", u"自然言語処理")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 3: ordinal not in range(128)
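This error is expected: the format string is a byte string, so Python 2 implicitly decodes it with the default ASCII codec when it is combined with the unicode arguments. Next, change the default encoding.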
>>> sys.getdefaultencoding()
'ascii'
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
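Before retrying, it may help to see where byte 0x82 in the first traceback comes from. The following is only a minimal sketch, not part of the book: it assumes the console passes typed text to Python as cp932 (Shift_JIS), which is my guess based on the byte value and position in the error. The name `ASSUMED_CONSOLE_ENCODING` and the helper `try_decode` are made up for illustration. The sketch is written as a standalone script rather than a REPL session.

```python
# -*- coding: utf-8 -*-
# Assumption: the console encodes typed text as cp932 (Shift_JIS).
# This is a guess based on byte 0x82 appearing at position 3.
ASSUMED_CONSOLE_ENCODING = "cp932"

# The bytes a byte-string literal "%s で %s" would hold under that assumption.
format_bytes = u"%s で %s".encode(ASSUMED_CONSOLE_ENCODING)


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as err:
        return None, err


# Decoding with the ASCII codec stops at the first non-ASCII byte.
ascii_text, ascii_error = try_decode(format_bytes, "ascii")

# Decoding with the encoding the bytes were written in succeeds.
console_text, console_error = try_decode(format_bytes, ASSUMED_CONSOLE_ENCODING)

print(repr(format_bytes))
print("ascii decoding failed at position %d" % ascii_error.start)
```

Under this assumption, "%s " occupies positions 0 to 2 and the first byte of で lands at position 3, which matches the traceback above.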
Let's try again.
[code language="python"]
>>> print "%s で %s" % (u"パイソン", u"自然言語処理")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 3: invalid start byte
[/code]
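The message changed from "ordinal not in range(128)" to "invalid start byte", which suggests the bytes of the format string are not UTF-8 at all, so switching the default encoding to UTF-8 cannot decode them either. Here is a hedged sketch of two ways around this. It again assumes the console bytes are cp932 (my assumption, not something the book states), and it is written as a standalone script with a declared source encoding rather than as a REPL session.

```python
# -*- coding: utf-8 -*-
# Assumption: these are the bytes the console produced for "%s で %s",
# encoded as cp932 (Shift_JIS).
format_bytes = u"%s で %s".encode("cp932")

# Decoding them as UTF-8 fails, because 0x82 cannot start a UTF-8 sequence.
try:
    format_bytes.decode("utf-8")
    utf8_reason = None
except UnicodeDecodeError as err:
    utf8_reason = err.reason

# Option 1: decode the byte string explicitly with its real encoding
# before formatting, instead of relying on the default encoding.
format_text = format_bytes.decode("cp932")
message = format_text % (u"パイソン", u"自然言語処理")

# Option 2: make the format string a unicode literal in the first place,
# so no implicit decoding happens during formatting.
message_from_literal = u"%s で %s" % (u"パイソン", u"自然言語処理")

print(utf8_reason)
print(message == message_from_literal)
```

Either option keeps every operand as unicode text, so the result no longer depends on `sys.setdefaultencoding`.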