How to handle Japanese with Python (12)

This post covers chapter 12 of the whale book. This chapter is only available in the Japanese edition. I will keep writing in English, as this might be helpful for other double-byte character languages.

Of course, I will continue with the other chapters (currently at chapter 4) in parallel.

Prepare to use Unicode.

>>> import nltk, re, sys
>>> import codecs
>>> from __future__ import division
>>> sys.stdout = codecs.getwriter('utf_8')(sys.stdout)
>>> sys.stdin = codecs.getreader('utf_8')(sys.stdin)
>>> f = codecs.open('something.txt', 'r', 'utf-8')
>>> print f
<open file 'something.txt', mode 'rb' at 0x0000000004183C90>
>>>
>>> print "%s で %s" % (u"パイソン", u"自然言語処理")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 3: ordinal
not in range(128)
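
The traceback above points at byte 0x82 in position 3 of the format string. The following is a minimal sketch of what is probably happening at the byte level, not a reproduction of the session itself. It assumes the console encodes typed text as Shift_JIS (cp932), which is only inferred from the 0x82 byte; `CONSOLE_ENCODING` and `first_undecodable` are hypothetical names introduced for illustration. The sketch is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Assumption: the console hands Python text encoded as Shift_JIS (cp932).
# This is inferred from the 0x82 byte in the traceback, not confirmed.
CONSOLE_ENCODING = "shift_jis"

# The bytes a plain "..." literal would hold in Python 2 on such a console.
template_bytes = "%s で %s".encode(CONSOLE_ENCODING)


def first_undecodable(data, encoding):
    """Return (position, byte value) of the first decode failure, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start, bytearray(data)[err.start]
    return None


# Mixing a byte-string template with unicode arguments makes Python 2
# decode the template with the default codec, which is 'ascii'.
ascii_failure = first_undecodable(template_bytes, "ascii")
print(ascii_failure[0], hex(ascii_failure[1]))  # 3 0x82
```

Under that assumption, the three ASCII bytes of `%s ` decode fine and the first byte of で is the one that fails, which matches the position and byte value reported in the traceback.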

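This error is expected: the format string is a byte string while the arguments are unicode, so Python 2 tries to decode the format string with the default 'ascii' codec. Next, change the default encoding.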

>>> sys.getdefaultencoding()
'ascii'
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
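
As an aside, changing the default encoding affects every implicit conversion in the process. A more local alternative, sketched here as an assumption rather than as the book's method, is to keep the template itself as Unicode and encode only once at the output boundary. The names `template`, `message` and `buffer` are illustrative, `io.BytesIO` stands in for a real file or console stream, and UTF-8 is an assumed target encoding. The sketch is meant to be saved as a UTF-8 script and should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io

# A Unicode template: no byte string takes part in the formatting,
# so no implicit decode with the default encoding is triggered.
template = u"%s で %s"
message = template % (u"パイソン", u"自然言語処理")

# Encode once, at the output boundary. io.BytesIO stands in for a
# file or console stream; "utf-8" is an assumed target encoding.
buffer = io.BytesIO()
buffer.write(message.encode("utf-8"))

# Decoding with the same codec gives back the original text.
round_trip = buffer.getvalue().decode("utf-8")
print(round_trip == message)  # True
```

The session below continues with the original byte-string template.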

Let's try again.

[code language="python"]
>>> print "%s で %s" % (u"パイソン", u"自然言語処理")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 3: invalid start byte