Using NLTK with Cygwin

Chapter 12 of the Japanese edition of the whale book describes how to handle Japanese text. It works well in my Mac environment (Mountain Lion); however, I ran into several character corruption issues in my Windows 7 environment. In Japanese, this kind of problem is usually called "mojibake".
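To show what mojibake means in practice, here is a minimal sketch. It assumes the text is encoded as UTF-8 and that the reading side decodes it as cp932, the code page commonly used on Japanese Windows. That pairing is my assumption for illustration; other mismatched encodings corrupt text in the same way.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Sample text "日本語", written with escapes so the file itself stays ASCII-safe.
text = "\u65e5\u672c\u8a9e"

# The bytes a UTF-8 file or terminal would produce for this text.
utf8_bytes = text.encode("utf-8")

# Assumption: the reader decodes with cp932 instead of UTF-8.
# Undecodable bytes are replaced rather than raising an error.
garbled = utf8_bytes.decode("cp932", "replace")

# Decoding with the matching encoding restores the original text.
restored = utf8_bytes.decode("utf-8")

print(garbled != text)   # the mismatched decode no longer equals the original
print(restored == text)  # the matching decode round-trips cleanly
```

The bytes themselves are never damaged. Only the interpretation is wrong, which is why fixing the encoding on the input and output streams is enough.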

I have been using the standard command prompt so far, but I happened to come across a blog whose author uses Cygwin. After installing Cygwin, I needed to make some adjustments to use NLTK. This time I did not change any environment variables in Windows; I just created the following file named nltk_init.py.

[code language="python"]
# -*- coding: utf-8 -*-
from __future__ import division
import sys
sys.path.append('/cygdrive/c/python27')
sys.path.append('/cygdrive/c/python27/dlls')
sys.path.append('/cygdrive/c/python27/lib')
sys.path.append('/cygdrive/c/python27/lib/plat-win')
sys.path.append('/cygdrive/c/python27/lib/lib-tk')
sys.path.append('/cygdrive/c/python27/lib/site-packages')

import nltk
import re, pprint
import codecs
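# reload(sys) restores setdefaultencoding(), which site.py removes at startup.
# It also resets sys.stdout and sys.stdin, so it must run before they are wrapped.
reload(sys)
sys.setdefaultencoding('utf-8')
# Wrap the standard streams so that text is written and read as UTF-8.
sys.stdout = codecs.getwriter('utf_8')(sys.stdout)
sys.stdin = codecs.getreader('utf_8')(sys.stdin)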
# Add the Windows-side nltk_data directory to NLTK's data search path.
nltk.data.path.append('/cygdrive/c/Users/xxxxxx/AppData/Roaming/nltk_data')
[/code]