Japanese corpus (12.1.1)
[code language="python"]
>>> import nltk
>>> from nltk.corpus.reader import *
>>> from nltk.corpus.reader.util import *
>>> from nltk.text import Text
>>>
>>> jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]')
>>> jp_chartype_tokenizer = nltk.RegexpTokenizer(u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[ぁ-んァ-ンー\u4e00-\u9FFF]+)')
>>> ginga = PlaintextCorpusReader("c:\Users\i006766\Desktop", r'yoru.txt',
...                               encoding='utf-8',
...                               para_block_reader=read_line_block,
...                               sent_tokenizer=jp_sent_tokenizer,
...                               word_tokenizer=jp_chartype_tokenizer)
>>> print ginga.raw()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\plaintext.py", line 73, in raw
    return concat([self.open(f, sourced).read() for f in fileids])
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\util.py", line 421, in concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
[/code]
The error message makes it look as if the file is blank. I guess the real cause of the problem is the location of the file.
[code language="python"]
>>> jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]')
>>> ginga = PlaintextCorpusReader("C:/Users/xxxxxxx/AppData/Roaming/nltk_data/corpora", r'yoru.txt',
...                               encoding='utf-8',
...                               para_block_reader=read_line_block,
...                               sent_tokenizer=jp_sent_tokenizer,
...                               word_tokenizer=jp_chartype_tokenizer)
[/code]
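As a side note, the two regular expressions can be sanity-checked on their own with the standard `re` module, without NLTK or a corpus file. This is a minimal sketch using made-up sample strings (the function names `split_sents` and `split_words` are just for illustration); it runs on Python 2 or 3:

```python
# -*- coding: utf-8 -*-
import re

# Sentence pattern: a run of characters that are not quotes/spaces/
# sentence-enders, followed by one sentence-ending mark (!, ? or 。).
SENT_PATTERN = u'[^ 「」!?。]*[!?。]'

# Character-type pattern: runs of hiragana, katakana, kanji, or mixed
# kana/kanji -- the same alternation as jp_chartype_tokenizer.
WORD_PATTERN = (u'([ぁ-んー]+|[ァ-ンー]+|'
                u'[\u4e00-\u9FFF]+|[ぁ-んァ-ンー\u4e00-\u9FFF]+)')

def split_sents(text):
    """Split text into sentences the way jp_sent_tokenizer would."""
    return re.findall(SENT_PATTERN, text)

def split_words(text):
    """Split text into character-type runs, like jp_chartype_tokenizer."""
    return re.findall(WORD_PATTERN, text)

print(split_sents(u'吾輩は猫である。名前はまだ無い。'))
print(split_words(u'銀河鉄道の夜'))
```

The word tokenizer simply groups consecutive characters of the same script, which is why the kanji run 銀河鉄道 comes out as one token and the hiragana の as another.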
Try again.
[code language="python"]
>>> print ginga.raw()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\plaintext.py", line 73, in raw
    return concat([self.open(f, sourced).read() for f in fileids])
  File "C:\Python27\lib\site-packages\nltk\data.py", line 848, in read
    chars = self._read(size)
  File "C:\Python27\lib\site-packages\nltk\data.py", line 1110, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python27\lib\site-packages\nltk\data.py", line 1140, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 0: invalid start byte
[/code]
Then another error... a UnicodeDecodeError??
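This error is exactly what happens when Shift-JIS bytes are fed to a strict UTF-8 decoder, which is what `PlaintextCorpusReader` does under the hood with `encoding='utf-8'`. A minimal reproduction (a sketch, runnable on Python 2 or 3):

```python
# -*- coding: utf-8 -*-
# Encode a kanji string as Shift-JIS, then try to read it back as UTF-8.
data = u'宮沢賢治'.encode('shift_jis')

try:
    data.decode('utf-8')  # strict decoding, like the corpus reader
except UnicodeDecodeError as e:
    # Shift-JIS kanji lead bytes are not valid UTF-8 start bytes,
    # so decoding fails immediately.
    print('decode failed at position %d' % e.start)
```

The `0x8b` mentioned in the traceback is a typical Shift-JIS lead byte, which UTF-8 rejects as an "invalid start byte".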
OMG! The text file was saved in Shift-JIS encoding. Convert it to UTF-8 and try again.
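I did the conversion with an editor, but it can also be done in a few lines of Python. A sketch (the function name `convert_encoding` and the file names are just illustrative; adjust the paths to your own setup):

```python
import codecs

def convert_encoding(src_path, dst_path,
                     src_enc='shift_jis', dst_enc='utf-8'):
    """Re-encode a text file from src_enc to dst_enc."""
    # Decode the whole file with the source encoding...
    with codecs.open(src_path, 'r', encoding=src_enc) as src:
        text = src.read()
    # ...and write it back out with the target encoding.
    with codecs.open(dst_path, 'w', encoding=dst_enc) as dst:
        dst.write(text)
    return text

# e.g. convert_encoding('yoru_sjis.txt', 'yoru.txt')
```

`codecs.open` works the same way on Python 2 and 3, which is why I use it here instead of the built-in `open`.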
[code language="python"]
>>> ginga = PlaintextCorpusReader("C:/Users/i006766/AppData/Roaming/nltk_data/corpora", r'yoru.txt',
...                               encoding='utf-8',
...                               para_block_reader=read_line_block,
...                               sent_tokenizer=jp_sent_tokenizer,
...                               word_tokenizer=jp_chartype_tokenizer)
>>> print ginga.raw()[0:200]
宮沢賢治 銀河鉄道の夜 銀河鉄道の夜 宮沢賢治 -目次 - +一、午后(ごご)の授業 +二、活版所 +三、家 +四、ケンタウル祭の夜 +五、天気輪(てんきりん)の柱 +六、銀河ステーション +七、北十字とプリオシン海岸 +八、鳥を捕(と)る人 +九、ジョバンニの切符(きっぷ) 一、午后(ごご)の
[/code]
Finally it succeeded. Let's try some samples.
[code language="python"]
>>> print '/'.join(ginga.words()[105:155])
 /一/、/午后/(/ごご/)/の/授業/ / /「/ではみなさんは/、/そういうふうに/川/だと/云/(/い/)/われたり/、/乳/の/流/れたあとだと/云/われたりしていたこのぼんやりと/白/いものがほんとうは/何/かご/承知/ですか/。」/先生/は/、/黒板/に/吊/(/つる/)/した/大/きな/黒/い
>>> ginga_t = Text(w.encode('utf-8') for w in ginga.words())
>>> ginga_t.concordance("川")
Building index...
No matches
[/code]
Nothing found.
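My guess at the reason: `ginga_t` was built from UTF-8 *byte* strings, but the `"川"` typed at a Windows console arrives as CP932 (Shift-JIS) bytes, so the two byte sequences can never compare equal. A quick check of that assumption (runnable on Python 2 or 3):

```python
# -*- coding: utf-8 -*-
# The same character yields different byte sequences in each encoding,
# so a byte-level comparison between them can never match.
utf8_kawa = u'川'.encode('utf-8')    # b'\xe5\xb7\x9d'
cp932_kawa = u'川'.encode('cp932')   # the bytes a Japanese Windows console sends
print(utf8_kawa == cp932_kawa)
```

This is why the concordance search comes up empty even though 川 clearly appears in the word list printed just above.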
[code language="python"]
>>> ginga_t
<Text: 螳ョ豐「雉「豐サ 驫豐ウ驩・% 縺ョ 螟・
驫豐ウ驩・%...>
[/code]
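That garbled repr is the same encoding mismatch seen from the other side: the Text object holds UTF-8 bytes, and the CP932 console renders them as mojibake. A sketch reproducing it, assuming the console decodes output as CP932 (the likely fix, untested here, would be to keep everything as unicode: `Text(ginga.words())` and `concordance(u'川')`):

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes of the author's name, misread as CP932 by the console:
# the result is the same kind of gibberish as in the Text repr above.
garbled = u'宮沢賢治'.encode('utf-8').decode('cp932')
print(garbled)
```

The 12 UTF-8 bytes of the four kanji get regrouped into 8 CP932 characters, none of which have anything to do with the original text.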