Japanese corpus (12.1.1)

>>> import nltk
>>> from nltk.corpus.reader import *
>>> from nltk.corpus.reader.util import *
>>> from nltk.text import Text
>>>
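>>> # jp_sent_tokenizer: a "sentence" is any run of characters up to and
>>> # including the next !, ?, or 。, without crossing spaces or the 「」 quote brackets
>>> # jp_chartype_tokenizer: split on character-type boundaries, keeping runs of
>>> # hiragana, katakana, or kanji together and grouping everything else
>>> # (punctuation, Latin, etc.) into its own runs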
>>> jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]')
>>> jp_chartype_tokenizer = nltk.RegexpTokenizer(u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[^ぁ-んァ-ンー\u4e00-\u9FFF]+)')
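>>> # PlaintextCorpusReader: root directory + filename pattern; read_line_block
>>> # treats each line as a paragraph, and the two tokenizers above handle
>>> # sentence and word segmentation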
>>> ginga = PlaintextCorpusReader("c:\Users\i006766\Desktop", r'yoru.txt',
...                             encoding='utf-8',
...                             para_block_reader=read_line_block,
...                             sent_tokenizer=jp_sent_tokenizer,
...                             word_tokenizer=jp_chartype_tokenizer)
>>> print ginga.raw()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\plaintext.py", line 73,
 in raw
    return concat([self.open(f, sourced).read() for f in fileids])
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\util.py", line 421, in
concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
>>>

The error message suggests that the reader found no files at all: concat() was handed an empty list, which means yoru.txt was not found under the given root. The file location is probably the problem.
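
Before retrying, a quick sanity check is to ask the reader which files it actually matched; given the concat() error, the answer has to come back empty:

>>> ginga.fileids()
[]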

>>> jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]')
>>> ginga = PlaintextCorpusReader("C:/Users/xxxxxxx/AppData/Roaming/nltk_data/corpora", r'yoru.txt',
...                             encoding='utf-8',
...                             para_block_reader=read_line_block,
...                             sent_tokenizer=jp_sent_tokenizer,
...                             word_tokenizer=jp_chartype_tokenizer)
>>>

Try again.

>>> print ginga.raw()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\plaintext.py", line 73,
 in raw
    return concat([self.open(f, sourced).read() for f in fileids])
  File "C:\Python27\lib\site-packages\nltk\data.py", line 848, in read
    chars = self._read(size)
  File "C:\Python27\lib\site-packages\nltk\data.py", line 1110, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python27\lib\site-packages\nltk\data.py", line 1140, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 0: invalid start byte
>>>

Another error... a UnicodeDecodeError? A byte of 0x8b can never start a UTF-8 sequence, so the file is evidently not UTF-8 at all.
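
One quick way to find out what the encoding actually is (assuming the third-party chardet package is installed) is to let it guess; for this file it should report Shift-JIS:

>>> import chardet
>>> print chardet.detect(open("C:/Users/xxxxxxx/AppData/Roaming/nltk_data/corpora/yoru.txt", "rb").read())['encoding']
SHIFT_JIS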

OMG! The text file was saved in Shift-JIS. Convert it to UTF-8 and try again.
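
The conversion takes a few lines with the codecs module (a minimal sketch; the source filename yoru_sjis.txt is just illustrative). Passing encoding='shift_jis' straight to PlaintextCorpusReader should also work, but here the file itself gets converted:

>>> import codecs
>>> sjis = codecs.open('yoru_sjis.txt', encoding='shift_jis')
>>> utf8 = codecs.open('yoru.txt', 'w', encoding='utf-8')
>>> utf8.write(sjis.read())
>>> utf8.close()
>>> sjis.close()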

>>> ginga = PlaintextCorpusReader("C:/Users/i006766/AppData/Roaming/nltk_data/corpora", r'yoru.txt',
...                             encoding='utf-8',
...                             para_block_reader=read_line_block,
...                             sent_tokenizer=jp_sent_tokenizer,
...                             word_tokenizer=jp_chartype_tokenizer)
>>> print ginga.raw()[0:200]
宮沢賢治 銀河鉄道の夜

銀河鉄道の夜

宮沢賢治



-目次
-
+一、午后(ごご)の授業

+二、活版所

+三、家

+四、ケンタウル祭の夜

+五、天気輪(てんきりん)の柱

+六、銀河ステーション

+七、北十字とプリオシン海岸

+八、鳥を捕(と)る人

+九、ジョバンニの切符(きっぷ)




一、午后(ごご)の

Finally I succeeded. Let's try some samples.

>>> print '/'.join(ginga.words()[105:155])

/一/、/午后/(/ごご/)/の/授業/
/
/「/ではみなさんは/、/そういうふうに/川/だと/云/(/い/)/われたり/、/乳/の/流/れたあとだと/云/われたりしていたこのぼんやりと/白/いものがほんとうは/何/かご/承知/ですか/。」/先生/は/、/黒板/に/吊/(/つる/)/した/大/きな/黒/い

>>> ginga_t = Text(w.encode('utf-8') for w in ginga.words())
>>> ginga_t.concordance("川")
Building index...
No matches

Nothing found. What does the Text object actually contain?

>>> ginga_t
<Text: 螳ョ豐「雉「豐サ 驫€豐ウ驩・% 縺ョ 螟・

驫€豐ウ驩・%...>
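
The garbled repr explains the "No matches": the tokens were stored as UTF-8 byte strings, but the Windows console works in CP932, so both the display and the 川 typed as a query are interpreted in the wrong encoding and never match the UTF-8 tokens. One fix that should work (a sketch, not verified in this log) is to keep everything as unicode, building the Text directly from the reader's words and querying with a unicode string:

>>> ginga_t = Text(ginga.words())
>>> ginga_t.concordance(u'川')

The concordance output itself still depends on the console being able to display the characters.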