Corpus with analyzed dependency structure (12.1.3)

Start by importing KNBC. Be careful: the sample in the textbook contains a few small mistakes.

>>> import re
>>> import nltk
>>> from nltk.corpus.reader.knbc import *
>>> from nltk.corpus.util import LazyCorpusLoader
>>> root = nltk.data.find('corpora/knbc/corpus1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\data.py", line 467, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource 'corpora/knbc/corpus1' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - 'C:\\Users\\xxxxxx/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Python27\\nltk_data'
    - 'C:\\Python27\\lib\\nltk_data'
    - 'C:\\Users\\xxxxxx\\AppData\\Roaming\\nltk_data'
**********************************************************************
>>> root = nltk.data.find('knbc/corpus1')

Note: If you still get an error, add the directory that contains the corpus to nltk.data.path.

>>> nltk.data.path.append('/cygdrive/c/Users/xxxxxxx/AppData/Roaming/nltk_data')
>>> root = nltk.data.find('knbc/corpus1')
>>>

Adjust the path to match the actual location of the files in your environment.
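To make clear why appending a directory helps, here is a minimal sketch of a search over a list of directories. It is only an illustration of the idea, not NLTK's actual implementation: `find_resource` is a hypothetical helper, and the temporary directory stands in for a real nltk_data folder.

```python
import os
import tempfile

def find_resource(resource, search_dirs):
    """Return the first existing <dir>/<resource> path, or None (hypothetical helper)."""
    for base in search_dirs:
        candidate = os.path.join(base, *resource.split('/'))
        if os.path.exists(candidate):
            return candidate
    return None

# Throwaway directory tree that stands in for a real nltk_data folder.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, 'knbc', 'corpus1'))

search_dirs = ['/no/such/dir']   # hypothetical default search path
missing = find_resource('knbc/corpus1', search_dirs)

search_dirs.append(data_dir)     # same idea as nltk.data.path.append(...)
found = find_resource('knbc/corpus1', search_dirs)

print(missing)
print(found)
```

The lookup fails until a directory that really contains knbc/corpus1 is on the list, which is the situation in the session above.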

>>> fileids = [f for f in find_corpus_fileids(FileSystemPathPointer(root), ".*")
...     if re.search(r"\d\-\d\-[\d]+\-[\d]+", f)]

This code gets the list of fileids. The regular expression must match the naming convention of the corpus files.
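To check what the regular expression keeps, it can be tried on a handful of names without touching the corpus. In this small sketch the first two names are copied from the fileids listing in this post, while 'README' and 'notes.txt' are made-up examples of files that should be filtered out.

```python
import re

# Same pattern as in the session: digit-digit-digits-digits.
pattern = re.compile(r"\d\-\d\-[\d]+\-[\d]+")

candidates = [
    'KN260_Kyoto_1/KN260_Kyoto_1-1-1-01',  # real fileid from the listing
    'KN260_Kyoto_1/KN260_Kyoto_1-1-8-01',  # real fileid from the listing
    'README',                              # hypothetical non-corpus file
    'KN260_Kyoto_1/notes.txt',             # hypothetical non-corpus file
]

# Keep only the names that contain the numeric suffix.
fileids = [f for f in candidates if pattern.search(f)]
print(fileids)
```

Only the names that end in the numeric suffix survive, so anything else in the corpus directory is ignored.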

>>> def _knbc_fileids_sort(x):
...     cells = x.split('-')
...     return (cells[0], int(cells[1]), int(cells[2]), int(cells[3]))
...
>>> knbc = LazyCorpusLoader('knbc/corpus1', KNBCorpusReader, sorted(fileids,
...     key=_knbc_fileids_sort), encoding='euc-jp')
>>> print knbc.fileids()
....
i_1/KN260_Keitai_1-1-8-03', 'KN260_Kyoto_1/KN260_Kyoto_1-1-1-01', 'KN260_Kyoto_1
/KN260_Kyoto_1-1-2-01', 'KN260_Kyoto_1/KN260_Kyoto_1-1-3-01', 'KN260_Kyoto_1/KN2
60_Kyoto_1-1-4-01', 'KN260_Kyoto_1/KN260_Kyoto_1-1-5-01', 'KN260_Kyoto_1/KN260_K
yoto_1-1-6-01', 'KN260_Kyoto_1/KN260_Kyoto_1-1-7-01', 'KN260_Kyoto_1/KN260_Kyoto
_1-1-8-01']

After sorting, display all fileids. Do we really need to go through so many steps just to use the corpus?
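The sort key converts the numeric parts of each fileid to integers. A small sketch shows why that matters. The ids below are invented for the illustration (the corpus listing above only shows single-digit values in that position), and the function is a copy of the key used in the session, renamed without the leading underscore.

```python
def knbc_fileids_sort(x):
    # Split 'name-a-b-c' and compare a, b, c as numbers, not as strings.
    cells = x.split('-')
    return (cells[0], int(cells[1]), int(cells[2]), int(cells[3]))

# Hypothetical ids, chosen so that one field has two digits.
ids = ['doc_1-1-10-01', 'doc_1-1-2-01', 'doc_1-1-9-02', 'doc_1-1-9-01']

plain = sorted(ids)                           # string comparison
numeric = sorted(ids, key=knbc_fileids_sort)  # numeric comparison

print(plain)
print(numeric)
```

With plain string sorting, '10' comes before '2'. With the key function the ids come out in numeric order.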

>>> print ' '.join(knbc.words()[:100])
［ 携帯 電話 ］ プリペイド カード 携帯 布教 。 もはや ’ 今さら ’ だ が 、 と
いう 接頭句 で 始める しか ない ほど 今さら だ が 、 私 は プリペイド 携帯 を ず
っと 使って いる 。 犯罪 に 用い られる など に より かなり イメージ を 悪化 さ
せて しまった プリペイド 携帯 だ が 、 一 ユーザー と して は 、 かなり 使いで
が ある 。 かつて は このような 話 を 友人 に 振って も 、 「 携帯 電話 の 料金
は 親 が 払って いる から 別に ．．． 」 と いう にべもない 答え が 返って くる
ばかりだった が
>>>

Access the tree structure. But the encoding problem appears again...

>>> print '\n\n'.join('%s' % tree for tree in knbc.parsed_sents()[0:2])
(蟶・蕗/縲・
  (髮ｻ隧ｱ/・ｽ ・ｻ/謳ｺ蟶ｯ)
  (謳ｺ蟶ｯ (繧ｫ繝ｼ繝・繝励Μ繝壹う繝・))

(菴ｿ縺｣縺ｦ/縺・ｋ/縲・
  (莉翫＆繧・縺&#63728;/縺・縲・
    (縺ｻ縺ｩ
      (蟋九ａ繧・縺励°/縺ｪ縺・
        繧ゅ・繧・
        (謗･鬆ｭ蜿･/縺ｧ (縺・≧ 窶・莉翫＆繧・窶・縺&#63728;/縺・縲・縺ｨ)))))
  遘・縺ｯ
  (謳ｺ蟶ｯ/繧・繝励Μ繝壹う繝・
  縺壹▲縺ｨ)
>>> print '\n\n'.join(u'%s' % tree for tree in knbc.parsed_sents()[0:2])
(布教/。
  (電話/］ ［/携帯)
  (携帯 (カード プリペイド)))

(使って/いる/。
  (今さら/だ/が/、
    (ほど
      (始める/しか/ない
        もはや
        (接頭句/で (いう ’/今さら/’/だ/が/、/と)))))
  私/は
  (携帯/を プリペイド)
  ずっと)
>>>

Note: This was tested in my Windows 7 environment. On Mountain Lion, I got the correct result even without adding "u" before '%s'.

Display the words with their POS tags.

>>> print '\n'.join(' '.join("%s/%s" % (w[0], w[1].split(' ')[2]) for w in sent)
...     for sent in knbc.tagged_sents()[0:20])
［/特殊 携帯/名詞 電話/名詞 ］/特殊 プリペイド/名詞 カード/名詞 携帯/名詞 布教/
名詞 。/特殊
もはや/副詞 ’/特殊 今さら/副詞 ’/特殊 だ/判定詞 が/助詞 、/特殊 と/助詞 いう/
動詞 接頭句/名詞 で/助詞 始める/動詞 しか/助詞 ない/形容詞 ほど/名詞 今さら/副詞
 だ/判定詞 が/助詞 、/特殊 私/名詞 は/助詞 プリペイド/名詞 携帯/名詞 を/助詞 ず
っと/副詞 使って/動詞 いる/接尾辞 。/特殊
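The expression w[1].split(' ')[2] picks the third space-separated field of each tag string. Here is a small sketch of that step in isolation, written in Python 3 syntax. The tag strings are made up for the illustration, and their layout (reading, base form, POS, then further fields) is an assumption inferred from the index used above, not something taken from the corpus documentation.

```python
# Illustrative (word, tag) pairs. The tag layout is assumed:
# reading, base form, POS, followed by further fields.
sent = [
    ('携帯', 'けいたい 携帯 名詞 6 普通名詞 1 * 0 * 0'),
    ('電話', 'でんわ 電話 名詞 6 普通名詞 1 * 0 * 0'),
    ('。', '。 。 特殊 1 句点 1 * 0 * 0'),
]

# Keep the surface form and the third field of the tag string.
line = ' '.join('%s/%s' % (word, tag.split(' ')[2]) for word, tag in sent)
print(line)
```

If the real tag strings are laid out differently in your copy of the corpus, the index 2 has to be changed accordingly.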

It is still not 100% clear to me in which cases the encoding problem occurs.
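One possible explanation, offered as an assumption rather than a confirmed diagnosis: with '%s' the tree is turned into a UTF-8 byte string, and a Japanese Windows console (code page 932) shows those bytes as if they were Shift-JIS, while a UTF-8 terminal on the Mac shows them correctly. The garbled 謳ｺ蟶ｯ in the output above lines up with 携帯 in the correct output, which fits this pattern. The sketch below reproduces the effect for that one word in Python 3 syntax.

```python
# Reproduce the suspected mix-up for a single word.
word = u'携帯'

utf8_bytes = word.encode('utf-8')     # what a byte string would contain
garbled = utf8_bytes.decode('cp932')  # the same bytes read as Shift-JIS (cp932)

# Undo the mistake: back to the bytes, then decode them as UTF-8.
restored = garbled.encode('cp932').decode('utf-8')

print(garbled)
print(restored)
```

If this is the cause, passing a unicode string to print (the u'%s' version) avoids the problem because the console does the encoding itself.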