Text processing with corpus (12.1.4) - Deutschina's Tech Diary

Get the number of words and the total length of all words.

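>>> from __future__ import division
>>> import re
>>> import nltk
>>> from nltk.corpus.reader.chasen import ChasenCorpusReader
>>> from nltk.text import Text
>>> genpaku = ChasenCorpusReader('C:/Users/xxxxxxx/AppData/Roaming/nltk_data/jeita', 'g.*chasen', 'utf-8')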
>>> print len(genpaku.words())
733016
>>>
>>> print sum(len(w) for w in genpaku.words())
1247143
>>> sum(len(w) for w in genpaku.words()) / len(genpaku.words())
1.7013857814836237
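
The average above comes out as a float only because true division is in effect. In Python 2 that requires `from __future__ import division`, which was presumably imported earlier in the session. A minimal sketch of the same calculation, written for Python 3 and using a made-up word list instead of the JEITA corpus:

```python
# Toy stand-in for genpaku.words(); the real corpus is not loaded here.
words = ["ねんねん", "や", "かぜ", "が", "ゆらす"]

num_words = len(words)                    # number of tokens
total_chars = sum(len(w) for w in words)  # total length of all tokens
avg_len = total_chars / num_words         # true division in Python 3

print(num_words, total_chars, avg_len)
```

Without true division, Python 2 would floor the result to an integer, so the average would lose its fractional part.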

>>> genpaku_vocab = set(w for w in genpaku.words() if re.match(
...     ur"^[ぁ-んーァ-ンー\u4e00-\u9FFF]+$", w))
>>>
>>> genpaku_t = Text(genpaku.words())
>>> print ' '.join(sorted(genpaku_vocab)[:20])
ぁ あ あぁ ああ あい あいさつ あいだ あいつ あいにく あいまい あう あえ あえい
あえぎ あえて あえなく あえる あお あおげ あおっ
>>> genpaku_wfd = nltk.FreqDist(genpaku_t)
>>> genpaku_wfd.tabulate(10)
 縲・ 縺ｮ  縺ｯ  縲・ 縺ｫ  繧・ 縺・ 縺ｦ  縺・ 縺ｧ
40153 38046 26823 25088 24699 22836 18040 17770 17044 13144
>>> genpaku_tfd = nltk.FreqDist(t[2] for (w, t) in genpaku.tagged_words())
>>> genpaku_tfd.tabulate(10)
         縲・ 縺ｮ  縺ｯ  縲・ 縺ｫ  繧・ 縺&#63728;  縺・ 縺ｦ
161726 40154 38046 26823 25088 24699 22836 19497 18040 17767

Encoding problem again...

>>> genpaku_t
<Text: 縺ｭ繧薙・繧・繧・縲&#128; 縺ゅ° 縺｡繧・ｓ 繧・縺ｭ繧薙・繧・繧・..>

The Text object that feeds FreqDist is already garbled like this. I have not found a workaround yet.
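
The garbled strings look like UTF-8 bytes being displayed as cp932 (the Japanese Windows code page). That is only a guess from the shape of the output, not something I confirmed in this session. A minimal Python 3 sketch that reproduces this kind of garbling; the helper names `to_mojibake` and `from_mojibake` are made up for illustration:

```python
def to_mojibake(text):
    """Encode as UTF-8, then (wrongly) decode those bytes as cp932."""
    return text.encode("utf-8").decode("cp932", errors="replace")


def from_mojibake(garbled):
    """Undo to_mojibake(); only works if no byte was lost on the way."""
    return garbled.encode("cp932").decode("utf-8")


# Particles whose UTF-8 bytes all happen to be decodable as cp932.
for particle in ["の", "は", "に", "て", "で"]:
    garbled = to_mojibake(particle)
    print(particle, "->", garbled, "->", from_mojibake(garbled))
```

If the middle column matches the headers printed by tabulate() above, the data itself is probably fine and only the display step is using the wrong encoding. Characters such as を do not survive the round trip, because one of their bytes has no valid cp932 interpretation on its own.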

>>> genpaku_wfd.plot(10)

The plot does not display Japanese text correctly with the default settings.

>>> print ' '.join(set(w for w,t in genpaku.tagged_words()
...     if t[0] == u"コウショウ"))

Nothing is returned. I stopped here and switched the environment to a Mac (Mountain Lion).

>>> genpaku_wfd = nltk.FreqDist(genpaku_t)
>>> genpaku_wfd.tabulate(10)
 、  の  は  。  に  を  た  て  が  で
40153 38046 26823 25088 24699 22836 18040 17770 17044 13144
>>> genpaku_tfd = nltk.FreqDist(t[2] for (w, t) in genpaku.tagged_words())
>>> genpaku_tfd.tabulate(10)
   	 、  の  は  。  に  を  だ  た  て
161726 40154 38046 26823 25088 24699 22836 19497 18040 17767

The result is far from the sample in the textbook. Why?

t[2] for (w, t) in genpaku.tagged_words()

This code picks up the 3rd character of the 2nd element of each tuple in genpaku.tagged_words(). Does that really make sense? Let's look inside genpaku.tagged_words().

>>> genpaku.tagged_words()[:20]
[(u'\u306d\u3093\u306d\u3093', u'\u30cd\u30f3\u30cd\u30f3\t\u306d\u3093\u306d\u3093\t\u526f\u8a5e-\u4e00\u822c'), 
....

As I guessed, each tuple has 2 elements and the second one is relatively long. The second element contains tabs (\t), so I figured the intent was to take the 3rd field after splitting on tabs. I changed the logic as follows:

>>> genpaku_tfd = nltk.FreqDist(t.split('\t')[2] for (w, t) in genpaku.tagged_words())
>>> genpaku_tfd.tabulate(10)
名詞-一般 動詞-自立 助詞-格助詞-一般 助動詞 記号-読点 助詞-係助詞 名詞-サ変接続 助詞-連体化 記号-句点 助詞-接続助詞
88053 74235 71789 67201 40826 33452 31904 30419 26699 25760
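
To make the difference concrete, here is a small Python 3 sketch that compares the two expressions on a handful of tuples. It uses collections.Counter as a stand-in for nltk.FreqDist so that it runs without the corpus. Only the first tuple comes from the tagged_words() output above; the other three are invented in the same layout.

```python
from collections import Counter

tagged = [
    # First tuple from the tagged_words() output above (ねんねん).
    ("\u306d\u3093\u306d\u3093",
     "\u30cd\u30f3\u30cd\u30f3\t\u306d\u3093\u306d\u3093\t\u526f\u8a5e-\u4e00\u822c"),
    # Hypothetical entries in the same tab-separated layout.
    ("かぜ", "カゼ\tかぜ\t名詞-一般"),
    ("が", "ガ\tが\t助詞-格助詞-一般"),
    ("は", "ハ\tは\t助詞-係助詞"),
]

# Textbook version: 3rd character of the whole tag string.
by_char = Counter(t[2] for (w, t) in tagged)

# Revised version: 3rd tab-separated field of the tag string.
by_field = Counter(t.split("\t")[2] for (w, t) in tagged)

print(by_char)
print(by_field)
```

With a two-character first field, t[2] is the tab itself, and with a one-character first field it is the first character of the second field. That would explain why the earlier table had a blank column followed by particles like の and は.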

The next example has the same problem. The sample in the textbook (the 1st one below) is incorrect; I got the expected result after revising it slightly (the 2nd one).

>>> print ' '.join(set(w for w, t in genpaku.tagged_words() if t[0] == u"コウショウ"))

>>> print ' '.join(set(w for w, t in genpaku.tagged_words() if t.split('\t')[0] == u"コウショウ"))
高尚 公娼 工匠 交渉
>>>
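
The same split can be used to group words by reading in one pass, instead of rescanning the corpus for every query. A rough Python 3 sketch: `build_reading_index` is a made-up helper, and the sample tuples are invented. Only the words 高尚 and 交渉 and the reading コウショウ come from the output above; the other fields are placeholders.

```python
from collections import defaultdict


def build_reading_index(tagged_words):
    """Map each reading (first tab-separated field) to its surface forms."""
    index = defaultdict(set)
    for word, tag in tagged_words:
        reading = tag.split("\t")[0]
        index[reading].add(word)
    return index


# Invented tuples; the second and third fields are placeholders,
# not real corpus values.
sample = [
    ("高尚", "コウショウ\t高尚\tPOS"),
    ("交渉", "コウショウ\t交渉\tPOS"),
    ("猫", "ネコ\t猫\tPOS"),
]

index = build_reading_index(sample)
print(" ".join(sorted(index["コウショウ"])))
```

With the real corpus, passing genpaku.tagged_words() to the helper should give the same set as the revised one-liner above, assuming every tag string follows the tab-separated layout.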

Get collocations and generate random sentences.

>>> genpaku_t.collocations()
Building collocations list
オープン ソース; コール バック; インター フェイス; Red Hat; Teddy bear; フリー ソフトウェア; science
fiction; Belle Epoque; 多かれ 少なかれ; シェーン ベルク; attribute name; ミドル ウェア;
package com; ソース コード; ミルキー ホワイト; GNU システム; import org; フリー ソフト; GNU
プロジェクト; あちら こちら
>>> genpaku_t.generate()
Building ngram index...
ねんねん や かぜ が ゆらす と 　 ゆりか ご ゆれる え だ 。 私 たち は 口々 に 噂 し た 目 で 観察 し て 、 そちら
の 条件 に関する 、 いかなる 哲学 的 問題 、 つまり は 動脈 内 に ある の なら 、 こんな ろ く で も 、 感情 的
内容 は 、 技術 オンチ の ため に この 句 を 冷静 に 名 を 指定 し ます 。 そして 前 の われわれ の 権利 を 要求
する もの です 。 東方 の 賢者 は 高価 な 贈り物 を する に とどめよ う 。 労働 こそ
>>>

Maybe this corpus does not consist only of texts by Genpaku Sugita (a famous Japanese medical scholar from the late 18th to the early 19th century) but also includes other, newer texts. Otherwise, why would very new terms like "open source" show up?

The last item in this article: finding similar words.

>>> genpaku_t.similar(u"ソフトウェア")
Building word-context index...
ソフト 人 彼 私 それ 彼女 僕 人間 労働 彼ら 男 これ ぼく わたし 仕事 目 自分 あなた 世界 今

As described in the textbook, this might not be very useful for Japanese.