Various corpora - Deutschina's Tech Diary

Coming back to the O'Reilly textbook (Chapter 2.1.2-2.1.3).

Looking at Webtext:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], '...'
... 
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
>>>

The loop gets each fileid and then prints the first 65 characters of that file's raw text. What kind of conversation between a White guy and an Asian girl is in overheard.txt???
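The same loop can be wrapped in a small function. This is only a minimal sketch in Python 3 syntax (print as a function), not the textbook's code. The names `preview_corpus` and `TinyCorpus` are hypothetical: `TinyCorpus` is a stand-in I made up so that the example runs without any corpus data. Any object offering `fileids()` and `raw(fileid)`, such as `webtext`, should be usable in its place.

```python
class TinyCorpus(object):
    """Hypothetical stand-in exposing the two reader methods used above."""

    def __init__(self, files):
        self._files = files  # maps fileid -> raw text

    def fileids(self):
        return sorted(self._files)

    def raw(self, fileid):
        return self._files[fileid]


def preview_corpus(corpus, width=65):
    """Return one 'fileid first-characters ...' line per file in the corpus."""
    lines = []
    for fileid in corpus.fileids():
        # Slicing never fails, even when the text is shorter than `width`.
        lines.append('%s %s ...' % (fileid, corpus.raw(fileid)[:width]))
    return lines


demo = TinyCorpus({
    'a.txt': 'x' * 100,
    'b.txt': 'short text',
})
for line in preview_corpus(demo):
    print(line)
```

Returning the lines instead of printing them inside the function makes the preview easy to reuse, for example to write it to a file.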

Next is a corpus of chatroom conversations. Getting the 124th post (index 123, as the index starts from 0) from the corpus:

>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

By the way, this was not the first time I made a typo when typing the term "fileid". I understand that it comes from "File ID", but written in lower case it looks like "field", doesn't it?

>>> for field in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], '...'
... 
whitman-leaves.txt
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/plaintext.py", line 73, in raw
    return concat([self.open(f, sourced).read() for f in fileids])
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/api.py", line 187, in open
    stream = self._root.join(file).open(encoding)
  File "/Library/Python/2.7/site-packages/nltk/data.py", line 176, in join
    return FileSystemPathPointer(path)
  File "/Library/Python/2.7/site-packages/nltk/data.py", line 154, in __init__
    raise IOError('No such file or directory: %r' % path)
IOError: No such file or directory: '/Users/ken/nltk_data/corpora/webtext/whitman-leaves.txt'
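Why an IOError and not a NameError? The loop variable was `field`, but the body still used `fileid`, and Python happily reused the value that `fileid` kept from an earlier loop (presumably one over another corpus, since whitman-leaves.txt is not a webtext file). A minimal sketch of the same mistake with plain lists, in Python 3 syntax; the function name and the list contents are made up for illustration:

```python
def stale_name_demo():
    """Reproduce the typo with plain lists instead of real corpora."""
    first_ids = ['austen-emma.txt', 'whitman-leaves.txt']
    second_ids = ['firefox.txt', 'grail.txt']

    for fileid in first_ids:
        pass  # after this loop, fileid keeps its last value

    seen = []
    for field in second_ids:   # typo: the loop variable is now `field`
        seen.append(fileid)    # ...so this reads the stale `fileid`
    return seen


print(stale_name_demo())
```

Every pass of the second loop sees 'whitman-leaves.txt', which matches the file name in the traceback above.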

Brown Corpus:

The full list of Brown Corpus categories is here:
http://icame.uib.no/brown/bcm-los.html

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
>>>

According to the list, the fileid "cg22" is Kenneth Reiner's "Coping with Runaway Technology".

Displaying the frequency of specific words in a category:

>>> import nltk
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
... 
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
>>> 
>>> wh_words = ['what', 'why', 'when', 'which', 'who', 'how']
>>> for w in wh_words:
...     print w + ':', fdist[w],
... 
what: 95 why: 14 when: 169 which: 245 who: 268 how: 42
>>>
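To make clear what the counting does, here is a minimal sketch with `collections.Counter` from the standard library standing in for `nltk.FreqDist`. The sample sentence and the name `count_words` are made up for illustration; the real numbers above come from the Brown corpus. It is written in Python 3 syntax.

```python
from collections import Counter


def count_words(words, targets):
    """Count lowercased words and return (target, count) pairs in order."""
    counts = Counter(w.lower() for w in words)
    # A Counter returns 0 for words it has never seen, so no KeyError here.
    return [(t, counts[t]) for t in targets]


sample = 'Will you go ? I will if I can , but I may not .'.split()
for word, n in count_words(sample, ['can', 'may', 'will', 'must']):
    print(word + ':', n)
```

Because the words are lowercased first, "Will" at the start of the sentence and "will" in the middle are counted together.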

With ConditionalFreqDist, we can compare the genres more easily.

>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
... 
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13
>>> wh_words = ['what', 'why', 'when', 'which', 'who', 'how']
>>> cfd.tabulate(conditions=genres, samples=wh_words)
                what  why when which  who  how
           news   76    9  128  244  268   37
       religion   64   14   53  202  100   23
        hobbies   78   10  119  252  103   40
science_fiction   27    4   21   32   13   12
        romance  121   34  126  104   89   60
          humor   36    9   52   62   48   18
>>>
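Conceptually, a ConditionalFreqDist is one frequency distribution per condition (here, per genre). The sketch below shows that idea using only the standard library, in Python 3 syntax. `build_cfd`, `tabulate` and the toy (genre, word) pairs are hypothetical and are not NLTK's implementation. Note also that the words fed to the ConditionalFreqDist above are not lowercased, which is presumably why the news row differs slightly from the earlier FreqDist output (93 vs. 94 for "can").

```python
from collections import Counter, defaultdict


def build_cfd(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


def tabulate(cfd, conditions, samples):
    """Return the table as rows of text, one row per condition."""
    width = max(len(c) for c in conditions)
    rows = [' ' * width + ''.join(' %5s' % s for s in samples)]
    for c in conditions:
        rows.append(c.rjust(width) + ''.join(' %5d' % cfd[c][s] for s in samples))
    return rows


# Toy data: 'Could' is capitalized on purpose to show case sensitivity.
pairs = [
    ('news', 'will'), ('news', 'will'), ('news', 'can'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'Could'),
]
cfd = build_cfd(pairs)
for row in tabulate(cfd, ['news', 'romance'], ['can', 'could', 'will']):
    print(row)
```

In the toy data, 'Could' and 'could' end up as separate samples, so the romance row shows 2 rather than 3 for "could".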

Both tables are interesting. As mentioned in the textbook, "will" is the most frequent modal in news, whereas "could" is the most frequent in romance. In news, "who" is the most frequent of the wh-words, so people seem to be the main interest, but it is not nearly as prominent in science_fiction. Maybe scientists are more interested in objects than in people?