Various corpora
Coming back to the O'Reilly textbook (Chapter 2.1.2-2.1.3).
Looking at Webtext:
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], '...'
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening? Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
The loop takes each fileid in turn and prints the first 65 characters of its raw text. What kind of conversation is going on between the White guy and the Asian girl in overheard.txt???
Next is a chatroom corpus. This gets the 124th post (index 123, since indexing starts at 0) from one of its files.
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
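By the way, this was not the first time I mistyped the term "fileid". I know it comes from "File ID", but written in lower case it looks a lot like "field", doesn't it? In the session that follows, the loop variable is "field" but the loop body still uses "fileid", which apparently still held an old value ('whitman-leaves.txt') from earlier work, so webtext was asked for a file it does not have.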
>>> for field in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], '...'
...
whitman-leaves.txt Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/plaintext.py", line 73, in raw
    return concat([self.open(f, sourced).read() for f in fileids])
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/api.py", line 187, in open
    stream = self._root.join(file).open(encoding)
  File "/Library/Python/2.7/site-packages/nltk/data.py", line 176, in join
    return FileSystemPathPointer(path)
  File "/Library/Python/2.7/site-packages/nltk/data.py", line 154, in __init__
    raise IOError('No such file or directory: %r' % path)
IOError: No such file or directory: '/Users/ken/nltk_data/corpora/webtext/whitman-leaves.txt'
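Here is a minimal sketch of how this kind of typo can end up asking for the wrong file. It assumes that `fileid` was still bound from an earlier loop over the Gutenberg files, which is only my guess from the path in the error message. The two lists are stand-ins rather than the real corpus listings, and the syntax is Python 3, so no NLTK is needed to run it.

```python
# Stand-in file lists (hypothetical; not read from NLTK).
gutenberg_ids = ['austen-emma.txt', 'whitman-leaves.txt']
webtext_ids = ['firefox.txt', 'grail.txt']

# An earlier loop leaves `fileid` bound to the last Gutenberg file.
for fileid in gutenberg_ids:
    pass

# The typo: the loop variable is `field`, but the body reads the stale `fileid`.
seen = []
for field in webtext_ids:
    seen.append(fileid)

# Every iteration used the leftover name, never a webtext file.
print(seen)
```

In a fresh interpreter the same typo would raise a NameError instead, which is easier to spot.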
Brown Corpus:
The complete list of Brown Corpus categories and samples is here:
http://icame.uib.no/brown/bcm-los.html
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
According to the list, fileid "cg22" is Kenneth Reiner's "Coping with Runaway Technology".
Next, display the frequency of specific words in one category.
>>> import nltk
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
>>> modals = ['what', 'why', 'when', 'which', 'who', 'how']
>>> for m in modals:
...     print m + ':', fdist[m],
...
what: 95 why: 14 when: 169 which: 245 who: 268 how: 42
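The counting step above can be sketched without NLTK or its corpus data. This is only an illustration in Python 3: `collections.Counter` stands in for `nltk.FreqDist`, and the token list is made up rather than taken from the Brown Corpus.

```python
from collections import Counter

# Made-up tokens standing in for brown.words(categories='news').
tokens = "The jury said it will act . It can and it will , but it may not .".split()

# Lowercase before counting, as in the session above.
fdist = Counter(w.lower() for w in tokens)

modals = ['can', 'could', 'may', 'might', 'must', 'will']
# A Counter returns 0 for words it has never seen, so absent modals are safe to look up.
counts = {m: fdist[m] for m in modals}
print(counts)
```

Lowercasing matters: without it, sentence-initial "It" and "it" would be counted as two different words.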
With ConditionalFreqDist, we can compare genres more easily.
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>>
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                  can could   may might  must  will
           news    93    86    66    38    50   389
       religion    82    59    78    12    54    71
        hobbies   268    58   131    22    83   264
science_fiction    16    49     4    12     8    16
        romance    74   193    11    51    45    43
          humor    16    30     8     8     9    13
>>> modals = ['what', 'why', 'when', 'which', 'who', 'how']
>>> cfd.tabulate(conditions=genres, samples=modals)
                 what   why  when which   who   how
           news    76     9   128   244   268    37
       religion    64    14    53   202   100    23
        hobbies    78    10   119   252   103    40
science_fiction    27     4    21    32    13    12
        romance   121    34   126   104    89    60
          humor    36     9    52    62    48    18
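To see what a conditional frequency distribution does with the (genre, word) pairs, here is a rough stand-in built from the standard library in Python 3. It is not how NLTK implements it. The pairs are made up, and the `tabulate` helper is a hypothetical name of my own that only imitates the table layout.

```python
from collections import Counter, defaultdict

# Made-up (genre, word) pairs standing in for the Brown Corpus generator.
pairs = [
    ('news', 'will'), ('news', 'will'), ('news', 'can'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'will'),
]

# One Counter per condition (genre), each counting that genre's words.
cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1

def tabulate(cfd, conditions, samples):
    """Hypothetical helper: return a small right-aligned count table as a string."""
    lines = [' ' * 8 + ''.join('%6s' % s for s in samples)]
    for c in conditions:
        lines.append('%8s' % c + ''.join('%6d' % cfd[c][s] for s in samples))
    return '\n'.join(lines)

table = tabulate(cfd, ['news', 'romance'], ['can', 'could', 'will'])
print(table)
```

Each row of the table is just one genre's own frequency distribution, restricted to the chosen sample words.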
Both tables are interesting. As the textbook mentions, "will" is the most frequent modal in news, whereas in romance it is "could". Among the wh-words, "who" is the most frequent in news, so news seems most interested in people, but it is not nearly as prominent in science_fiction. Maybe scientists are more interested in objects than in people?
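The genres differ a lot in size, so raw counts can mislead. One rough check is to look at the share of "who" among the six wh-words within each genre. The sketch below (Python 3) uses only the two rows copied from the table above. Dividing by the six-word total, rather than by the genre's full word count, is my own simplification.

```python
wh_words = ['what', 'why', 'when', 'which', 'who', 'how']

# Rows copied from the cfd.tabulate output above.
counts = {
    'news':            [76, 9, 128, 244, 268, 37],
    'science_fiction': [27, 4, 21, 32, 13, 12],
}

# Share of each wh-word among the six wh-words of its genre.
shares = {}
for genre, row in counts.items():
    total = sum(row)
    shares[genre] = {w: n / total for w, n in zip(wh_words, row)}

for genre in counts:
    print(genre, 'who: %.1f%%' % (100 * shares[genre]['who']))
```

By this measure "who" still accounts for a much larger share in news than in science_fiction, so the impression from the raw counts holds up.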