Access to text corpus - Deutschina's Tech Diary

Now start reading Chapter 2.1.1.

[code language="python"]
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concorance("surprize")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Text' object has no attribute 'concorance'
[/code]

What's wrong? Oops, it was just a typo: concorance should be concordance.

[code language="python"]
>>> emma.concordance("surprize")
Building index...
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
expected by the best judges , for surprize -- but there was great joy . Mr .
sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

[/code]

Another way is to import the corpus object directly, so the nltk.corpus prefix is no longer needed.

[code language="python"]
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
[/code]

Get various information from the texts.

[code language="python"]
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))    # characters, including spaces
...     num_words = len(gutenberg.words(fileid))  # tokens
...     num_sents = len(gutenberg.sents(fileid))  # sentences
...     num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))  # distinct lowercased tokens
...     print int(num_chars / num_words), int(num_words / num_sents), int(num_words / num_vocab), fileid
...
4 21 26 austen-emma.txt
4 23 16 austen-persuasion.txt
4 23 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 17 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 17 24 edgeworth-parents.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt
[/code]

Some explanation is needed. The first column of the output is calculated as:

Number of characters / Number of words

So this value is the average word length. Note that the character count also includes the spaces between words, so we should subtract 1 from the value. (The average length is actually 3.) The next one is:

Number of words / Number of sentences

Yes, this is the average number of words per sentence. The last one:

Number of words / Number of vocabulary items

This is the average number of times each distinct word (ignoring case) is used in the text.

raw() gives access to the raw contents of a file as a single string, instead of splitting it into tokens.
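To make the three ratios concrete, here is a minimal sketch on a tiny hand-made sample instead of a Gutenberg file. The names sample_raw, sample_sents, sample_words and text_stats are made up for illustration, and the tokens are written out by hand rather than produced by NLTK. It uses // so that it also runs on Python 3, where / no longer truncates.

```python
# Toy data standing in for gutenberg.raw(), gutenberg.sents() and gutenberg.words().
sample_raw = "I saw a cat . The cat saw me ."
sample_sents = [["I", "saw", "a", "cat", "."],
                ["The", "cat", "saw", "me", "."]]
sample_words = [w for sent in sample_sents for w in sent]


def text_stats(raw, words, sents):
    """Return (chars per word, words per sentence, words per vocabulary item)."""
    num_chars = len(raw)                            # spaces are counted too
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len(set(w.lower() for w in words))  # distinct lowercased tokens
    return (num_chars // num_words, num_words // num_sents, num_words // num_vocab)


stats = text_stats(sample_raw, sample_words, sample_sents)
print(stats)  # (3, 5, 1)

# Dropping the spaces shows how much they inflate the first ratio.
letters_only = len(sample_raw.replace(" ", ""))
print(letters_only / float(len(sample_words)))  # 2.1
```

With spaces included the sample gives 30 characters for 10 tokens. Without them it is 21 characters, which is why the first column overstates the real average word length.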

The following example retrieves the sentences of a text.

[code language="python"]
>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
>>> macbeth_sentences[1037]
['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']
>>> macbeth_sentences[2219]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/util.py", line 264, in __getitem__
    raise IndexError('index out of range')
IndexError: index out of range
>>> len(macbeth_sentences)
1907
>>> macbeth_sentences[1587]
['who', 'knowes', 'it', ',', 'when', 'none', 'can', 'call', 'our', 'powre', 'to', 'accompt', ':', 'yet', 'who', 'would', 'haue', 'thought', 'the', 'olde', 'man', 'to', 'haue', 'had', 'so', 'much', 'blood', 'in', 'him']
>>> longest_len = max([len(s) for s in macbeth_sentences])
>>> [s for s in macbeth_sentences if len(s) == longest_len]
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
[/code]

As shown, each sentence is represented as a list of tokens. The last two statements find the longest sentence in the text.
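The longest-sentence idiom works on any list of token lists, so it can be tried without the corpus. The sketch below uses a hypothetical toy_sentences list in the same list-of-lists shape. It also shows max() with key=len as a shortcut, which returns only the first longest sentence, whereas the filtering approach returns all ties.

```python
# Hypothetical tokenized sentences, shaped like the result of gutenberg.sents().
toy_sentences = [
    ["Actus", "Primus", "."],
    ["When", "shall", "we", "three", "meet", "againe", "?"],
    ["Good", "night", "."],
]

# Two-step approach: find the maximum length, then keep every sentence of that length.
longest_len = max(len(s) for s in toy_sentences)
longest = [s for s in toy_sentences if len(s) == longest_len]

# Shortcut when one sentence is enough: max() with key=len returns the first longest one.
first_longest = max(toy_sentences, key=len)

print(longest_len)              # 7
print(" ".join(first_longest))  # When shall we three meet againe ?
```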