Accessing text sources (3.1.1)

Now starting Chapter 3 of the whale book.

Let's fetch "Crime and Punishment" from Project Gutenberg.

>>> from __future__ import division
>>> import nltk, re, pprint
>>> 
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176893
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
>>> 

The last command extracts the first 75 characters (or bytes, since Python 2's str is a byte string) of the raw text.
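The session above is Python 2. In Python 3, urlopen() lives in urllib.request and read() returns bytes, which must be decoded before use. A minimal sketch (the network call itself is commented out; the byte string below is a stand-in for the downloaded data):

```python
from urllib.request import urlopen  # Python 3 location of urlopen

url = "http://www.gutenberg.org/files/2554/2554.txt"
# raw = urlopen(url).read().decode("utf-8")   # actual download, skipped here

# The same decode step, demonstrated on a local byte string:
raw_bytes = b"The Project Gutenberg EBook of Crime and Punishment\r\n"
raw = raw_bytes.decode("utf-8")   # bytes -> str
```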

Note: in an environment that requires a proxy, one needs to be set explicitly.

>>> proxies = {'http': 'http://www.someproxy.com:3128'}
>>> raw = urlopen(url, proxies=proxies).read()
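The `proxies` keyword above is Python 2 only. In Python 3, urlopen() no longer accepts it; a proxy is configured through a ProxyHandler-based opener instead. A sketch (the proxy address is the same placeholder as above, and the request itself is commented out):

```python
import urllib.request

# Placeholder proxy address, as in the note above -- not a real server.
proxies = {"http": "http://www.someproxy.com:3128"}

proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)
# urllib.request.install_opener(opener)            # make it the default, if desired
# raw = opener.open(url).read().decode("utf-8")    # actual request, skipped here
```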

Tokenize the raw text to split it into a list of tokens.

>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
244484
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
>>> 
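nltk.word_tokenize() does more than simple splitting. As a rough, dependency-free approximation (my own regex sketch, not NLTK's actual algorithm), separating word characters from punctuation already reproduces the first ten tokens above:

```python
import re

def simple_tokenize(text):
    # Words, or single non-space punctuation marks, as separate tokens.
    # A crude stand-in for nltk.word_tokenize, not its real implementation.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The Project Gutenberg EBook of Crime and Punishment, by"
print(simple_tokenize(sample))
```

Note how the comma after "Punishment" comes out as its own token, matching the transcript.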

Now create an NLTK Text object from the tokens and play with it.

>>> text = nltk.Text(tokens)
>>> type(text)
<class 'nltk.text.Text'>
>>> text[1020:1060]
['had', 'successfully', 'avoided', 'meeting', 'his', 'landlady', 'on', 'the', 'staircase.', 'His', 'garret', 'was', 'under', 'the', 'roof', 'of', 'a', 'high', ',', 'five-storied', 'house', 'and', 'was', 'more', 'like', 'a', 'cupboard', 'than', 'a', 'room.', 'The', 'landlady', 'who', 'provided', 'him', 'with', 'garret', ',', 'dinners', ',']

The position is different from the whale book's example. Let's locate the same passage.

>>> text.index('CHAPTER')
982
>>> text[982:1022]
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge.', 'He', 'had', 'successfully']
>>> 

This difference likely comes from how the text was tokenised. In the textbook, full stops at the ends of sentences are treated as independent tokens; in my run, they are attached to the last word of each sentence instead (e.g. "bridge."). That is why my slice contains three extra words: the three sentence-final full stops that the book counts as separate tokens are merged into the preceding words here.
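The effect can be shown on a short fragment of the passage: plain whitespace splitting keeps the full stop glued to the word, while a tokenizer that isolates punctuation emits one extra token per sentence end (the regex below is my own sketch, not NLTK's tokenizer):

```python
import re

fragment = "towards K. bridge. He had successfully"

# Whitespace splitting: the full stop stays attached ('bridge.').
glued = fragment.split()

# Punctuation-splitting: each '.' becomes its own token, so the count grows.
split_dots = re.findall(r"\w+|[^\w\s]", fragment)

print(glued)
print(split_dots)
```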

Need to look into the details of tokenisation later.

How about collocations()?

>>> text.collocations()
Building collocations list
Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Marfa Petrovna; Rodion Romanovitch; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Project Gutenberg; Andrey Semyonovitch; Nikodim Fomitch;
young man; Dmitri Prokofitch; n't know; Ilya Petrovitch; Good heavens
>>> 
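collocations() ranks word pairs that occur together unusually often. A very crude approximation (raw bigram counting with collections.Counter; NLTK additionally applies frequency filters and association measures like PMI, which this sketch does not) looks like:

```python
from collections import Counter

def frequent_bigrams(tokens, n=3):
    # Count adjacent token pairs -- a rough stand-in for text.collocations().
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)

# Tiny made-up token list for illustration only.
tokens = ["old", "woman", "said", "the", "old", "woman", "to", "the", "young", "man"]
print(frequent_bigrams(tokens))
```

Here ("old", "woman") surfaces as the most frequent pair, much as "old woman" does in the real output above.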

find() and rfind()

>>> raw.find("PART I")
5338
>>> raw.rfind("End of Project Gutenberg's Crime")
1157743
>>> raw = raw[5338:1157681]
>>> raw.find("PART I")
0

This technique can be used to "cut off" unnecessary information from the source.
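The same trimming idea, shown self-contained on a tiny fake text (the markers and content below are made up, not the real Gutenberg file):

```python
# A miniature stand-in for the downloaded file: header, body, footer.
raw = ("*** Gutenberg header ***\n"
       "PART I\n"
       "CHAPTER I On an exceptionally hot evening a young man came out\n"
       "End of Project Gutenberg's Crime and Punishment\n")

start = raw.find("PART I")         # first occurrence: start of the body
end = raw.rfind("End of Project")  # last occurrence: start of the footer
body = raw[start:end]              # slice away header and footer
```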