More familiar with Python

O'Reilly's "Natural Language Processing with Python" is my main textbook for learning NLTK. I refer to it when I am working in my MacBook environment.

I am still reading Chapter 1, and I always start with this.

>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Today's sample is sent7, which is automatically created when nltk.book is imported.

sent7 is a list object with the following elements.

>>> sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

Extract elements whose length is less than 4.

>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']

Extract elements whose length is less than or equal to 4.

>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']

Extract elements whose length is equal to 4, and those whose length is not equal to 4.

>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', '29', '.']
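These length-based filters can be reproduced on any plain Python list, without loading the NLTK corpora. A minimal sketch, using a literal list in place of sent7:

```python
# A plain list standing in for sent7; no NLTK download needed.
words = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will',
         'join', 'the', 'board', 'as', 'a', 'nonexecutive',
         'director', 'Nov.', '29', '.']

shorter = [w for w in words if len(w) < 4]    # strictly shorter than 4
exactly = [w for w in words if len(w) == 4]   # exactly 4 characters

print(shorter)  # [',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
print(exactly)  # ['will', 'join', 'Nov.']
```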

Trying more complex conditions.

>>> sorted([w for w in set(text1) if w.endswith('ableness')])
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
>>> sorted([term for term in set(text4) if 'gnt' in term])
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted([item for item in set(text6) if item.istitle()])
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', ....

The first one extracts words that end with 'ableness'. The second one extracts words that contain 'gnt'. The last one uses istitle(), which is True for titlecased words: an uppercase first letter followed by lowercase letters.
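These string methods work on any Python string, so they can be tried without the corpora. A small sketch with a made-up word list (the words are chosen only to make each test fire):

```python
# Hypothetical word list to demonstrate the three string tests.
words = ['comfortableness', 'sovereignty', 'Aaaah', 'hello', 'Nov.']

ending = [w for w in words if w.endswith('ableness')]  # suffix test
containing = [w for w in words if 'gnt' in w]          # substring test
titled = [w for w in words if w.istitle()]             # titlecase test

print(ending)      # ['comfortableness']
print(containing)  # ['sovereignty']
print(titled)      # ['Aaaah', 'Nov.']
```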

Move on to next samples.

isdigit() checks whether an element consists only of digits. The next two use combined conditions: containing both '-' and 'index', and being titlecased with a length greater than 10.

>>> sorted([item for item in set(sent7) if item.isdigit()])
['29', '61']
>>> sorted([w for w in set(text7) if '-' in w and 'index' in w])
['Stock-index', 'index-arbitrage', 'index-fund', 'index-options', 'index-related', 'stock-index']
>>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10])
['Abelmizraim', 'Allonbachuth', 'Beerlahairoi', 'Canaanitish', 'Chedorlaomer', 'Girgashites', 'Hazarmaveth', 'Hazezontamar', 'Ishmeelites', 'Jegarsahadutha', 'Jehovahjireh', 'Kirjatharba', 'Melchizedek', 'Mesopotamia', 'Peradventure', 'Philistines', 'Zaphnathpaaneah']
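Chaining conditions with `and` works the same way on a plain list. A sketch with a few invented words (only 'Mesopotamia' is taken from the output above):

```python
# Small hand-made list; both filters below combine two tests with `and`.
words = ['Stock-index', 'index-fund', 'index', 'Mesopotamia', 'cat']

# Both substrings must be present.
both = sorted([w for w in set(words) if '-' in w and 'index' in w])
print(both)  # ['Stock-index', 'index-fund']

# Titlecased and longer than 10 characters.
long_title = sorted([w for w in set(words) if w.istitle() and len(w) > 10])
print(long_title)  # ['Mesopotamia']
```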

This condition excludes words that consist only of lowercase letters.

>>> sorted([w for w in set(text7) if not w.islower()])
['!', '#', '$', '%', '&', "'", "''", "'82", "'86", "'S", '*', '*-1', '*-10', '*-100', '*-101', '*-102', '*-103', '*-104', '*-105', '*-106', '*-107', '*-108',...'Asia', 'Asian', 'Asians', 'Asked', 'Aslacton', 'Assets', 'Assistant', 'Associates', 'Association', 'Assuming', 'Assurance', 'At', 'Atlanta', 'Atlanta-based', 'Atlantic', 'Atsushi', 'Attorney', 'Attorneys', 'Attwood', ...

These should be the words in text2 that contain 'cie' or 'cei'.

>>> sorted([t for t in set(text2) if 'cie' in t or 'cei' in t])
['ancient', 'ceiling', 'conceit', 'conceited', 'conceive', 'conscience', 'conscientious', 'conscientiously', 'deceitful', 'deceive', 'deceived', 'deceiving', 'deficiencies', 'deficiency', 'deficient', 'delicacies', 'excellencies', 'fancied', 'insufficiency', 'insufficient', 'legacies', 'perceive', 'perceived', 'perceiving', 'prescience', 'prophecies', 'receipt', 'receive', 'received', 'receiving', 'society', 'species', 'sufficient', 'sufficiently', 'undeceive', 'undeceiving']

Convert every word to upper case, but the output is far too long.

>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', '(', 'SUPPLIED', 'BY', 'A', 'LATE', 'CONSUMPTIVE', 'USHER', 'TO', 'A', 'GRAMMAR', 'SCHOOL', ')', 'THE', 'PALE', 'USHER', '--', 'THREADBARE', 'IN', 'COAT', ',', 'HEART', ',',...

Many entries are duplicated, so perhaps I should use set(text1)?
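Wrapping the expression in set() does remove the duplicates. A sketch on a short made-up fragment (loosely echoing the lines above) rather than the full text1:

```python
# Hand-made fragment with deliberate duplicates ('in', 'coat', ',').
words = ['The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ',',
         'heart', ',', 'body', 'and', 'brain', 'in', 'coat']

total = len([w.upper() for w in words])      # upper() alone keeps duplicates
unique = len(set(w.upper() for w in words))  # set() collapses them

print(total)   # 15
print(unique)  # 12
```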

Here are several ways to count the words in the sample data.

>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set([word.lower() for word in text1]))
17231
>>> len(set([word.lower() for word in text1 if word.isalpha()]))
16948

len(text1) is the total number of words (tokens); duplicates are counted.

len(set(text1)) is the number of unique words; duplicates are removed before counting.

The third one converts each word to lower case before counting, so, for example, "They" and "they" are recognised as the same word. The fourth one, len(set([word.lower() for word in text1 if word.isalpha()])), additionally keeps only purely alphabetic words, excluding numbers and punctuation.
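The four counts can be reproduced on a tiny hand-made sample, which makes it easy to see exactly which duplicates each step removes:

```python
# Made-up sample: 'the'/'dog' repeat, 'They'/'they' differ only in case.
sample = ['They', 'saw', 'the', 'dog', '.', 'they', 'fed', 'the', 'dog', '!']

n_tokens = len(sample)                                        # all tokens
n_types = len(set(sample))                                    # unique tokens
n_lower = len(set(w.lower() for w in sample))                 # case-folded
n_alpha = len(set(w.lower() for w in sample if w.isalpha()))  # letters only

print(n_tokens)  # 10: duplicates included
print(n_types)   # 8: 'the' and 'dog' each counted once
print(n_lower)   # 7: 'They' and 'they' now merge
print(n_alpha)   # 5: '.' and '!' dropped
```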

This is the combination of a for loop and an if statement.

>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print xyzzy
...
Call
Ishmael

>>> for token in sent1:
...     if token.islower():
...         print token, 'is a lowercase word'
...     elif token.istitle():
...         print token, 'is a titlecase word'
...     else:
...         print token, 'is punctuation'
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
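The sessions above use Python 2, where print is a statement. In Python 3 it is a function, so the same loop would look like this (with sent1 written out as a literal list, so no NLTK import is needed):

```python
# sent1 from the book, written out as a plain list.
sent1 = ['Call', 'me', 'Ishmael', '.']

def classify(token):
    # Same three-way test as the interactive session above.
    if token.islower():
        return 'lowercase'
    elif token.istitle():
        return 'titlecase'
    else:
        return 'punctuation'

for token in sent1:
    print(token, 'is', classify(token))
```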

This part is not difficult to understand if you have experience with another programming language.