Exercise: Chapter 2 (1-7)

Although it took a long time, I have now reached the end of Chapter 2 in the whale book.

1.

>>> words1 = ['green', 'yellow', 'red', 'white', 'black']
>>> words2 = ['pink', 'brown']
>>> words3 = words1 + words2
>>> words3
['green', 'yellow', 'red', 'white', 'black', 'pink', 'brown']
>>> words2 * 2
['pink', 'brown', 'pink', 'brown']
>>> words3[2]
'red'
>>> words3[:2]
['green', 'yellow']
>>> words3[-2]
'pink'
>>> words3[-2:]
['pink', 'brown']
>>> words3[2:4]
['red', 'white']
>>> ' '.join(words1)
'green yellow red white black'
>>> sorted(words3)
['black', 'brown', 'green', 'pink', 'red', 'white', 'yellow']

2.

austen-persuasion.txt is in the gutenberg corpus.

>>> len(nltk.corpus.gutenberg.words('austen-persuasion.txt'))
98171
>>> ap = nltk.corpus.gutenberg.words('austen-persuasion.txt')
>>> len(ap)
98171
>>> len(set(ap))
6132
>>> 

Ans: austen-persuasion.txt consists of 98,171 word tokens and 6,132 word types.
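
The two counts above can be wrapped in a small helper. This is only a sketch: the name vocabulary_stats and the toy token list are made up for illustration, and the third number (lowercased, alphabetic-only types) is an extra that the exercise does not ask for. Keep in mind that len(set(...)) is case-sensitive and counts punctuation tokens as types too.

```python
def vocabulary_stats(tokens):
    """Return (token count, type count, lowercased alphabetic type count)."""
    total = len(tokens)
    types = len(set(tokens))  # case-sensitive, punctuation included
    word_types = len(set(t.lower() for t in tokens if t.isalpha()))
    return total, types, word_types

# Toy tokens for illustration only, not taken from the corpus.
toy_tokens = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']
print(vocabulary_stats(toy_tokens))

# Average number of uses per type, from the two counts reported above.
tokens_per_type = 98171 / float(6132)
print(round(tokens_per_type, 2))
```

With NLTK loaded, the same helper could presumably be called on the corpus view directly, e.g. vocabulary_stats(ap).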

3.

>>> bc = nltk.corpus.brown.words()
>>> wt = nltk.corpus.webtext.words()
>>> bc
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> wt
['Cookie', 'Manager', ':', '"', 'Don', "'", 't', ...]
>>> len(set(bc))
56057
>>> len(set(wt))
21537
>>> fdistbc = nltk.FreqDist([w.lower() for w in bc])
>>> fdistwt = nltk.FreqDist([w.lower() for w in wt])
>>> wh_words = ['what', 'why', 'when', 'which', 'who', 'how']
>>> for m in wh_words:
...     print m + ':', fdistbc[m]
... 
what: 1908
why: 404
when: 2331
which: 3561
who: 2252
how: 834
>>> for m in wh_words:
...     print m + ':', fdistwt[m]
... 
... 
what: 1362
why: 400
when: 1833
which: 134
who: 404
how: 461
>>> fdistbc
<FreqDist with 49815 samples and 1161192 outcomes>
>>> fdistwt
<FreqDist with 17414 samples and 396736 outcomes>
>>> fdistbc.max()
'the'
>>> fdistbc['the']
69971
>>> fdistbc.freq('the')
0.06025790739171472
>>> fdistwt.max()
'.'
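
The raw counts are hard to compare because the two corpora differ in size (1,161,192 versus 396,736 outcomes). A minimal sketch of one way to normalise them follows; the counts and totals are copied from the session above, while the helper name per_10k and the choice of "per 10,000 tokens" are just assumptions for illustration.

```python
# Counts and totals copied from the interpreter session above.
brown_counts = {'what': 1908, 'why': 404, 'when': 2331,
                'which': 3561, 'who': 2252, 'how': 834}
web_counts = {'what': 1362, 'why': 400, 'when': 1833,
              'which': 134, 'who': 404, 'how': 461}
brown_total = 1161192
web_total = 396736

def per_10k(counts, total):
    """Convert raw counts to occurrences per 10,000 tokens."""
    return dict((w, c * 10000.0 / total) for w, c in counts.items())

brown_rate = per_10k(brown_counts, brown_total)
web_rate = per_10k(web_counts, web_total)

for w in sorted(brown_counts):
    print('{0:6} brown {1:6.2f}  webtext {2:6.2f}'.format(
        w, brown_rate[w], web_rate[w]))
```

On these numbers, 'what', 'why', 'when' and 'how' are relatively more frequent in webtext, while 'which' and 'who' are relatively more frequent in Brown. The totals include punctuation tokens, so the rates are per token rather than per word.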

4.

>>> from nltk.corpus import state_union
>>> state_union.fileids()
['1945-Truman.txt', '1946-Truman.txt', '1947-Truman.txt', '1948-Truman.txt', '1949-Truman.txt', '1950-Truman.txt', '1951-Truman.txt', '1953-Eisenhower.txt', '1954-Eisenhower.txt', '1955-Eisenhower.txt', '1956-Eisenhower.txt', '1957-Eisenhower.txt', '1958-Eisenhower.txt', '1959-Eisenhower.txt', '1960-Eisenhower.txt', '1961-Kennedy.txt', '1962-Kennedy.txt', '1963-Johnson.txt', '1963-Kennedy.txt', '1964-Johnson.txt', '1965-Johnson-1.txt', '1965-Johnson-2.txt', '1966-Johnson.txt', '1967-Johnson.txt', '1968-Johnson.txt', '1969-Johnson.txt', '1970-Nixon.txt', '1971-Nixon.txt', '1972-Nixon.txt', '1973-Nixon.txt', '1974-Nixon.txt', '1975-Ford.txt', '1976-Ford.txt', '1977-Ford.txt', '1978-Carter.txt', '1979-Carter.txt', '1980-Carter.txt', '1981-Reagan.txt', '1982-Reagan.txt', '1983-Reagan.txt', '1984-Reagan.txt', '1985-Reagan.txt', '1986-Reagan.txt', '1987-Reagan.txt', '1988-Reagan.txt', '1989-Bush.txt', '1990-Bush.txt', '1991-Bush-1.txt', '1991-Bush-2.txt', '1992-Bush.txt', '1993-Clinton.txt', '1994-Clinton.txt', '1995-Clinton.txt', '1996-Clinton.txt', '1997-Clinton.txt', '1998-Clinton.txt', '1999-Clinton.txt', '2000-Clinton.txt', '2001-GWBush-1.txt', '2001-GWBush-2.txt', '2002-GWBush.txt', '2003-GWBush.txt', '2004-GWBush.txt', '2005-GWBush.txt', '2006-GWBush.txt']
>>> 

It seems that the first four characters of each file ID stand for the year.

>>> [fileid[:4] for fileid in state_union.fileids()]
['1945', '1946', '1947', '1948', '1949', '1950', '1951', '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960', '1961', '1962', '1963', '1963', '1964', '1965', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2001', '2002', '2003', '2004', '2005', '2006']
>>> 

It's time to use ConditionalFreqDist().

>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in state_union.fileids()
...     for w in state_union.words(fileid)
...     for target in ['men', 'women', 'people']
...     if w.lower() == target)
>>> cfd.tabulate()
       1945 1946 1947 1948 1949 1950 1951 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
   men    2   12    7    5    2    6    8    3    2    4    2    5    2    4    2    6    6    8    3   19   12   11    4    5    2    1    1    1    0    0    3    2    0    0    1    1    1    3    3    1    2    1    1    2    3    9    4    1    1    1    2    1    2    2    5    4    3    6    7    8    7
people   10   49   12   22   15   15   10   17   15   26   30   11   19   11   10   10   10   15    3   30   35   25   17    6   23   34    9   10   22   14   18   19   26   15   12   11   17   19   27   12   14   24   17   13    9   27   27   45   66   73   43   31   22   22   41   27   14   33   21   18   22
 women    2    7    2    1    1    2    2    0    0    0    2    2    1    1    0    0    2    5    1    3    1    1    0    2    0    0    0    0    0    0    1    1    1    1    2    1    2    7    5    1    2    0    0    3    2    9    4    2    1    3    3    2    2    3    7    6    6    4    8   11    7
>>> 

Yes, I can read the numbers off the table, but I prefer a more graphical view.

>>> cfd.plot()

[Figure 1: output of cfd.plot()]

What can I say from the results?

  • 'people' was used most intensively in the mid-1990s (1994, 1995).
  • 'women' has been used more frequently since 1978.
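
To make clear what the ConditionalFreqDist above is counting, here is a rough standard-library equivalent. It is a sketch, not NLTK's implementation: count_by_year is a hypothetical helper, and the token lists are invented (only the file names come from the listing above). It also shows that two files sharing a year prefix, such as the two 1963 addresses, are pooled into one column.

```python
from collections import Counter, defaultdict

def count_by_year(docs, targets):
    """Map each target word to a Counter of {year: occurrences}."""
    counts = defaultdict(Counter)
    for fileid, words in docs.items():
        year = fileid[:4]  # first four characters of the file ID
        for w in words:
            w = w.lower()
            if w in targets:
                counts[w][year] += 1
    return counts

# Real file names, invented token lists (for illustration only).
toy_docs = {
    '1963-Johnson.txt': ['The', 'people', 'and', 'the', 'men'],
    '1963-Kennedy.txt': ['People', 'of', 'America'],
    '1964-Johnson.txt': ['Women', 'and', 'men', 'and', 'people'],
}

counts = count_by_year(toy_docs, set(['men', 'women', 'people']))
for word in sorted(counts):
    print(word, sorted(counts[word].items()))
```

Because longer speeches naturally contain more of every word, dividing each count by the length of that year's text might give a fairer comparison than the raw counts.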

5.

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('computer.n.01').part_meronyms()
[Synset('computer_accessory.n.01'), Synset('data_converter.n.01'), Synset('hardware.n.03'), Synset('keyboard.n.01'), Synset('chip.n.07'), Synset('memory.n.04'), Synset('disk_cache.n.01'), Synset('diskette.n.01'), Synset('busbar.n.01'), Synset('computer_circuit.n.01'), Synset('monitor.n.04'), Synset('central_processing_unit.n.01'), Synset('peripheral.n.01'), Synset('cathode-ray_tube.n.01')]
>>> wn.synset('computer.n.01').substance_meronyms()
[]
>>> wn.synset('computer.n.01').member_holonyms()
[]
>>> wn.synset('laptop.n.01')
Synset('laptop.n.01')
>>> wn.synset('laptop.n.01').definition
'a portable computer small enough to use in your lap'
>>> wn.synset('laptop.n.01').part_meronyms()
[]
>>> wn.synset('fish.n.01').definition
'any of various mostly cold-blooded aquatic vertebrates usually having scales and breathing through gills'
>>> wn.synset('fish.n.01').part_meronyms()
[Synset('tail_fin.n.03'), Synset('milt.n.02'), Synset('fin.n.06'), Synset('fishbone.n.01'), Synset('roe.n.02'), Synset('fish_scale.n.01'), Synset('lateral_line.n.01')]
>>> wn.synset('fish.n.01').substance_meronyms()
[]
>>> wn.synset('picture.n.01').definition
'a visual representation (of an object or scene or person or abstraction) produced on a surface'
>>> wn.synset('picture.n.01').member_meronyms()
[]
>>> wn.synset('picture.n.01').part_meronyms()
[]
>>> wn.synset('picture.n.01').substance_meronyms()
[]
>>> wn.synset('picture.n.01').member_holonyms()
[]
>>> wn.synset('picture.n.01').part_holonyms()
[]
>>> wn.synset('picture.n.01').substance_holonyms()
[]
>>> 

I cannot find a good example...
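
As far as I understand, a holonym relation is the inverse of the corresponding meronym relation: if X is a part of Y, then Y is a part holonym of X. The sketch below illustrates that idea with plain dictionaries only. It does not call WordNet; the data is a hand-copied subset of the part_meronyms() results above, and invert is a hypothetical helper.

```python
from collections import defaultdict

# A few part meronyms copied from the session above (not the full lists).
part_meronyms = {
    'computer.n.01': ['keyboard.n.01', 'memory.n.04', 'monitor.n.04'],
    'fish.n.01': ['fin.n.06', 'fish_scale.n.01', 'roe.n.02'],
}

def invert(relation):
    """Turn a whole -> parts mapping into a part -> wholes mapping."""
    inverse = defaultdict(list)
    for whole, parts in relation.items():
        for part in parts:
            inverse[part].append(whole)
    return dict(inverse)

part_holonyms = invert(part_meronyms)
print(part_holonyms['keyboard.n.01'])
print(part_holonyms['fin.n.06'])
```

In NLTK itself, one would expect wn.synset('keyboard.n.01').part_holonyms() to include computer.n.01, though that call was not run here.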

6.

>>> from nltk.corpus import swadesh
>>> de2en = swadesh.entries(['de', 'en'])
>>> it2en = swadesh.entries(['it', 'en'])
>>> translate2 = dict(de2en)
>>> translate2.update(dict(it2en))
>>> len(translate2)
411
>>> translate2['bianco']
'white'
>>> translate2['Hund']
'dog'
>>> 

A possible problem: if the same word exists in both 'de' and 'it' with different English meanings, update() overwrites the German entry with the Italian one, although 'de' should arguably have higher priority. Maybe each language should have a separate dictionary? There must be better answers to this question, but I have no better idea...
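
The overwriting problem, and the "separate dictionary" idea, can be sketched with toy data. 'Hund' and 'bianco' are from the session above; the two 'da' entries are invented to force a collision and are not claimed to be in the Swadesh lists. translate_word is a hypothetical helper.

```python
# Toy word pairs; the 'da' entries are invented to force a collision.
de2en = [('Hund', 'dog'), ('da', 'there')]
it2en = [('bianco', 'white'), ('da', 'from')]

# Merging as in the session: the later update() wins for shared keys.
merged = dict(de2en)
merged.update(dict(it2en))
print(merged['da'])  # the German meaning has been overwritten

# Alternative: one dictionary per language, selected by a language code.
by_language = {'de': dict(de2en), 'it': dict(it2en)}

def translate_word(word, language):
    """Look up a word in the dictionary of one language (None if absent)."""
    return by_language[language].get(word)

print(translate_word('da', 'de'))
print(translate_word('da', 'it'))
```

The price of this design is that the caller has to know the source language; swapping the update order would instead give German priority but still lose the Italian meaning.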

Try some more.

>>> itonly = swadesh.entries(['it'])
>>> sorted(itonly)
[('a',), ('acqua',), ('aguzzo, affilato',), ('ala',), ('albero',), ('alcuni',), ('altro',), ('animale',),
.... 
('uomo',), ('uomo',), ('uovo',), ('vecchio',), ('vedere',), ('venire',), ('vento',), ('verde',), ('verme',), ('vicino',), ('vivere',), ('voi',), ('volare',), ('vomitare',)]
>>> translate2['uomo']
'man (human being)'
>>> translate2['uomo'][1]
'a'

I found that some words are duplicated in the Italian list. How can I get the second one? The example above is not the right way: indexing with [1] only returns the second character of the string 'man (human being)'.
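
One way to keep every translation of a duplicated word is to map each word to a list of glosses instead of a single string. This is a sketch with toy pairs: 'uomo' and 'man (human being)' come from the session above, while the other glosses are assumptions about what the Swadesh list contains.

```python
from collections import defaultdict

# Toy pairs; only ('uomo', 'man (human being)') is confirmed by the session.
pairs = [
    ('uomo', 'man (adult male)'),   # assumed gloss for the first 'uomo'
    ('uomo', 'man (human being)'),
    ('uovo', 'egg'),                # assumed gloss
]

# A plain dict keeps only the last value seen for a repeated key.
plain = dict(pairs)
print(plain['uomo'])
print(plain['uomo'][1])  # second character of the string, not a second entry

# A dict of lists keeps all of them, in order.
multi = defaultdict(list)
for it_word, en_word in pairs:
    multi[it_word].append(en_word)
print(multi['uomo'])
print(multi['uomo'][1])
```

With NLTK, the same loop could presumably be run over swadesh.entries(['it', 'en']) instead of the toy pairs.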

7.

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1.concordance('however')
Building index...

Displaying 25 of 95 matches:
gledy - piggledy whale statements , however authentic , in these extracts , for
lave ? Tell me that . Well , then , however the old sea - captains may order me
ea - captains may order me about -- however they may thump and punch me about ,
 needs be the sign of " The Trap ." However , I picked myself up and hearing a 
 the conclusion that such an idea , however wild , might not be altogether unwa
 most obstreperously . I observed , however , that one of them held somewhat al
ade on the sea . In a few minutes , however , he was missed by his shipmates , 
bag ' s mouth . This accomplished , however , he turned round -- when , good he
te man into a purplish yellow one . However , I had never been in the South Sea
tle in the matter of my bedfellow . However , a good laugh is a mighty good thi
ight of the water it had absorbed . However , hat and coat and overshoes were o
pulpit , it had not escaped me that however convenient for a ship , these joint
lf baptized again . For the nonce , however , he proposed to sail about , and s
 own and comrade ' s bill ; using , however , my comrade ' s money . The grinni
in to say it was on the starboard . However , by dint of beating about a little
a supper for us both on one clam ?" However , a warm savory steam from the kitc
 owners till all is ready for sea . However , it is always as well to have a lo
fectly as he was known to me then . However , my thoughts were at length carrie
 I got down our traps , resolving , however , to sleep ashore till the last . B
 em !" " No need of profane words , however great the hurry , Peleg ," said Bil
a pilot . I was comforting myself , however , with the thought that in pious Bi
isely -- who knows ? Certain I am , however , that a king ' s head is solemnly 
o scientific description . As yet , however , the sperm whale , scientific or p
IZONTAL TAIL . There you have him . However contracted , that definition is the
several varieties , most of which , however , are little known . Broad - nosed 
>>> 
>>> text2.concordance('however')
Displaying 25 of 155 matches:
hters . He meant not to be unkind , however , and , as a mark of his affection 
e condition of visitors . As such , however , they were treated by her with qui
le ." His wife hesitated a little , however , in giving her consent to this pla
urned Mrs . John Dashwood . " But , however , ONE thing must be considered . Wh
 can ever afford to live in . But , however , so it is . Your father thought on
ce inquiry or remark . Conversation however was not wanted , for Sir John was v
sary to the happiness of both ; for however dissimilar in temper and outward be
al engagements at home and abroad , however , supplied all the deficiencies of 
s silent and grave . His appearance however was not unpleasing , in spite of hi
n their own house . One consolation however remained for them , to which the ex
 in the country ? That is good news however ; I will ride over tomorrow , and a
 ever so rich . I am glad to find , however , from what you say , that he is a 
t to the excellence of such works , however disregarded before . Their taste wa
ly excited by her sister ; and that however a general resemblance of dispositio
d Marianne . " Do not boast of it , however ," said Elinor , " for it is injust
t will be any satisfaction to you , however , to be told , that I believe his c
wo wives , I know not . A few years however will settle her opinions on the rea
n his side impossible . His concern however was very apparent ; and after expre
d her husband and mother . The idea however started by her , was immediately pu
 are determined on anything . But , however , I hope you will think better of i
 I can guess what his business is , however ," said Mrs . Jennings exultingly .
o unfortunate an event ; concluding however by observing , that as they were al
r . Willoughby ." " Mr . Willoughby however is the only person who can have a r
sed in him . There is great truth , however , in what you have now urged of the
iced by him ." " Do not blame him , however , for departing from his character 

It looks like in most cases "however" is not located at the beginning of the sentence. Even when it is at the beginning of a sentence, the meaning is something like 'but' or 'although'.
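
Instead of eyeballing the concordance, the positions could be counted. The sketch below is a rough heuristic, not a proper sentence splitter: it treats "however" as sentence-initial when the previous token is one of a small, assumed set of sentence-ending tokens. however_positions and SENTENCE_END are hypothetical names, and the toy tokens are two fragments taken from the text1 concordance lines above.

```python
# Tokens assumed to close a sentence in NLTK's tokenisation of these texts.
SENTENCE_END = set(['.', '!', '?', '."', '!"', '?"'])

def however_positions(tokens, word='however'):
    """Return (sentence-initial count, other count) for the given word."""
    initial = 0
    other = 0
    for i, tok in enumerate(tokens):
        if tok.lower() != word:
            continue
        if i == 0 or tokens[i - 1] in SENTENCE_END:
            initial += 1
        else:
            other += 1
    return initial, other

# Two fragments from the text1 concordance above, joined together.
toy_tokens = ['The', 'Trap', '."', 'However', ',', 'I', 'picked', 'myself', 'up',
              'I', 'observed', ',', 'however', ',', 'that', 'one', 'of', 'them']
print(however_positions(toy_tokens))
```

With the book's texts loaded, calling however_positions(text1.tokens) and however_positions(text2.tokens) should give the split for all matches rather than just the 25 displayed. Deciding what each sentence-initial "However" means still requires reading the sentences.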