Vocabulary resources 2 - Deutschina's Tech Diary

Continuing O'Reilly's textbook chapter 2.4.2:

>>> entries = nltk.corpus.cmudict.entries()
>>> len(entries)
133737
>>> for entry in entries[39943:39951]:
...     print entry
... 
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])
>>>

len(entries) is larger than the figure in the textbook, so the list must have been extended since the book was written. Can I look up a word in entries using index()?

>>> entries.index('fire')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/util.py", line 704, in index
    raise ValueError('index(x): x not in list')
ValueError: index(x): x not in list
>>>
>>> print('fire' in entries)
False
>>> print('explorer' in entries)
False

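It seems a different approach is needed to get the index. Each item in entries is a tuple of (word, pronunciation), so a bare string such as 'fire' never matches an item. I searched for how to look things up in a list of tuples and got some information from this link.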

http://stackoverflow.com/questions/946860/using-pythons-list-index-method-on-a-list-of-tuples-or-objects

>>> [x for x, y in enumerate(entries) if y[0]=='fire']
[42372, 42373]
>>> for entry in entries[42371:42379]:
...     print entry
... 
('fir', ['F', 'ER1'])
('fire', ['F', 'AY1', 'ER0'])
('fire', ['F', 'AY1', 'R'])
('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z'])
('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z'])
('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])
>>>
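The lookup can be reproduced without loading the corpus. The sketch below uses a few entries copied from the output above; `sample_entries` and `indices_of` are names I made up for illustration and are not part of NLTK. It shows why `in` is False for a bare string and how enumerate() collects every matching position.

```python
# A tiny stand-in for nltk.corpus.cmudict.entries(), copied from the output above.
sample_entries = [
    ('fir', ['F', 'ER1']),
    ('fire', ['F', 'AY1', 'ER0']),
    ('fire', ['F', 'AY1', 'R']),
    ('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M']),
]


def indices_of(word, entries):
    """Return every position whose entry has `word` as its first element."""
    return [i for i, (w, pron) in enumerate(entries) if w == word]


# Each element is a (word, pronunciation) tuple, so a bare string never matches.
found_as_string = 'fire' in sample_entries
# A complete tuple does match, because tuples compare element by element.
found_as_tuple = ('fire', ['F', 'AY1', 'R']) in sample_entries
# Positions are relative to this small sample, not to the full corpus.
positions = indices_of('fire', sample_entries)

print(positions)
```

Returning a list rather than a single index matters here because a word with several pronunciations appears once per pronunciation.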

Each entry consists of two parts, a word and its pronunciation. This code selects the words with exactly 3 sound elements, then prints the word and its second sound element if the first one is 'P' and the third one is 'T'.

>>> for word, pron in entries:
...     if len(pron) == 3:
...             ph1, ph2, ph3 = pron
...             if ph1 == 'P' and ph3 == 'T':
...                     print word, ph2,
... 
pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1 pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1 pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1
>>>

Let me change the code a little bit: rename the variables in the for statement and remove the trailing comma (,) from the print statement.

>>> for x, y in entries:
...     if len(y) == 3:
...             ph1, ph2, ph3 = y
...             if ph1 == 'P' and ph3 == 'T':   
...                     print x, ph2
... 
pait EY1
pat AE1
pate EY1
patt AE1
peart ER1
peat IY1
peet IY1
peete IY1
pert ER1
pet EH1
pete IY1
pett EH1
piet IY1
piette IY1
pit IH1
pitt IH1
pot AA1
pote OW1
pott AA1
pout AW1
puett UW1
purt ER1
put UH1
putt AH1
>>>
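The trailing-comma trick only exists in the Python 2 print statement. As a rough sketch of the same filter that also runs on Python 3, the matches can be collected first and the separator chosen afterwards. The sample entries are hand-built: the three 'P ? T' pronunciations follow from the output above, and `p_vowel_t` is a hypothetical helper name.

```python
# Hand-built sample; the P ? T pronunciations follow from the output above.
sample_entries = [
    ('pat', ['P', 'AE1', 'T']),
    ('pet', ['P', 'EH1', 'T']),
    ('put', ['P', 'UH1', 'T']),
    ('fir', ['F', 'ER1']),
    ('fire', ['F', 'AY1', 'R']),
]


def p_vowel_t(entries):
    """Collect 'word phone' strings for three-phone words of the form P ? T."""
    matches = []
    for word, pron in entries:
        if len(pron) == 3:
            ph1, ph2, ph3 = pron
            if ph1 == 'P' and ph3 == 'T':
                matches.append(word + ' ' + ph2)
    return matches


matches = p_vowel_t(sample_entries)
one_line = ' '.join(matches)    # same layout as the trailing-comma version
per_line = '\n'.join(matches)   # same layout as the version without the comma

print(one_line)
print(per_line)
```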

I just prefer to display one item per line.

This one extracts words whose pronunciation ends with a syllable that sounds like 'nicks'.

>>> syllable = ['N', 'IH0', 'K', 'S']
>>> [word for word, pron in entries if pron[-4:] == syllable]
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", "endotronics'", 'endotronics', 'enix', 'environics', 'ethnics', 'eugenics', 'fibronics', 'flextronics', 'harmonics', 'hispanics', 'histrionics', 'identics', 'ionics', 'kibbutzniks', 'lasersonics', 'lumonics', 'mannix', 'mechanics', "mechanics'", 'microelectronics', 'minix', 'minnix', 'mnemonics', 'mnemonics', 'molonicks', 'mullenix', 'mullenix', 'mullinix', 'mulnix', "munich's", 'nucleonics', 'onyx', 'organics', "panic's", 'panics', 'penix', 'pennix', 'personics', 'phenix', "philharmonic's", 'phoenix', 'phonics', 'photronics', 'pinnix', 'plantronics', 'pyrotechnics', 'refuseniks', "resnick's", 'respironics', 'sconnix', 'siliconix', 'skolniks', 'sonics', 'sputniks', 'technics', 'tectonics', 'tektronix', 'telectronics', 'telephonics', 'tonics', 'unix', "vinick's", "vinnick's", 'vitronics']
>>>
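One detail worth checking is what pron[-4:] does for pronunciations shorter than four sound elements. A small sketch, where `ends_with` is a name of my own and the pronunciation given for 'unix' is my assumption rather than something copied from the corpus:

```python
syllable = ['N', 'IH0', 'K', 'S']

sample_entries = [
    ('fir', ['F', 'ER1']),                          # from the output above
    ('fire', ['F', 'AY1', 'R']),                    # from the output above
    ('unix', ['Y', 'UW1', 'N', 'IH0', 'K', 'S']),   # assumed pronunciation
]


def ends_with(pron, suffix):
    """True if the list `pron` ends with the non-empty list `suffix`."""
    # Slicing never raises IndexError: a short list just yields a short slice.
    return pron[-len(suffix):] == suffix


short_slice = ['F', 'ER1'][-4:]
nicks_words = [word for word, pron in sample_entries if ends_with(pron, syllable)]

print(nicks_words)
```

Because the slice of a short list is simply the whole list, short words fail the comparison quietly instead of raising an error.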

These two examples are also interesting.

>>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']
['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn']
>>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))
['gn', 'kn', 'mn', 'pn']
>>>

The first one lists words that end with the letter 'n' where the 'n' is not pronounced and the last sound element is 'M'. The second one lists the first two letters of words whose first sound element is 'N' although the spelling does not start with 'n', i.e. the first letter is silent.

This example extracts words that meet the following conditions:

The pronunciation has 5 stress digits, one per vowel sound, so the word has 5 syllables (not 5 sound elements in total)
The primary stress (1) is on the second syllable
The secondary stress (2) is on the fourth syllable

>>> def stress(pron):
...     return [char for phone in pron for char in phone if char.isdigit()]
... 
>>> [w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating', 'accelerator', 'accelerators', 'accentuated', 
...
'uncompensated', 'uncomplicated', 'uneducated', 'unmitigated', 'unnecessary', 'unprecedented', 'unregulated', 'unsanitary', 'unsatisfying', 'unsaturated', 'velociraptor', 'vocabulary', 'voluntarism']
>>>
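To see what stress() returns, here is a minimal sketch that applies the same function to two pronunciations copied from the outputs earlier in this post. Only vowel sounds carry a digit, so the result has one item per vowel, not one per sound element.

```python
def stress(pron):
    """Return the stress digits found in a pronunciation, in order."""
    return [char for phone in pron for char in phone if char.isdigit()]


# Both pronunciations are copied from the entries printed earlier.
fire = ['F', 'AY1', 'ER0']
explosively = ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0']

fire_stress = stress(fire)                  # 3 sound elements, 2 vowels
explosively_stress = stress(explosively)    # 11 sound elements, 4 vowels

print(fire_stress)
print(explosively_stress)
```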

Slightly change the condition by swapping the primary and secondary stress:

>>> [w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient', 'abortifacients', 'academicians', 'accommodation', 'accommodations', 'accreditation', 'accreditations', 'accumulation', 'accumulations', 'acetylcholine', 
....
'uenohara', 'undiplomatic', 'uneconomic', 'unpatriotic', 'unrepresented', 'unscientific', 'unsentimental', 'unsympathetic', 'wakabayashi', 'yekaterinburg']

Some Japanese surnames are included. U-e-no-ha-ra, Wa-ka-ba-ya-shi: five syllables each, so yes, it seems correct.

>>> p3 = [(pron[0]+'-'+pron[2], word)
...     for (word, pron) in entries
...     if pron[0] == 'P' and len(pron) == 3]
>>> cfd = nltk.ConditionalFreqDist(p3)
>>> for template in cfd.conditions():
...     if len(cfd[template]) > 10:
...             words = cfd[template].keys()
...             wordlist = ' '.join(words)
...             print template, wordlist[:70] + "..."
... 
P-CH patch pautsch peach perch petsch petsche piche piech pietsch pitch pit...
P-K pac pack paek paik pak pake paque peak peake pech peck peek perc perk ...
P-L pall pahl pail paille pal pale paul paule paull peal peale pearl pearl...
P-N paign pain paine pan pane pawn payne peine pen penh penn pin pine pinn...
P-P paap paape pap pape papp paup peep pep pip pipe pipp poop pop pope pop...
P-R paar pair par pare parr pear peer pier poor poore por pore porr pour...
P-S puss pace pass pasts peace pearse pease perce pers perse pesce piece p...
P-T pait pat pate patt peart peat peet peete pert pet pete pett piet piett...
P-UW1 peru peugh pew plew plue prew pru prue prugh pshew pugh...
P-Z p's p.'s p.s pais paiz pao's pas pause paws pays paz peas pease pei's ...
>>>

I need to understand this step by step. p3 is a list of tuples, each with two elements. The first element is the 'template': the first sound element (always 'P' here) and the third sound element joined with '-'. The second element is the word itself; only words with exactly 3 sound elements are included.

>>> p3
[('P-P', 'paap'), ('P-P', 'paape'), ('P-R', 'paar'), ('P-SH', 'paasch'), ('P-K', 'pac'), ('P-S', 'pace'), ('P-K', 'pack'), ('P-D', 'pad'), ('P-K', 'paek'), ('P-TH', 'paeth'), ('P-F', 'paff'), ('P-JH', 'page'), ('P-L', 'pahl'), 
....
('P-G', 'pug'), ('P-UW1', 'pugh'), ('P-L', 'puhl'), ('P-G', 'puig'), ('P-L', 'pull'), ('P-N', 'pun'), ('P-NG', 'pung'), ('P-P', 'pup'), ('P-JH', 'purge'), ('P-K', 'purk'), ('P-Z', 'purrs'), ('P-S', 'purse'), ('P-T', 'purt'), ('P-S', 'pus'), ('P-SH', 'pusch'), ('P-SH', 'push'), ('P-S', 'puss'), ('P-S', 'puss'), ('P-T', 'put'), ('P-TH', 'puth'), ('P-CH', 'putsch'), ('P-T', 'putt'), ('P-K', 'pyke'), ('P-L', 'pyle'), ('P-M', 'pymm'), ('P-N', 'pyne'), ('P-ER0', 'pyre')]
>>>
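To get a feeling for what happens to these pairs next, here is a rough stand-in for the grouping step built only from the standard library. It is not how nltk.ConditionalFreqDist is implemented; `group_by_condition` is a hypothetical helper, and the pairs are a few items copied from the p3 output above.

```python
from collections import Counter, defaultdict

# A few (template, word) pairs copied from the p3 output above.
p3_sample = [
    ('P-P', 'paap'), ('P-P', 'paape'), ('P-R', 'paar'),
    ('P-K', 'pac'), ('P-S', 'pace'), ('P-K', 'pack'),
    ('P-S', 'puss'), ('P-S', 'puss'),
]


def group_by_condition(pairs):
    """Map each condition to a Counter of the samples seen under it."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table


table = group_by_condition(p3_sample)
conditions = sorted(table)              # the distinct templates
distinct_p_s = len(table['P-S'])        # number of distinct words, not pairs

print(conditions)
print(distinct_p_s)
```

In this sketch the duplicated ('P-S', 'puss') pair raises a count but does not add a new word, so taking len() of one group gives the number of distinct words under that template.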

After that, a ConditionalFreqDist is created from p3. Let's check what conditions() contains.

>>> cfd.conditions()
['P-AA1', 'P-AH0', 'P-AW1', 'P-AY1', 'P-B', 'P-CH', 'P-D', 'P-ER0', 'P-ER1', 'P-EY1', 'P-F', 'P-G', 'P-IY0', 'P-IY1', 'P-JH', 'P-K', 'P-L', 'P-M', 'P-N', 'P-NG', 'P-OW0', 'P-OW1', 'P-OY1', 'P-P', 'P-R', 'P-S', 'P-SH', 'P-T', 'P-TH', 'P-UW1', 'P-V', 'P-Z']

How about the len() of each condition?

>>> for template in cfd.conditions():
...     print template, len(cfd[template])
... 
P-AA1 1
P-AH0 2
P-AW1 4
P-AY1 3
P-B 1
P-CH 17
P-D 7
P-ER0 5
P-ER1 1
P-EY1 3
P-F 4
P-G 10
P-IY0 1
P-IY1 5
P-JH 4
P-K 26
P-L 39
P-M 6
P-N 18
P-NG 5
P-OW0 2
P-OW1 3
P-OY1 1
P-P 17
P-R 14
P-S 18
P-SH 8
P-T 24
P-TH 8
P-UW1 11
P-V 2
P-Z 22

In the example, only conditions whose len() is greater than 10 are selected. 'P-G', with exactly 10 words, is left out.
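The test in the loop is len(cfd[template]) > 10, which is a strict comparison. A quick check with some of the counts copied from the listing above (`template_sizes` is just a plain dict I typed in, not something NLTK provides):

```python
# Counts copied from the listing above (only a subset of the templates).
template_sizes = {
    'P-CH': 17, 'P-D': 7, 'P-G': 10, 'P-K': 26, 'P-L': 39, 'P-N': 18,
    'P-P': 17, 'P-R': 14, 'P-S': 18, 'P-T': 24, 'P-UW1': 11, 'P-Z': 22,
}

# Same strict comparison as in the loop: exactly 10 does not qualify.
selected = sorted(t for t, n in template_sizes.items() if n > 10)

print(selected)
```

The result has the same ten templates as the lines printed by the original loop, with 'P-G' (exactly 10) and 'P-D' (7) filtered out.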

cfd[template].keys() is a list of words. Let's look at one template, 'P-UW1'.

>>> cfd['P-UW1'].keys()
['peru', 'peugh', 'pew', 'plew', 'plue', 'prew', 'pru', 'prue', 'prugh', 'pshew', 'pugh']
>>> ' '.join(cfd['P-UW1'].keys())
'peru peugh pew plew plue prew pru prue prugh pshew pugh'
>>>

Next is the dictionary data structure. If a word is not in the dictionary, it can be added temporarily.

>>> prondict = nltk.corpus.cmudict.dict()
>>> prondict['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
>>> prondict['blog']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'blog'
>>> prondict['blog'] = [['B', 'L', 'AA1', 'G']]
>>> prondict['blog']
[['B', 'L', 'AA1', 'G']]
>>>
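If the KeyError is not wanted, one option is to look the word up with dict.get(). This is only a sketch on a plain dict that mimics the shape of prondict; `prondict_sample` and `pronunciations` are names I made up, and the values are copied from the session above.

```python
# A plain dict with the same shape as prondict: word -> list of pronunciations.
prondict_sample = {
    'fire': [['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']],
}


def pronunciations(word, prondict):
    """Return the list of pronunciations for `word`, or [] if it is unknown."""
    return prondict.get(word, [])


before = pronunciations('blog', prondict_sample)    # unknown word, no KeyError
# Adding a key only changes this in-memory dict.
prondict_sample['blog'] = [['B', 'L', 'AA1', 'G']]
after = pronunciations('blog', prondict_sample)

print(before)
print(after)
```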

This example displays the sound elements of all the words in text.

>>> text = ['natural', 'language', 'processing']
>>> [ph for w in text for ph in prondict[w][0]]
['N', 'AE1', 'CH', 'ER0', 'AH0', 'L', 'L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH', 'P', 'R', 'AA1', 'S', 'EH0', 'S', 'IH0', 'NG']
>>>

prondict[w][0] was unclear to me at first, but it takes the first pronunciation when a word has multiple pronunciations.
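The role of the [0] is easier to see with a word that has two pronunciations. A minimal sketch on a hand-made dict, where `prondict_sample` is my own stand-in for prondict and the values are copied from the session above:

```python
# Same shape as prondict: each word maps to a list of pronunciations.
prondict_sample = {
    'fire': [['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']],
    'blog': [['B', 'L', 'AA1', 'G']],
}
text = ['fire', 'blog']

# [0] picks the first pronunciation of each word before flattening.
first_only = [ph for w in text for ph in prondict_sample[w][0]]
# The alternative pronunciation of 'fire' is at index 1 and is ignored above.
second_fire = prondict_sample['fire'][1]

print(first_only)
print(second_fire)
```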