Non-simplified tags (5.2.7-) - Deutschina's Tech Diary

Analysing nouns in more detail. This program shows the five most frequent words for each noun tag.

>>> def findtags(tag_prefix, tagged_text):
...     cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix))
...     return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
... 

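>>> # tagdict is not defined in the original post; presumably built as in the NLTK book
>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]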
... 
NN ['year', 'time', 'state', 'week', 'home']
NN$ ["year's", "world's", "state's", "city's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "Administration's", "Army's", "Gallery's", "League's"]
NN-HL ['Question', 'Salary', 'business', 'condition', 'cut']
NN-NC ['aya', 'eva', 'ova']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'Basin', 'Beat', 'City', 'Commissioner']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "janitors'", "men's", "builders'"]
NNS$-HL ["Dealers'", "Idols'"]
NNS$-TL ["Women's", "States'", "Giants'", "Bombers'", "Braves'"]
NNS-HL ['$12,500', '$14', '$37', 'A135', 'Arms']
NNS-TL ['States', 'Nations', 'Masters', 'Bears', 'Communists']
NNS-TL-HL ['Nations']
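
To make clearer what findtags is doing, here is a rough sketch of the same idea using only the standard library. It is written for Python 3 so that it runs on its own, and both the name find_tags_plain and the tiny toy_tagged list are made up for illustration; they are not part of NLTK or the Brown corpus.

```python
from collections import Counter, defaultdict

def find_tags_plain(tag_prefix, tagged_text, top_n=5):
    """Map each tag starting with tag_prefix to its top_n most frequent words."""
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    return {tag: [w for w, _ in c.most_common(top_n)]
            for tag, c in counts.items()}

# Hand-made (word, tag) pairs, only for illustration
toy_tagged = [
    ('year', 'NN'), ('time', 'NN'), ('year', 'NN'),
    ('years', 'NNS'), ("year's", 'NN$'), ('said', 'VBD'),
]

tagdict_plain = find_tags_plain('NN', toy_tagged)
for tag in sorted(tagdict_plain):
    print(tag, tagdict_plain[tag])
```

The nested Counter plays the role of the ConditionalFreqDist here: one frequency table per tag.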

Next, check which words come after 'often'.

>>> brown_learned_text = brown.words(categories='learned')
>>> sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming', 'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', 'classified', 'colorful', 'composed', 'contain', 'differed', 'difficult', 'encountered', 'enough', 'equate', 'extremely', 'found', 'happens', 'have', 'ignored', 'in', 'involved', 'more', 'needed', 'nightly', 'observed', 'of', 'on', 'out', 'quite', 'represent', 'responsible', 'revamped', 'seclude', 'set', 'shortened', 'sing', 'sounded', 'stated', 'still', 'sung', 'supported', 'than', 'to', 'when', 'work']
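
A bigram is just a pair of adjacent items, so the lookup above can also be sketched without NLTK. This is a minimal Python 3 version; bigrams_plain and the toy_words list are hypothetical names and data I made up for the example.

```python
def bigrams_plain(tokens):
    """Return an iterator over pairs of adjacent tokens."""
    return zip(tokens, tokens[1:])

# Hand-made word list, only for illustration
toy_words = ['he', 'often', 'sings', 'and', 'often', 'reads',
             ',', 'often', 'sings']

# Collect the distinct words that directly follow 'often'
after_often = sorted(set(b for (a, b) in bigrams_plain(toy_words)
                         if a == 'often'))
print(after_often)
```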

Then get the tags of the words that follow 'often' and tabulate how frequent each tag is.

>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
>>> tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
  VN    V   VD  ADJ  DET  ADV    P    ,  CNJ    .   TO  VBZ   VG   WH
  15   12    8    5    5    4    4    3    3    1    1    1    1    1
>>>
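
The counting step itself is simple enough to sketch with collections.Counter. The example below is Python 3 and uses a hand-tagged toy list (toy_tagged is my own invention, with tags in the simplified style) instead of the Brown corpus.

```python
from collections import Counter

# Hand-made (word, tag) pairs, only for illustration
toy_tagged = [
    ('he', 'PRO'), ('often', 'ADV'), ('sings', 'V'),
    ('and', 'CNJ'), ('often', 'ADV'), ('reads', 'V'),
    ('it', 'PRO'), ('often', 'ADV'), ('happy', 'ADJ'),
]

# Pair each (word, tag) with its successor and keep the successor's tag
following_tags = [b[1] for (a, b) in zip(toy_tagged, toy_tagged[1:])
                  if a[0] == 'often']

tag_counts = Counter(following_tags)
for tag, count in tag_counts.most_common():
    print(tag, count)
```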

Find the pattern "verb + 'to' + verb" in tagged sentences.

>>> def process(sentence):
...     for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
...         if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
...             print w1, w2, w3
... 
>>> for tagged_sent in brown.tagged_sents():
...     process(tagged_sent)
... 
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
....
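
The same trigram match can be sketched in plain Python 3 by zipping a sentence with itself shifted by one and by two positions. The function name find_verb_to_verb and the hand-tagged toy_sent are assumptions for this example, not something taken from NLTK.

```python
def find_verb_to_verb(tagged_sent):
    """Return (w1, w2, w3) triples matching verb + 'to' + verb."""
    triples = zip(tagged_sent, tagged_sent[1:], tagged_sent[2:])
    return [
        (w1, w2, w3)
        for (w1, t1), (w2, t2), (w3, t3) in triples
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V')
    ]

# One hand-tagged sentence, only for illustration
toy_sent = [('She', 'PPS'), ('wanted', 'VBD'), ('to', 'TO'),
            ('wait', 'VB'), ('.', '.')]

matches = find_verb_to_verb(toy_sent)
for w1, w2, w3 in matches:
    print(w1, w2, w3)
```

Returning the matches instead of printing them makes it easy to count or filter them afterwards.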

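When I was learning English at school, it was tricky to tell "verb + 'to' + verb" apart from "verb + -ing". Now I can get a list of "verb + -ing" pairs like this. Note that the check only looks at the spelling of the second word, so nouns ending in "-ing" are picked up as well.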

>>> def process2(sentence):
...     for (w1, t1), (w2, t2) in nltk.bigrams(sentence):
...         if (t1.startswith('V') and w2.endswith('ing')):
...             print w1, w2
... 
>>> for tagged_sent in brown.tagged_sents():
...     process2(tagged_sent)
... 
provide enabling
pass enabling
give planning
spent providing
spent learning
improve nursing
propose increasing
....
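
Pairs like "give planning" and "improve nursing" suggest that matching on the spelling alone may be too loose. One possible variant, sketched here in plain Python 3, checks the tag of the second word instead. I am assuming the unsimplified Brown tag VBG (present participle or gerund), and the hand-tagged toy_sent is made up for illustration.

```python
# One hand-tagged sentence, only for illustration
toy_sent = [('They', 'PPSS'), ('love', 'VB'), ('spring', 'NN'),
            ('and', 'CC'), ('enjoy', 'VB'), ('swimming', 'VBG')]

pairs = list(zip(toy_sent, toy_sent[1:]))

# Original idea: second word merely ends in 'ing'
by_suffix = [(w1, w2) for (w1, t1), (w2, t2) in pairs
             if t1.startswith('V') and w2.endswith('ing')]

# Variant: second word must be tagged as VBG
by_tag = [(w1, w2) for (w1, t1), (w2, t2) in pairs
          if t1.startswith('V') and t2.startswith('VBG')]

print(by_suffix)
print(by_tag)
```

The suffix test also catches "love spring", while the tag test keeps only "enjoy swimming".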

The last one displays words that are used with many different POS tags.

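>>> # data is not defined in the original post; presumably built as in the NLTK book
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag) for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)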
... 
best ADJ ADV NP V
better ADJ ADV V DET
close ADV ADJ V N
cut V N VN VD
even ADV DET ADJ V
hit V VD VN N
lay ADJ V NP VD
left VD ADJ N VN
like CNJ V ADJ P
near P ADJ ADV DET
open ADJ V N ADV
past N ADJ DET P
present ADJ ADV N V
read V VN VD NP
right ADJ N DET ADV
second NUM ADV DET N
set VN V VD N

This picks up the words that have more than three tags.
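
For comparison, the same filter can be sketched without NLTK by collecting a set of tags per word. This is Python 3 with a hand-made toy_tagged list; lowercasing the words before grouping is my assumption about how the data should be prepared.

```python
from collections import defaultdict

# Hand-made (word, tag) pairs, only for illustration
toy_tagged = [
    ('set', 'V'), ('set', 'VD'), ('set', 'VN'), ('set', 'N'),
    ('Set', 'V'), ('year', 'N'), ('open', 'ADJ'), ('open', 'V'),
]

# Group the distinct tags seen for each lowercased word
tags_by_word = defaultdict(set)
for word, tag in toy_tagged:
    tags_by_word[word.lower()].add(tag)

# Keep only the words with more than three distinct tags
many_tags = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 3}
for word in sorted(many_tags):
    print(word, ' '.join(many_tags[word]))
```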