Non-simplified tags (5.2.7-)
Looking at nouns in more detail. This program shows the five most frequent words for each noun tag.
>>> def findtags(tag_prefix, tagged_text):
...     cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
...                                   if tag.startswith(tag_prefix))
...     return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
...
>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]
...
NN ['year', 'time', 'state', 'week', 'home']
NN$ ["year's", "world's", "state's", "city's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "Administration's", "Army's", "Gallery's", "League's"]
NN-HL ['Question', 'Salary', 'business', 'condition', 'cut']
NN-NC ['aya', 'eva', 'ova']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'Basin', 'Beat', 'City', 'Commissioner']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "janitors'", "men's", "builders'"]
NNS$-HL ["Dealers'", "Idols'"]
NNS$-TL ["Women's", "States'", "Giants'", "Bombers'", "Braves'"]
NNS-HL ['$12,500', '$14', '$37', 'A135', 'Arms']
NNS-TL ['States', 'Nations', 'Masters', 'Bears', 'Communists']
NNS-TL-HL ['Nations']
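The transcript above relies on NLTK 2 behaviour, where the keys of a frequency distribution come back sorted by decreasing frequency, so `keys()[:5]` yields the top five. As a rough illustration of the same idea without NLTK, here is a minimal Python 3 sketch using only the standard library. The name `find_tags` and the tiny sample list are made up for illustration and are not Brown Corpus data.

```python
from collections import Counter, defaultdict

def find_tags(tag_prefix, tagged_text, n=5):
    """Map each tag starting with tag_prefix to its n most frequent words."""
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    # most_common(n) returns (word, count) pairs, most frequent first.
    return {tag: [w for w, _ in c.most_common(n)] for tag, c in counts.items()}

# Hypothetical hand-made sample, not real corpus data.
noun_sample = [("year", "NN"), ("time", "NN"), ("year", "NN"), ("years", "NNS"),
               ("year's", "NN$"), ("said", "VBD"), ("President", "NN-TL")]

tagdict = find_tags("NN", noun_sample)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```

Because the prefix test is `startswith`, every tag beginning with NN is collected, including the possessive, plural, title and headline variants.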
Next, check which words come after 'often'.
>>> brown_learned_text = brown.words(categories='learned')
>>> sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',
'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose',
'classified', 'colorful', 'composed', 'contain', 'differed', 'difficult',
'encountered', 'enough', 'equate', 'extremely', 'found', 'happens', 'have',
'ignored', 'in', 'involved', 'more', 'needed', 'nightly', 'observed', 'of',
'on', 'out', 'quite', 'represent', 'responsible', 'revamped', 'seclude', 'set',
'shortened', 'sing', 'sounded', 'stated', 'still', 'sung', 'supported', 'than',
'to', 'when', 'work']
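The bigram trick can be tried without a corpus. The sketch below pairs each word with its successor using `zip`, assuming the words are in a plain Python list. The function name `words_after` and the sample sentence are invented for illustration.

```python
def words_after(target, words):
    """Return the sorted set of words that immediately follow target."""
    # zip(words, words[1:]) yields consecutive pairs, i.e. bigrams.
    return sorted({b for a, b in zip(words, words[1:]) if a == target})

# Hypothetical sample text, not taken from the Brown Corpus.
sample_words = "it is often found that results often differ and often found again".split()
followers = words_after("often", sample_words)
print(followers)
```

Using a set removes duplicates, so a word that follows 'often' several times appears only once in the result.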
Next, get the tags of the words that follow 'often' and tabulate how often each tag occurs.
>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
>>> tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
  VN    V   VD  ADJ  DET  ADV    P    ,  CNJ    .   TO  VBZ   VG   WH
  15   12    8    5    5    4    4    3    3    1    1    1    1    1
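The same counting step can be sketched with `collections.Counter` in place of `nltk.FreqDist`. This is only an illustration under simple assumptions: the input is a plain list of (word, tag) pairs, and the name `tags_after` and the sample data are hypothetical.

```python
from collections import Counter

def tags_after(target, tagged_words):
    """Count the tags of the words that immediately follow target."""
    pairs = zip(tagged_words, tagged_words[1:])
    return Counter(tag for (w1, _), (_, tag) in pairs if w1 == target)

# Hypothetical hand-tagged sample using simplified-style tags.
often_sample = [("it", "PRO"), ("is", "V"), ("often", "ADV"), ("found", "VN"),
                ("and", "CNJ"), ("often", "ADV"), ("called", "VN"),
                ("or", "CNJ"), ("often", "ADV"), ("difficult", "ADJ")]

tag_counts = tags_after("often", often_sample)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

In the transcript above, the verb tags VN, V and VD lead the table, which suggests that 'often' is most commonly followed by a verb in this category.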
Find patterns of the form "verb + 'to' + verb".
>>> def process(sentence):
...     for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
...         if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
...             print w1, w2, w3
...
>>> for tagged_sent in brown.tagged_sents():
...     process(tagged_sent)
...
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
....
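For experimenting with the pattern on a single sentence, a variation that returns the matches instead of printing them may be handier. The following Python 3 sketch builds trigrams with `zip`. The name `verb_to_verb` and the sample sentence with its tags are assumptions made for this example.

```python
def verb_to_verb(tagged_sent):
    """Return (w1, w2, w3) triples where a verb is followed by TO and a verb."""
    triples = zip(tagged_sent, tagged_sent[1:], tagged_sent[2:])
    return [(w1, w2, w3)
            for (w1, t1), (w2, t2), (w3, t3) in triples
            if t1.startswith("V") and t2 == "TO" and t3.startswith("V")]

# Hypothetical hand-tagged sentence; the second "to" is a preposition (IN).
to_sample = [("They", "PPSS"), ("wanted", "VBD"), ("to", "TO"), ("wait", "VB"),
             ("and", "CC"), ("went", "VBD"), ("to", "IN"), ("town", "NN")]

to_matches = verb_to_verb(to_sample)
print(to_matches)
```

The test on the tag rather than on the word 'to' is what keeps "went to town" out of the result.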
When I was learning English at school, it was tricky to tell "verb + 'to' + verb" apart from "verb + -ing". Now I can get a list of "verb + -ing" pairs like this.
>>> def process2(sentence):
...     for (w1, t1), (w2, t2) in nltk.bigrams(sentence):
...         if (t1.startswith('V') and w2.endswith('ing')):
...             print w1, w2
...
>>> for tagged_sent in brown.tagged_sents():
...     process2(tagged_sent)
...
provide enabling
pass enabling
give planning
spent providing
spent learning
improve nursing
propose increasing
....
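Note that `process2` tests the spelling of the second word, so any word ending in "ing" matches, whether or not it is a verb form. One possible variation, shown here only as a sketch, is to test the tag instead. It assumes the Brown tag for present participles and gerunds starts with VBG. The name `verb_then_ing`, the `use_tag` parameter and the sample sentence are all hypothetical.

```python
def verb_then_ing(tagged_sent, use_tag=False):
    """Return (verb, next_word) pairs where next_word looks like an -ing form.

    use_tag=False checks the spelling, as process2 does;
    use_tag=True checks that the tag starts with VBG instead.
    """
    result = []
    for (w1, t1), (w2, t2) in zip(tagged_sent, tagged_sent[1:]):
        if not t1.startswith("V"):
            continue
        is_ing = t2.startswith("VBG") if use_tag else w2.endswith("ing")
        if is_ing:
            result.append((w1, w2))
    return result

# Hypothetical hand-tagged sentence; "nothing" ends in "ing" but is not a verb.
ing_sample = [("They", "PPSS"), ("enjoy", "VB"), ("swimming", "VBG"),
              ("but", "CC"), ("saw", "VBD"), ("nothing", "PN")]

by_suffix = verb_then_ing(ing_sample)
by_tag = verb_then_ing(ing_sample, use_tag=True)
print(by_suffix)
print(by_tag)
```

The suffix test is simpler and works on any tagset, while the tag test trades that generality for fewer false matches.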
The last example displays words that are used as many different parts of speech (POS).
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
close ADV ADJ V N
cut V N VN VD
even ADV DET ADJ V
hit V VD VN N
lay ADJ V NP VD
left VD ADJ N VN
like CNJ V ADJ P
near P ADJ ADV DET
open ADJ V N ADV
past N ADJ DET P
present ADJ ADV N V
read V VN VD NP
right ADJ N DET ADV
second NUM ADV DET N
set VN V VD N
This picks out the words that have more than three tags.
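The grouping step behind this example can be sketched without NLTK by collecting the set of tags seen for each lowercased word. The function name `ambiguous_words`, its `min_tags` parameter and the sample data are made up for illustration. Unlike the transcript, this version returns the tags in alphabetical order rather than by frequency.

```python
from collections import defaultdict

def ambiguous_words(tagged_words, min_tags=4):
    """Map each word seen with at least min_tags distinct tags to those tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_words:
        # Lowercase so that "Close" and "close" count as the same word.
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(t) for w, t in tags_by_word.items() if len(t) >= min_tags}

# Hypothetical hand-tagged sample, not real corpus data.
pos_sample = [("Close", "ADV"), ("close", "ADJ"), ("close", "V"), ("close", "N"),
              ("year", "N"), ("like", "V"), ("like", "P")]

many_tags = ambiguous_words(pos_sample)
for word, tags in many_tags.items():
    print(word, " ".join(tags))
```

A threshold of four distinct tags corresponds to the `len(data[word]) > 3` test in the transcript.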