Automatic tagging (5.4-5.4.2)
Preparation: load the tagged and untagged sentences of the Brown corpus news category.
>>> import nltk
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
First, check which tag is used most frequently in the news category. It is 'NN'.
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'
Then use a default tagger to assign that most frequent tag to every token.
>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
Even though 'NN' is the most frequent tag, it accounts for only about 13% of the tokens, so the default tagger is correct only about 13% of the time.
>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
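The 13% figure is simply the share of tokens whose gold tag is 'NN'. A minimal sketch of that relationship follows, assuming a tiny hand-made corpus; `tagged_words` and `default_accuracy` are hypothetical names for illustration, not NLTK data or API.

```python
from collections import Counter

# Hypothetical toy corpus of (word, gold_tag) pairs; not taken from Brown.
tagged_words = [
    ("the", "AT"), ("jury", "NN"), ("said", "VBD"),
    ("the", "AT"), ("election", "NN"), ("was", "BEDZ"),
    ("fair", "JJ"), ("city", "NN"), (".", "."),
]

# Find the most frequent tag, as FreqDist(tags).max() does in the session above.
tag_counts = Counter(tag for _, tag in tagged_words)
most_common_tag, count = tag_counts.most_common(1)[0]


def default_accuracy(tagged, default_tag):
    """Share of tokens whose gold tag equals the single default tag."""
    hits = sum(1 for _, tag in tagged if tag == default_tag)
    return hits / len(tagged)


# A tagger that always answers with one tag is right exactly as often
# as that tag occurs in the gold data.
share = count / len(tagged_words)
accuracy = default_accuracy(tagged_words, most_common_tag)
print(most_common_tag, share, accuracy)
```

On this toy data the most frequent tag covers 3 of 9 tokens, so the default tagger's accuracy and the tag's share are the same number. The same reasoning explains the value measured on the Brown news sentences.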
Another approach is a regular-expression tagger, which chooses a tag from the spelling of each word.
>>> patterns = [
...     (r'.*ing$', 'VBG'),
...     (r'.*ed$', 'VBD'),
...     (r'.*es$', 'VBZ'),
...     (r'.*ould$', 'MD'),
...     (r'.*\'s$', 'NN$'),
...     (r'.*s$', 'NNS'),
...     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
...     (r'.*', 'NN')
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'),
('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'),
('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'),
('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'),
("''", 'NN'), ('.', 'NN')]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245
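This tagger assigns tags based on the spelling of each word. The patterns are tried in order and the first one that matches decides the tag; if none of the earlier patterns match, the final catch-all pattern assigns the default tag ('NN'). The rules are crude, which is why 'was' and 'this' are tagged 'NNS' just because they end in 's'. Even so, accuracy improved to about 20%.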
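The first-match-wins behaviour can be sketched with the standard `re` module alone. This is a hand-rolled illustration that reuses the pattern list from the session above, not NLTK's actual implementation; `regexp_tag` is a hypothetical helper name.

```python
import re

# Same patterns, in the same order, as in the session above.
# Note: the '.' in the number pattern is unescaped, so it matches any character.
PATTERNS = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r".*'s$", 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*', 'NN'),  # catch-all: every word matches, so this is the default
]


def regexp_tag(word, patterns=PATTERNS):
    """Return the tag of the first pattern that matches the word."""
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return None  # unreachable while the catch-all pattern is present


words = ['reports', 'was', 'received', 'considering', 'this', 'said', '3.14']
tagged = [(w, regexp_tag(w)) for w in words]
print(tagged)
```

Because order decides the outcome, 'was' and 'this' stop at the `.*s$` rule and never reach the catch-all, which matches the mis-tags seen in the output above. Reordering the list or adding more specific patterns before the general ones would change which rule fires.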