Generic N-gram tagger (5.3.3)

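A unigram tagger assigns each word the tag it most likely carries, looking at that word in isolation. This is its limitation: the surrounding context is ignored. An n-gram tagger also takes into account the tags of the n-1 preceding words; nltk.BigramTagger (n = 2) uses the tag of the one previous word.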

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents = brown_tagged_sents[:size]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments',
'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace',
'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'),
('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'),
('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'),
('.', '.')]
>>> 

Compared with the unigram tagger's result, the tag of 'so' has changed from 'QL' to 'CS'.
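Why the context can change the tag is easier to see on a toy example. The following is only a minimal sketch, not NLTK's implementation: it counts tags per word (unigram) and per (previous tag, word) pair (bigram) on a made-up three-sentence corpus. The names toy_corpus, unigram_counts, bigram_counts and the "<s>" sentence-start marker are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: 'so' is tagged QL twice and CS once.
toy_corpus = [
    [("it", "PPS"), ("is", "BEZ"), ("so", "QL"), ("big", "JJ")],
    [("he", "PPS"), ("is", "BEZ"), ("so", "QL"), ("tall", "JJ")],
    [("ground", "NN"), ("floor", "NN"), ("so", "CS"), ("that", "CS")],
]

unigram_counts = defaultdict(Counter)  # word -> tag counts
bigram_counts = defaultdict(Counter)   # (previous tag, word) -> tag counts

for sent in toy_corpus:
    prev_tag = "<s>"  # assumed marker for "start of sentence"
    for word, tag in sent:
        unigram_counts[word][tag] += 1
        bigram_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag

# Unigram: the most frequent tag of 'so' overall.
unigram_choice = unigram_counts["so"].most_common(1)[0][0]
# Bigram: the most frequent tag of 'so' when the previous tag is NN.
bigram_choice = bigram_counts[("NN", "so")].most_common(1)[0][0]

print(unigram_choice, bigram_choice)
```

In this toy data the unigram count picks 'QL' because it is the more frequent tag overall, while the bigram count picks 'CS' because that is the only tag seen for 'so' after an 'NN'.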

>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'),
('Congo', 'NP'), ('is', 'BEZ'), ('13.5', None), ('million', None),
(',', None), ('divided', None), ('into', None), ('at', None),
('least', None), ('seven', None), ('major', None), ('``', None),
('culture', None), ('clusters', None), ("''", None), ('and', None),
('innumerable', None), ('tribes', None), ('speaking', None), ('400',
None), ('separate', None), ('dialects', None), ('.', None)]
>>> test_sents = brown_tagged_sents[size:]
>>> bigram_tagger.evaluate(test_sents)
0.10216286255357321

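When the bigram tagger meets a word that was not in the training data (here '13.5'), it cannot determine a tag and assigns None. From that point on every following word is also tagged None, even words seen in training such as 'million', because the training data never contained a word preceded by a None tag. This is why the accuracy on the test data is only about 10%.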
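The cascade of None tags can be reproduced with a small pure-Python model. This is a sketch under simplifying assumptions, not how NLTK is implemented internally: train_bigram, tag_bigram, toy_train and the "<s>" start marker are hypothetical names, and the training data is a single made-up sentence.

```python
from collections import Counter, defaultdict

START = "<s>"  # assumed marker for "start of sentence"

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_bigram(model, words):
    """Tag words left to right; an unseen context yields None."""
    tagged, prev_tag = [], START
    for word in words:
        tag = model.get((prev_tag, word))  # None if the context was never seen
        tagged.append((word, tag))
        prev_tag = tag  # a None here becomes the next word's context
    return tagged

# Hypothetical training data: 'million' is known, but only after a CD tag.
toy_train = [[("the", "AT"), ("population", "NN"), ("is", "BEZ"),
              ("seven", "CD"), ("million", "CD")]]
model = train_bigram(toy_train)

result = tag_bigram(model, ["the", "population", "is", "13.5", "million"])
print(result)
```

The unknown word '13.5' gets None, and 'million' then gets None as well: the model only knows the context ('CD', 'million'), never (None, 'million').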