Separating Training and Test data (5.5.2)

What a busy week! So today's topic is another short one.

>>> import nltk
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.8110236220472441

UnigramTagger takes already-tagged data as its parameter; this is called 'training'. Today's example uses 90% of the data for training, then tags the remaining 10% and evaluates the result.
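Under the hood, this kind of training just means counting which tag each word received most often in the training data. Here is a minimal from-scratch sketch of that idea, using a tiny hand-made corpus for illustration (this is not NLTK's actual implementation):

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical, standing in for Brown).
train = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")],
    [("his", "PP$"), ("runs", "NNS")],  # "runs" also appears as a noun
]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

# The "model" is just the most frequent tag per word.
model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(sentence):
    # Unseen words get None, much like an untrained UnigramTagger backoff.
    return [(w, model.get(w)) for w in sentence]

print(tag(["the", "dog", "runs"]))  # [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')]
print(tag(["a", "cat"]))            # 'a' was never seen, so it gets None
```

Since "runs" was tagged VBZ twice and NNS once in the training data, the model always tags it VBZ, which is exactly why a pure unigram tagger cannot use context.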

By the way, do we really need 90% of the data for training? I changed the ratio (to 0.5 and 0.2) and got the following results.

>>> size = int(len(brown_tagged_sents) * 0.5)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.7656880697816371
>>> size = int(len(brown_tagged_sents) * 0.2)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.6882990411506192

Accuracy drops from 0.81 to 0.77 and then to 0.69 as the training set shrinks, which seems not too bad.
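Instead of repeating the split by hand for every ratio, the experiment can be written as a loop. This sketch reuses a tiny most-frequent-tag model and a synthetic corpus (both hypothetical helpers, not NLTK's API) so it runs without downloading Brown; with NLTK installed, `corpus` could simply be replaced by `brown_tagged_sents` and the helpers by `nltk.UnigramTagger`:

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    # Most frequent tag per word, as in a unigram tagger.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_sents):
    # Fraction of test tokens whose predicted tag matches the gold tag.
    total = correct = 0
    for sent in tagged_sents:
        for word, gold in sent:
            total += 1
            if model.get(word) == gold:
                correct += 1
    return correct / total

# Synthetic tagged corpus standing in for brown_tagged_sents.
corpus = [[("w%d" % (i % 50), "NN"), ("runs", "VBZ")] for i in range(100)]

for ratio in (0.9, 0.5, 0.2):
    size = int(len(corpus) * ratio)
    model = train_unigram(corpus[:size])
    print(ratio, round(accuracy(model, corpus[size:]), 3))
```

The shape of the result is the same as above: the smaller the training portion, the more test words are unseen and mistagged.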