Separating Training and Test data (5.5.2)
What a busy week! So today's topic is also a short one.
>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.8110236220472441
UnigramTagger takes already-tagged data as a parameter. This is called 'training'. In today's example, 90% of the data is used for training; the tagger then tags the remaining 10%, and the result is evaluated against the known tags.
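In other words, 'training' a unigram tagger just means counting which tag each word receives most often in the tagged data. Here is a minimal pure-Python sketch of that idea (the toy sentences and the function name `train_unigram` are my own, not part of NLTK):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Learn the most frequent tag for each word in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Hypothetical toy stand-in for brown_tagged_sents
train_sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("dog", "NN"), ("ran", "VBD")],
    [("the", "AT"), ("cat", "NN"), ("fell", "VBD")],
]

model = train_unigram(train_sents)
print(model["dog"])  # -> 'NN', the only tag seen for "dog"
print(model["the"])  # -> 'AT'
```

A real UnigramTagger also handles unseen words (returning None) and can fall back to another tagger, but the core of training is just this counting step.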
By the way, do we really need 90% of the data for training? I changed the ratio (to 0.5 and then 0.2) and got the following results.
>>> size = int(len(brown_tagged_sents) * 0.5)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.7656880697816371
>>> size = int(len(brown_tagged_sents) * 0.2)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.6882990411506192
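The sweep above can also be scripted instead of re-typed at the prompt. Below is a hedged sketch using the same counting idea in pure Python; the helper names (`split_and_score`, `score`) and the toy corpus are my own, standing in for `brown_tagged_sents`, so the printed numbers just show the mechanics, not the Brown results above:

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Most-frequent-tag-per-word model, as a plain dict."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def score(model, tagged_sents):
    """Fraction of test tokens tagged correctly (unseen words count as wrong)."""
    pairs = [(w, t) for sent in tagged_sents for w, t in sent]
    return sum(1 for w, t in pairs if model.get(w) == t) / len(pairs)

def split_and_score(tagged_sents, ratio):
    size = int(len(tagged_sents) * ratio)
    model = train_unigram(tagged_sents[:size])
    return score(model, tagged_sents[size:])

# Hypothetical toy corpus; with NLTK you would pass brown_tagged_sents instead.
corpus = [
    [("the", "AT"), ("dog", "NN")],
    [("a", "AT"), ("cat", "NN")],
    [("the", "AT"), ("cat", "NN")],
    [("a", "AT"), ("dog", "NN")],
    [("the", "AT"), ("bird", "NN")],
]

for ratio in (0.9, 0.5, 0.2):
    print(ratio, split_and_score(corpus, ratio))
```

On a corpus this tiny the scores bounce around, but on real data the pattern matches the session above: less training data means more unseen words and lower accuracy.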
Seems not too bad.