Document classification (6.1.3)

Construct a list of correctly labeled documents.

>>> import random
>>> import nltk
>>> from nltk.corpus import movie_reviews
>>> decoments = [(list(movie_reviews.words(fileid)), category)
...             for category in movie_reviews.categories()
...             for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(decoments)

Let's ignore my small typo here (decoments should be documents). It looks like this generates a list of (word list, category) pairs, one per review, doesn't it?
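
To check that reading without loading the corpus, here is a minimal sketch that builds the same kind of list from a tiny made-up corpus. The names toy_corpus and toy_documents and the sample texts are my own assumptions, not part of NLTK.

```python
import random

# Hypothetical stand-in for movie_reviews: category -> list of raw texts.
toy_corpus = {
    "pos": ["a wonderfully acted film", "outstanding and moving"],
    "neg": ["a waste of time", "dull plot and bad acting"],
}

# Same shape as the comprehension above: one (word_list, category) pair per document.
toy_documents = [(text.split(), category)
                 for category in sorted(toy_corpus)
                 for text in toy_corpus[category]]

random.seed(0)  # fixed seed only so the example is repeatable
random.shuffle(toy_documents)  # mix the categories before any train/test split

print(len(toy_documents))  # 4 documents in total
for words, category in toy_documents:
    print(category, words)
```

Each element is a tuple whose first item is a plain list of words and whose second item is the label, which is the structure the classifier code later unpacks as (d, c).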

Then check which words are used in a given document.

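>>> # word_features is not defined in these notes. Assumption: as in the NLTK
>>> # book, it holds the 2000 most frequent words (NLTK 2 / Python 2 style).
>>> all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
>>> word_features = all_words.keys()[:2000]
>>> def document_features(document):
...     document_words = set(document)  # set membership is faster than list membership
...     features = {}
...     for word in word_features:
...         features['contains(%s)' % word] = (word in document_words)
...     return features
... 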
>>> print document_features(movie_reviews.words('pos/cv957_8737.txt'))
{'contains(waste)': False, 'contains(lot)': False, 'contains(*)': True, 'contains(black)':
....

'contains(towards)': False, 'contains(smile)': False, 'contains(cross)': False}
>>> 

Now I get it. The purpose is to learn which words tend to be used in a positive or a negative context.
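
To make the feature dictionary concrete, here is a small sketch of the same idea on a three-word vocabulary. The names toy_word_features and toy_document_features are hypothetical, and the vocabulary is hand-picked rather than taken from the corpus.

```python
# Hypothetical, hand-picked vocabulary; the real word_features is much larger.
toy_word_features = ["outstanding", "waste", "plot"]


def toy_document_features(document, word_features):
    """Map a list of words to {'contains(w)': bool} for each w in word_features."""
    document_words = set(document)  # set lookup avoids rescanning the list
    return {"contains(%s)" % w: (w in document_words) for w in word_features}


features = toy_document_features("an outstanding plot".split(), toy_word_features)
print(features)
```

Every document gets the same keys, and only the True/False values differ. That fixed set of keys is what lets the classifier compare how often each word shows up in positive versus negative reviews.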

>>> featuresets = [(document_features(d), c) for (d, c) in decoments]
>>> train_set, test_set = featuresets[100:], featuresets[:100]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.82
>>> classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.1 : 1.0
         contains(mulan) = True              pos : neg    =      8.5 : 1.0
        contains(seagal) = True              neg : pos    =      7.3 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.6 : 1.0
         contains(damon) = True              pos : neg    =      6.0 : 1.0
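
The ratios in the last column can be read as "how much more likely is this feature in one category than in the other". Here is a rough sketch of that arithmetic with made-up counts. The function likelihood_ratio and the numbers are assumptions for illustration only, and as far as I know NLTK smooths its probability estimates, so its real figures will not match a plain division like this.

```python
def likelihood_ratio(count_pos, total_pos, count_neg, total_neg):
    """Ratio of P(feature | pos) to P(feature | neg), with no smoothing."""
    p_pos = count_pos / float(total_pos)
    p_neg = count_neg / float(total_neg)
    return p_pos / p_neg


# Hypothetical counts: the word occurs in 40 of 1000 positive reviews
# and in 4 of 1000 negative reviews.
ratio = likelihood_ratio(40, 1000, 4, 1000)
print("pos : neg = %.1f : 1.0" % ratio)  # pos : neg = 10.0 : 1.0
```

Read this way, a line such as "pos : neg = 10.1 : 1.0" says the feature is about ten times more likely to be seen in a positive review than in a negative one.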