Document classification (6.1.3)
First, construct a list of correctly labeled documents.
>>> import random
>>> from nltk.corpus import movie_reviews
>>> decoments = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(decoments)
Let's ignore my small typo here (decoments should be documents). It looks like this builds a list of word lists, one per review, each paired with its category, doesn't it?
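To see the shape of the result without downloading the corpus, here is a minimal sketch. The `toy_corpus` dictionary is a hypothetical stand-in for movie_reviews (it is not part of NLTK), and the words in it are made up.

```python
import random

# Hypothetical stand-in for movie_reviews: category -> {fileid: words}
toy_corpus = {
    "pos": {"pos/a.txt": ["a", "wonderfully", "acted", "film"]},
    "neg": {"neg/b.txt": ["a", "waste", "of", "time"],
            "neg/c.txt": ["dull", "and", "slow"]},
}

# Same nested comprehension as above: one (words, category) pair per file
documents = [(list(words), category)
             for category in toy_corpus
             for words in toy_corpus[category].values()]

random.seed(0)  # fixed seed only so the sketch is repeatable
random.shuffle(documents)

for words, category in documents:
    print(category, words)
```

Each element is a tuple: the review as a list of words, followed by its label. Shuffling mixes the pos and neg reviews so that a later train/test split does not end up with only one category.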
Then check which of the feature words are used in a given document.
>>> # word_features is assumed to have been defined earlier in the session
>>> def document_features(document):
...     document_words = set(document)  # set membership tests are fast
...     features = {}
...     for word in word_features:
...         features['contains(%s)' % word] = (word in document_words)
...     return features
...
>>> print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{'contains(waste)': False, 'contains(lot)': False, 'contains(*)': True, 'contains(black)': .... 'contains(towards)': False, 'contains(smile)': False, 'contains(cross)': False}
Now I get it. The purpose is to evaluate whether each word tends to be used in a positive or a negative context.
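The feature extractor is easier to follow on a tiny input. The sketch below is a variant of document_features that takes the word list as a second parameter instead of reading a global (that signature is my assumption, not the book's), and the feature words and review are made up for illustration.

```python
def document_features(document, word_features):
    """Map each candidate word to True/False: does it occur in the document?"""
    document_words = set(document)  # set lookup avoids rescanning the list
    return {"contains(%s)" % word: (word in document_words)
            for word in word_features}


# Hypothetical feature words and review, for illustration only
word_features = ["waste", "outstanding", "film"]
review = ["an", "outstanding", "film"]

features = document_features(review, word_features)
print(features)
```

Here contains(waste) comes out False and the other two come out True. Words in the review that are not in word_features ("an") are simply ignored, so every document yields a dictionary with the same keys.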
>>> import nltk
>>> featuresets = [(document_features(d), c) for (d, c) in decoments]
>>> train_set, test_set = featuresets[100:], featuresets[:100]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.82
>>> classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.1 : 1.0
         contains(mulan) = True              pos : neg    =      8.5 : 1.0
        contains(seagal) = True              neg : pos    =      7.3 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.6 : 1.0
         contains(damon) = True              pos : neg    =      6.0 : 1.0
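The ratios in that table compare how likely a feature value is in one class versus the other. As a rough sketch of the idea (not NLTK's actual implementation), the ratio for contains(w) = True can be estimated from per-class document counts. The function name `likelihood_ratio`, the counts, and the add-0.5 smoothing are all assumptions chosen for illustration.

```python
def likelihood_ratio(count_pos, total_pos, count_neg, total_neg, smoothing=0.5):
    """Return (favored_label, ratio) for a binary feature being True.

    count_*:   documents in the class where the feature is True
    total_*:   all documents in the class
    smoothing: additive smoothing over the two values True/False (assumed)
    """
    p_pos = (count_pos + smoothing) / (total_pos + 2 * smoothing)
    p_neg = (count_neg + smoothing) / (total_neg + 2 * smoothing)
    if p_pos >= p_neg:
        return "pos", p_pos / p_neg
    return "neg", p_neg / p_pos


# Made-up counts: the word appears in 40 of 950 pos and 4 of 950 neg reviews
label, ratio = likelihood_ratio(40, 950, 4, 950)
print("%s favored, ratio %.1f : 1.0" % (label, ratio))
```

With these made-up counts the word is about nine times more likely to appear in a positive review than in a negative one, which is the kind of statement a row like "pos : neg = 10.1 : 1.0" makes.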