Choosing the Right Features (6.1.2)
In my understanding, this example is meant to illustrate the "overfitting" situation.
>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         features["count(%s)" % letter] = name.lower().count(letter)
...         features["has(%s)" % letter] = (letter in name.lower())
...     return features
...
>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False,
 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0,
 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False,
 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False,
 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n',
 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0,
 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0,
 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False,
 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j',
 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False,
 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}
Then train a classifier on these features and evaluate it.
>>> featuresets = [(gender_features2(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.778
Interestingly, the result was actually better than the previous one (0.758 --> 0.778). Still, the concept itself is not difficult to understand: if too many features are specified relative to the amount of training data, the classifier can latch onto idiosyncrasies of the training set and fail to generalize to the rest of the sample data.
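One way to see why this extractor is risky is simply to count its output: it emits 2 + 26 x 2 = 54 features per name, far more than the handful that actually carry signal. A small standalone check (the extractor is repeated here so it runs without NLTK):

```python
def gender_features2(name):
    # Rich feature extractor from the example above
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

# 2 name-boundary features plus a count/has pair for each of 26 letters
print(len(gender_features2('John')))  # 54
```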
The next step splits the sample data further, so that error analysis on a development set can guide the search for better features.
>>> train_name = names[1500:]
>>> devtest_name = names[500:1500]
>>> test_names = names[:500]
>>>
>>> train_set = [(gender_features(n), g) for (n, g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.77
The sample data was split into three parts: train_name (index 1500 to the end), devtest_name (500 to 1500), and test_names (the first 500). Training is done on train_name, then the classifier is evaluated on devtest_set.
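The three-way split itself is plain list slicing and can be sketched without NLTK; the names list below is a hypothetical stand-in for the shuffled (name, gender) pairs from the corpus:

```python
# Hypothetical stand-in for the shuffled (name, gender) pairs from nltk.corpus.names
names = [('Name%d' % i, 'male' if i % 2 else 'female') for i in range(2000)]

test_names   = names[:500]      # held out until the very end
devtest_name = names[500:1500]  # used for error analysis
train_name   = names[1500:]     # used to fit the classifier

# The three slices partition the data with no overlap
print(len(train_name), len(devtest_name), len(test_names))  # 500 1000 500
```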
Then inspect the errors the classifier makes on the development set.
>>> errors = []
>>> for (name, tag) in devtest_name:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))
...
>>> for (tag, guess, name) in sorted(errors):
...     print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)
...
correct=female   guess=male     name=Abigael
correct=female   guess=male     name=Allyson
correct=female   guess=male     name=Alys
correct=female   guess=male     name=Angil
correct=female   guess=male     name=Arlyn
correct=female   guess=male     name=Aurel
correct=female   guess=male     name=Avis
correct=female   guess=male     name=Avril
correct=female   guess=male     name=Bell
correct=female   guess=male     name=Bev
correct=female   guess=male     name=Birgit
correct=female   guess=male     name=Bliss
correct=female   guess=male     name=Brandais
correct=female   guess=male     name=Brett
correct=female   guess=male     name=Brit
correct=female   guess=male     name=Brooks
correct=female   guess=male     name=Calypso
correct=female   guess=male     name=Carolann
correct=female   guess=male     name=Caroleen
correct=female   guess=male     name=Carolyn
correct=female   guess=male     name=Carrol
correct=female   guess=male     name=Caryl
correct=female   guess=male     name=Cat
correct=female   guess=male     name=Cathryn
correct=female   guess=male     name=Charis
correct=female   guess=male     name=Christan
correct=female   guess=male     name=Christean
correct=female   guess=male     name=Cindelyn
correct=female   guess=male     name=Consuelo
correct=female   guess=male     name=Cyb
correct=female   guess=male     name=Cybel
correct=female   guess=male     name=Daniel
correct=female   guess=male     name=Darb
correct=female   guess=male     name=Dawn
correct=female   guess=male     name=Delores
correct=female   guess=male     name=Devan
correct=female   guess=male     name=Devin
correct=female   guess=male     name=Diamond
correct=female   guess=male     name=Dorcas
correct=female   guess=male     name=Dot
correct=female   guess=male     name=Estel
correct=female   guess=male     name=Evaleen
correct=female   guess=male     name=Evangelin
correct=female   guess=male     name=Fanchon
correct=female   guess=male     name=Farrand
correct=female   guess=male     name=Fern
correct=female   guess=male     name=Flor
correct=female   guess=male     name=Gabriel
correct=female   guess=male     name=Gabriell
correct=female   guess=male     name=Gill
correct=female   guess=male     name=Ginnifer
correct=female   guess=male     name=Glad
correct=female   guess=male     name=Hazel
correct=female   guess=male     name=Ines
correct=female   guess=male     name=Ingaborg
correct=female   guess=male     name=Iris
correct=female   guess=male     name=Jennifer
correct=female   guess=male     name=Jill
correct=female   guess=male     name=Jillian
correct=female   guess=male     name=Jocelin
....
correct=male     guess=female   name=Wylie
correct=male     guess=female   name=Yehudi
correct=male     guess=female   name=Yule
correct=male     guess=female   name=Zechariah
correct=male     guess=female   name=Zeke
>>>
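A handy next step, not shown in the original, is to tally which name endings account for the most errors; the endings that dominate the tally suggest which suffix features to try. A standalone sketch using only the standard library (the errors sample below is hypothetical, standing in for the full list built above):

```python
from collections import Counter

# Hypothetical sample of (correct, guess, name) triples like those printed above
errors = [('female', 'male', 'Carolyn'), ('female', 'male', 'Cathryn'),
          ('female', 'male', 'Dawn'),    ('male', 'female', 'Wylie')]

# Count the final two characters of each misclassified name
suffix_counts = Counter(name[-2:] for (correct, guess, name) in errors)
for suffix, count in suffix_counts.most_common():
    print('%-4s %d' % (suffix, count))
```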
Based on these errors, adjust the features. In this example, the extractor is changed to look at the last two characters of each name.
>>> def gender_features(word):
...     return {'suffix1': word[-1:], 'suffix2': word[-2:]}
...
>>> train_set = [(gender_features(n), g) for (n, g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.786
The result improved by 1.6 points (0.77 --> 0.786). The remaining step would be a final check against the untouched test_names, which has not influenced any of these decisions.
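One small detail worth noting about the improved extractor: Python's slicing makes it safe even for one-letter names, since word[-2:] on a single character simply returns that character rather than raising an error. A standalone sketch:

```python
def gender_features(word):
    # Final extractor from the example: last one and two characters
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

print(gender_features('Cat'))  # {'suffix1': 't', 'suffix2': 'at'}
print(gender_features('J'))    # {'suffix1': 'J', 'suffix2': 'J'}
```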