Choosing the Right Features (6.1.2)

In my understanding, this example tries to illustrate the "overfitting" situation.

>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...             features["count(%s)" % letter] = name.lower().count(letter)
...             features["has(%s)" % letter] = (letter in name.lower())
...     return features
... 
>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False, 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0, 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False, 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False, 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n', 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0, 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0, 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False, 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j', 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False, 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}

Then evaluate the result.

>>> featuresets = [(gender_features2(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.778

Interestingly, the result here was actually better than the previous one (0.758-->0.778). Still, the concept itself is not difficult to understand: if too many features are used in the training data, the classifier can latch onto idiosyncrasies of the training set and fail to generalize to the rest of the data.
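To see how quickly this feature set grows: gender_features2 emits 54 features per name (first letter, last letter, plus a count and a has flag for each of the 26 letters), compared with a single last-letter feature in the original extractor. A quick self-contained check, reusing the same function as above:

```python
def gender_features2(name):
    # Same kitchen-sink extractor as in the transcript above.
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

# 2 letter-position features + 26 counts + 26 flags = 54 features per name.
print(len(gender_features2('John')))  # 54
```

With ~8,000 names and 54 features each, many feature values are seen only a handful of times, which is exactly the setting where a classifier starts memorizing instead of generalizing.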

The next step is to split the sample data into smaller pieces, so that better features can be sought without contaminating the final evaluation.

>>> train_name = names[1500:]
>>> devtest_name = names[500:1500]
>>> test_names = names[:500]
>>> 
>>> train_set = [(gender_features(n), g) for (n,g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.77

The sample data was split into three parts: train_name (from index 1500 to the end), devtest_name (500 to 1500), and test_names (the first 500). Training is done with train_name, then evaluated with devtest_set.
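The same three-way slicing can be sanity-checked on any list. Here is a small sketch using made-up stand-in data (the real names come from the NLTK corpus), confirming the three slices partition the data without overlap:

```python
# Illustrative stand-in for the NLTK names corpus (made-up data).
names = [("Name%d" % i, "female" if i % 2 else "male") for i in range(2000)]

train_name = names[1500:]       # for fitting the classifier
devtest_name = names[500:1500]  # for error analysis while tuning features
test_names = names[:500]        # untouched until the final evaluation

# The three slices cover the whole list with no item in two sets.
assert len(train_name) + len(devtest_name) + len(test_names) == len(names)
```

Keeping test_names untouched matters: once the dev-test set has guided feature choices, only the test set can give an unbiased final accuracy.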

Then check out errors.

>>> errors = []
>>> for (name, tag) in devtest_name:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...             errors.append((tag, guess, name))
... 
>>> for (tag, guess, name) in sorted(errors):
...     print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)
... 
correct=female   guess=male     name=Abigael
correct=female   guess=male     name=Allyson
correct=female   guess=male     name=Alys
correct=female   guess=male     name=Angil
correct=female   guess=male     name=Arlyn
correct=female   guess=male     name=Aurel
correct=female   guess=male     name=Avis
correct=female   guess=male     name=Avril
correct=female   guess=male     name=Bell
correct=female   guess=male     name=Bev
correct=female   guess=male     name=Birgit
correct=female   guess=male     name=Bliss
correct=female   guess=male     name=Brandais
correct=female   guess=male     name=Brett
correct=female   guess=male     name=Brit
correct=female   guess=male     name=Brooks
correct=female   guess=male     name=Calypso
correct=female   guess=male     name=Carolann
correct=female   guess=male     name=Caroleen
correct=female   guess=male     name=Carolyn
correct=female   guess=male     name=Carrol
correct=female   guess=male     name=Caryl
correct=female   guess=male     name=Cat
correct=female   guess=male     name=Cathryn
correct=female   guess=male     name=Charis
correct=female   guess=male     name=Christan
correct=female   guess=male     name=Christean
correct=female   guess=male     name=Cindelyn
correct=female   guess=male     name=Consuelo
correct=female   guess=male     name=Cyb
correct=female   guess=male     name=Cybel
correct=female   guess=male     name=Daniel
correct=female   guess=male     name=Darb
correct=female   guess=male     name=Dawn
correct=female   guess=male     name=Delores
correct=female   guess=male     name=Devan
correct=female   guess=male     name=Devin
correct=female   guess=male     name=Diamond
correct=female   guess=male     name=Dorcas
correct=female   guess=male     name=Dot
correct=female   guess=male     name=Estel
correct=female   guess=male     name=Evaleen
correct=female   guess=male     name=Evangelin
correct=female   guess=male     name=Fanchon
correct=female   guess=male     name=Farrand
correct=female   guess=male     name=Fern
correct=female   guess=male     name=Flor
correct=female   guess=male     name=Gabriel
correct=female   guess=male     name=Gabriell
correct=female   guess=male     name=Gill
correct=female   guess=male     name=Ginnifer
correct=female   guess=male     name=Glad
correct=female   guess=male     name=Hazel
correct=female   guess=male     name=Ines
correct=female   guess=male     name=Ingaborg
correct=female   guess=male     name=Iris
correct=female   guess=male     name=Jennifer
correct=female   guess=male     name=Jill
correct=female   guess=male     name=Jillian
correct=female   guess=male     name=Jocelin
....
correct=male     guess=female   name=Wylie
correct=male     guess=female   name=Yehudi
correct=male     guess=female   name=Yule
correct=male     guess=female   name=Zechariah
correct=male     guess=female   name=Zeke
>>> 
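Eyeballing this list suggests that the ending of a name carries most of the signal (many misclassified female names end in consonants like -n or -l). One way to make that concrete is to tally the final two letters of the misclassified names. A pure-Python sketch; the errors list here is a small made-up sample in the same (correct, guess, name) shape, not the real output above:

```python
from collections import Counter

# Hypothetical sample in the same shape as the `errors` list built above.
sample_errors = [
    ('female', 'male', 'Allyson'), ('female', 'male', 'Carolyn'),
    ('female', 'male', 'Cathryn'), ('male', 'female', 'Yule'),
    ('male', 'female', 'Zeke'),
]

# Count the two-letter suffixes of the misclassified names.
suffix_counts = Counter(name[-2:].lower() for (_, _, name) in sample_errors)
print(suffix_counts.most_common(3))  # 'yn' shows up most often here
```

Suffixes that appear repeatedly among the errors are good candidates for new features, which motivates the two-character suffix feature tried next.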

Based on these errors, adjust the features. In this example, look at the last two characters of each name.

>>> def gender_features(word):
...     return {'suffix1': word[-1:], 'suffix2': word[-2:]}
... 
>>> train_set = [(gender_features(n), g) for (n, g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.786

The result improved by 1.6 percentage points (0.77-->0.786).
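A note on the suffix features themselves: they generalize the earlier last-letter feature, and for a one-letter name the two slices simply overlap rather than raising an error. A quick check of what the new gender_features returns:

```python
def gender_features(word):
    # Same suffix extractor as defined above.
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

print(gender_features('Shelly'))  # {'suffix1': 'y', 'suffix2': 'ly'}
print(gender_features('B'))       # one-letter name: both slices are 'B'
```

Since the dev-test set guided this feature choice, the honest final accuracy should come from one last run against the untouched test_names.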