Supervised classification (6.1-6.1.1)

Starting a new chapter: Chapter 6 of the whale book.

In Chapter 2.4 we learned that there is some relationship between the last character of a first name and gender. I am going to use the same example here.

This function returns the last character of a name as a feature.

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
... 
>>> gender_features('Shrek')
{'last_letter': 'k'}

Next, import the Names corpus, label each name with its gender, and shuffle the entries.

>>> import nltk
>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'feamale') for name in names.words('female.txt')])
>>> random.shuffle(names)

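Note: 'feamale' in the code above is a typo for 'female'. The misspelled label carries through to the outputs below, where the table shortens it to 'feamal'.

After that, pair the last character (extracted with gender_features()) with the gender of each name. The first 500 records are used for testing and the remaining ones for training.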
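
The label, shuffle, and split steps can be tried out without the corpus. The following is only a sketch in Python 3: the eight names are a made-up stand-in for the Names corpus, and the seed value 42 is an arbitrary assumption added so that the shuffle gives the same order on every run.

```python
import random

# Hypothetical toy data standing in for the Names corpus.
male = ['Neo', 'Frank', 'Jacob', 'Mark']
female = ['Trinity', 'Anna', 'Sophia', 'Maria']

# Label every name, exactly as in the session above.
labeled = ([(name, 'male') for name in male] +
           [(name, 'female') for name in female])

# A dedicated, seeded generator makes the shuffle reproducible.
rng = random.Random(42)
rng.shuffle(labeled)

# Hold out the first 2 entries for testing and train on the rest.
test_names, train_names = labeled[:2], labeled[2:]
print(len(train_names), len(test_names))
```

Shuffling before the split matters because the labeled list starts with all male names followed by all female names; without it, the held-out slice would contain only one gender.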

>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> classifier.classify(gender_features('Neo'))
'male'
>>> classifier.classify(gender_features('Trinity'))
'feamale'
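
To get a feeling for what the classifier does with a single feature, here is a minimal naive Bayes sketch in plain Python 3. It is not NLTK's implementation: the functions train() and classify(), the toy names, and the add-0.5 smoothing constant are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def gender_features(word):
    return {'last_letter': word[-1]}

def train(labeled_featuresets, alpha=0.5):
    """Count labels and (label, last_letter) pairs."""
    label_counts = Counter()
    letter_counts = defaultdict(Counter)
    letters = set()
    for features, label in labeled_featuresets:
        letter = features['last_letter']
        label_counts[label] += 1
        letter_counts[label][letter] += 1
        letters.add(letter)
    return label_counts, letter_counts, letters, alpha

def classify(model, features):
    """Return the label with the highest prior * smoothed likelihood."""
    label_counts, letter_counts, letters, alpha = model
    letter = features['last_letter']
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, n in label_counts.items():
        prior = n / total
        # Add-alpha smoothing keeps unseen letters from scoring zero.
        likelihood = ((letter_counts[label][letter] + alpha) /
                      (n + alpha * len(letters)))
        score = prior * likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
toy = [('Frank', 'male'), ('Mark', 'male'), ('Jack', 'male'), ('Leo', 'male'),
       ('Anna', 'female'), ('Maria', 'female'), ('Sophia', 'female'),
       ('Emily', 'female')]
model = train([(gender_features(n), g) for n, g in toy])
print(classify(model, gender_features('Erik')))
print(classify(model, gender_features('Olivia')))
```

With only one feature, the decision comes down to which gender has seen the name's last letter more often, weighted by how common each gender is in the training data.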

Evaluate the classifier's accuracy on test_set, then list the most informative features.

>>> print nltk.classify.accuracy(classifier, test_set)
0.756
>>> classifier.show_most_informative_features(5)
Most Informative Features
             last_letter = 'k'              male : feamal =     45.8 : 1.0
             last_letter = 'a'            feamal : male   =     41.2 : 1.0
             last_letter = 'f'              male : feamal =     16.6 : 1.0
             last_letter = 'p'              male : feamal =     11.9 : 1.0
             last_letter = 'v'              male : feamal =     10.5 : 1.0
>>> 

This result clearly says that a name ending in 'k' is most likely male and a name ending in 'a' is most likely female. But when I listed more features afterwards, my impression changed slightly.

>>> classifier.show_most_informative_features(26)
Most Informative Features
             last_letter = 'k'              male : feamal =     45.8 : 1.0
             last_letter = 'a'            feamal : male   =     41.2 : 1.0
             last_letter = 'f'              male : feamal =     16.6 : 1.0
             last_letter = 'p'              male : feamal =     11.9 : 1.0
             last_letter = 'v'              male : feamal =     10.5 : 1.0
             last_letter = 'm'              male : feamal =     10.1 : 1.0
             last_letter = 'd'              male : feamal =      9.0 : 1.0
             last_letter = 'o'              male : feamal =      7.7 : 1.0
             last_letter = 'r'              male : feamal =      7.2 : 1.0
             last_letter = 'w'              male : feamal =      5.8 : 1.0
             last_letter = 'g'              male : feamal =      5.3 : 1.0
             last_letter = 't'              male : feamal =      4.3 : 1.0
             last_letter = 's'              male : feamal =      4.1 : 1.0
             last_letter = 'b'              male : feamal =      4.1 : 1.0
             last_letter = 'z'              male : feamal =      4.0 : 1.0
             last_letter = 'j'              male : feamal =      4.0 : 1.0
             last_letter = 'i'            feamal : male   =      3.6 : 1.0
             last_letter = 'u'              male : feamal =      2.2 : 1.0
             last_letter = 'n'              male : feamal =      2.1 : 1.0
             last_letter = 'e'            feamal : male   =      1.9 : 1.0
             last_letter = 'l'              male : feamal =      1.8 : 1.0
             last_letter = 'h'              male : feamal =      1.5 : 1.0
             last_letter = 'x'              male : feamal =      1.4 : 1.0
             last_letter = 'y'              male : feamal =      1.2 : 1.0

Apart from a few exceptions (a, e, i), most letters lean towards male. In other words, female names tend to end in a small set of specific characters such as a, e and i, while male names end in a much wider range of characters. Anyway, this is just my impression.

As additional information, apply_features can be used to avoid high memory consumption, because it returns a list-like object that does not keep all the feature sets in memory at once.
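
The ratios in the tables above compare how likely a last letter is under one label versus the other. A rough sketch of that calculation in Python 3 follows. The function likelihood_ratios(), the toy counts, and the add-0.5 smoothing are assumptions for illustration, so the numbers will not match NLTK's output.

```python
from collections import Counter

def likelihood_ratios(labeled_letters, alpha=0.5):
    """Rank letters by P(letter | likelier label) / P(letter | other label)."""
    labels = sorted({label for _, label in labeled_letters})
    letters = sorted({letter for letter, _ in labeled_letters})
    counts = {label: Counter() for label in labels}
    for letter, label in labeled_letters:
        counts[label][letter] += 1

    def prob(letter, label):
        # Smoothed estimate of P(letter | label).
        total = sum(counts[label].values())
        return (counts[label][letter] + alpha) / (total + alpha * len(letters))

    ratios = []
    for letter in letters:
        probs = {label: prob(letter, label) for label in labels}
        hi = max(probs, key=probs.get)
        lo = min(probs, key=probs.get)
        ratios.append((letter, hi, lo, probs[hi] / probs[lo]))
    return sorted(ratios, key=lambda r: r[3], reverse=True)

# Hypothetical toy counts of (last_letter, gender) pairs.
data = ([('k', 'male')] * 9 + [('k', 'female')] * 1 +
        [('a', 'male')] * 1 + [('a', 'female')] * 9 +
        [('e', 'male')] * 3 + [('e', 'female')] * 5)

for letter, hi, lo, ratio in likelihood_ratios(data):
    print("last_letter = %r  %6s : %-6s = %5.1f : 1.0" % (letter, hi, lo, ratio))
```

A letter that is rare under one label and common under the other gets a large ratio, which is why 'k' and 'a' come out on top, while a letter shared by both genders stays close to 1.0.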

>>> from nltk.classify import apply_features
>>> train_set = apply_features(gender_features, names[500:])
>>> test_set = apply_features(gender_features, names[:500])
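
The idea behind the memory saving can be sketched with a small list-like wrapper that computes features only when an item is accessed. The class LazyFeatures below is hypothetical and written in Python 3 for illustration; it is not how NLTK implements apply_features.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

class LazyFeatures(object):
    """Hypothetical list-like view that computes features on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view, not a list of features.
            return LazyFeatures(self._func, self._tokens[index])
        token, label = self._tokens[index]
        # The feature dictionary is built here and not stored.
        return (self._func(token), label)

# Hypothetical toy data.
toy_names = [('Neo', 'male'), ('Trinity', 'female'), ('Mark', 'male')]
lazy = LazyFeatures(gender_features, toy_names)
print(len(lazy))
print(lazy[1])
for features, label in lazy[1:]:
    print(features, label)
```

Only the original (name, gender) pairs stay in memory; each feature dictionary exists just for as long as the caller holds on to it.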