Sequence Classification (6.1.6)

I started this series of typing out the book's examples (learning NLTK) in English, so I will keep writing in English for now. There is no deeper reason.

This sample adds the history of previously assigned tags to the features. The code is a bit too long to type at the interactive prompt, so I wrote it in a text file and saved it as pos_feat.py.

import nltk

def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
                
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        # list() keeps the result a list of (word, tag) pairs on Python 3 as well
        return list(zip(sentence, history))
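
To see how the history feeds into the features, here is a small sketch of the same left-to-right loop that tag() runs. It repeats pos_features so that it can run without NLTK. The toy_classify function and the three-word sentence are made up for illustration; toy_classify is only a hand-written stand-in for the trained classifier, not how NaiveBayesClassifier decides.

```python
def pos_features(sentence, i, history):
    # Same feature extractor as in pos_feat.py above.
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features


def toy_classify(featureset):
    # Hypothetical stand-in for self.classifier.classify(); not a trained model.
    if featureset["prev-tag"] == "<START>":
        return "AT"
    if featureset["suffix(2)"] == "ed":
        return "VBD"
    return "NN"


sentence = ["The", "dog", "barked"]
history = []
for i, word in enumerate(sentence):
    # Each step sees only the tags chosen so far (history[:i]).
    featureset = pos_features(sentence, i, history)
    history.append(toy_classify(featureset))

tagged = list(zip(sentence, history))
last_features = pos_features(sentence, 2, history)
print(tagged)
print(last_features)
```

The point of the sketch is that the tag picked for one word becomes the "prev-tag" feature of the next word, so an early mistake can carry forward through the sentence.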

Then import it in the interpreter.

>>> import pos_feat
>>> from pos_feat import *

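As a reminder, if you change the code you can load the new version of the script with reload(). Names that were brought in with from pos_feat import * still point at the old objects, so run that import again after reloading.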

>>> reload(pos_feat)
<module 'pos_feat' from '/Users/ken/Documents/workspace/NLTK Learning/scripts/pos_feat.pyc'>

The rest is the same as in the other examples.

>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> size = int(len(tagged_sents) * 0.1)
>>> train_sents, test_sents = tagged_sents[size:],tagged_sents[:size]
>>> tagger = ConsecutivePosTagger(train_sents)
>>> print tagger.evaluate(test_sents)
0.79796012981

The accuracy improved by less than one percentage point compared with the earlier version without the tag history.
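
For reference, the split and the score can be sketched by hand. As far as I understand, evaluate() reports the fraction of tokens whose predicted tag matches the gold tag; the tagging_accuracy helper below is my own illustration of that idea, not NLTK's implementation. The always_nn tagger and the two gold sentences are made up, and the split ratio is 0.5 only because the toy data is so small (the session above uses 0.1).

```python
def tagging_accuracy(tag_fn, gold_sents):
    # Hypothetical helper: per-token accuracy of tag_fn against gold tags.
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    return correct / float(total)


def always_nn(words):
    # Toy baseline tagger: tags every word as a noun.
    return [(word, "NN") for word in words]


gold_sents = [
    [("The", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("A", "AT"), ("cat", "NN")],
]

# Same slicing pattern as above: the first part is held out for testing.
size = int(len(gold_sents) * 0.5)
train_sents, test_sents = gold_sents[size:], gold_sents[:size]

accuracy = tagging_accuracy(always_nn, test_sents)
print(accuracy)
```

Because the score is counted per token, a gain of under one point on the Brown news test set still corresponds to a fair number of individual words being tagged differently.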