Sequence Classification (6.1.6)
I started this series (working through the NLTK book by typing out its examples) in English, so I will keep writing in English for now. There is no deeper reason.
This example extends the feature extractor with history, that is, the tags already assigned to the preceding words in the sentence. Because the code is a little too long to type at the interactive prompt, I wrote it in a text file and saved it as pos_feat.py.
import nltk

def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)
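To see what the history argument contributes, the feature extractor can be called by hand. This is only a small sketch: pos_features is copied from pos_feat.py so the snippet runs without NLTK, and the sample sentence and its tags (AT, NN) are made up for illustration rather than taken from a trained tagger.

```python
def pos_features(sentence, i, history):
    # Copy of the extractor in pos_feat.py, repeated so this runs standalone.
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
        features["prev-tag"] = history[i - 1]
    return features

# Illustrative input: the tags in `history` are assumed, not predicted.
sentence = ["The", "dog", "barked"]
history = ["AT", "NN"]

first = pos_features(sentence, 0, [])        # no history yet at position 0
third = pos_features(sentence, 2, history)   # uses the tag chosen for "dog"

print(first)
print(third)
```

For the first word both prev-word and prev-tag are "<START>". For "barked", prev-word is "dog" and prev-tag is "NN", so the classifier sees the tag chosen one step earlier in addition to the suffixes.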
Then import the module.
>>> import pos_feat
>>> from pos_feat import *
As a reminder, if you edit the code, you can load the new version of the script with the reload() command.
>>> reload(pos_feat)
<module 'pos_feat' from '/Users/ken/Documents/workspace/NLTK Learning/scripts/pos_feat.pyc'>
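One detail about reloading is easy to miss: names brought in earlier with "from ... import *" keep pointing at the old objects until the from-import is run again. The sketch below illustrates this under a few assumptions: it targets Python 3, where reload() lives in importlib rather than being a builtin, and demo_feat is a hypothetical throwaway module written to a temporary directory, standing in for pos_feat.

```python
import importlib
import os
import sys
import tempfile

# Keep the demo free of cached bytecode so the edited source is always re-read.
sys.dont_write_bytecode = True

# Write a tiny hypothetical module (demo_feat) to a temporary directory.
tmp_dir = tempfile.mkdtemp()
module_path = os.path.join(tmp_dir, "demo_feat.py")
with open(module_path, "w") as f:
    f.write("VERSION = 1\n")

sys.path.insert(0, tmp_dir)
import demo_feat
from demo_feat import VERSION   # copies the current binding into this namespace

first_version = demo_feat.VERSION

# Simulate editing the script, then reload it.
with open(module_path, "w") as f:
    f.write("VERSION = 2  # edited\n")
importlib.invalidate_caches()
importlib.reload(demo_feat)

print(first_version)       # value before the edit
print(demo_feat.VERSION)   # value seen through the reloaded module
print(VERSION)             # name imported earlier with from-import
```

The module attribute reflects the edit, while the name imported earlier still holds the old value. After reloading pos_feat, running "from pos_feat import *" again picks up the new definitions.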
The rest is the same as in the other examples.
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> size = int(len(tagged_sents) * 0.1)
>>> train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
>>> tagger = ConsecutivePosTagger(train_sents)
>>> print tagger.evaluate(test_sents)
0.79796012981
Accuracy improved by less than one percentage point over the previous example.
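For reference, the score printed by evaluate() is a per-token accuracy: the share of words whose predicted tag matches the gold tag. The following is a rough sketch of that calculation, not NLTK's own code; token_accuracy is a hypothetical helper and the two small sentences are invented for illustration.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    # float() keeps the division exact-style under Python 2 as well.
    return float(correct) / total if total else 0.0

# Invented data: one of the five tokens is tagged incorrectly.
gold = [[("The", "AT"), ("dog", "NN"), ("barked", "VBD")],
        [("It", "PPS"), ("ran", "VBD")]]
predicted = [[("The", "AT"), ("dog", "NN"), ("barked", "VBN")],
             [("It", "PPS"), ("ran", "VBD")]]

score = token_accuracy(gold, predicted)
print(score)  # 0.8
```

Read this way, a score of about 0.798 means roughly four out of five words in the held-out 10% of the news sentences received the correct tag.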