Further Examples of Supervised Classification (6.2)

Sentence Segmentation (6.2.1)

>>> import nltk
>>> sents = nltk.corpus.treebank_raw.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sents:
...     tokens.extend(sent)
...     offset += len(sent)
...     boundaries.add(offset-1)
...

Let's check the values.

>>> tokens[:10]
['.', 'START', '.', 'START', 'Pierre', 'Vinken', ',', '61', 'years', 'old']
>>> boundaries
set([8192, 1, 76379, 24580, 66902, 49158, 57357, 32782, 32783, 87384, 20, 49173,
....
 49127, 73710, 24559, 65521, 90098, 57331, 39593, 16376, 95604, 94251, 40959])

tokens is a flat list of words, which is already familiar. boundaries marks the last token of each sentence: more precisely, it is a set holding the index in tokens of each sentence-final token, which is why its members are numbers.
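To make the role of boundaries concrete, here is a minimal sketch that runs the same loop on two made-up sentences. The toy data and the name toy_sents are assumptions for illustration only; they do not come from the Treebank corpus.

```python
# Toy stand-in for nltk.corpus.treebank_raw.sents(); the sentences are invented.
toy_sents = [['Hello', 'world', '.'], ['How', 'are', 'you', '?']]

tokens = []
boundaries = set()
offset = 0
for sent in toy_sents:
    tokens.extend(sent)         # flatten every sentence into one token list
    offset += len(sent)         # offset is now one past the sentence's last token
    boundaries.add(offset - 1)  # so offset - 1 is the index of that last token

print(tokens)
print(sorted(boundaries))
# Looking the indices up again gives the sentence-final tokens.
print([tokens[i] for i in sorted(boundaries)])
```

Each number in boundaries is a position in tokens, not a character position inside a word.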

>|python|
>>> def punct_features(tokens, i):
...     return {'next-word-capitalized': tokens[i+1][0].isupper(),
...             'prevword': tokens[i-1].lower(),
...             'punct': tokens[i],
...             'prv-word-is-one-char': len(tokens[i-1]) == 1}
... 
>>> features = [(punct_features(tokens, i), (i in boundaries))
...             for i in range(1, len(tokens)-1)
...             if tokens[i] in '.?!']
>>> size = int(len(features) * 0.1)
>>> train_set, test_set = features[size:], features[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.917741935483871
>>> 
||< 
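The split and the accuracy figure can be illustrated without NLTK. The sketch below is only an approximation under stated assumptions: split_train_test and accuracy are hypothetical helper names, the data is invented, and the "classifier" is a baseline that always answers True. It shows that the first 10% of the list becomes the test set and that accuracy is the fraction of test items labeled correctly.

```python
def split_train_test(items, test_fraction=0.1):
    # Hypothetical helper: the first 10% is held out for testing,
    # the rest is used for training (same slicing as in the session).
    size = int(len(items) * test_fraction)
    return items[size:], items[:size]

def accuracy(predict, labeled):
    # Hypothetical helper: fraction of (features, label) pairs
    # for which the prediction equals the gold label.
    correct = sum(1 for feats, label in labeled if predict(feats) == label)
    return float(correct) / len(labeled)

# Invented data: 20 items, every fourth one labeled False.
items = [({'n': n}, n % 4 != 0) for n in range(20)]
train_set, test_set = split_train_test(items)

def always_true(feats):
    # Baseline "classifier" that ignores its input.
    return True

print(len(train_set), len(test_set))
print(accuracy(always_true, test_set))
```

Comparing the real classifier against a trivial baseline like this is one way to judge whether a score such as 0.92 is actually good.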

What's inside of <b>features</b>?
>|python|
>>> features
[({'next-word-capitalized': True, 'punct': '.', 'prv-word-is-one-char': False, 'prevword': 'start'}, False), ({'next-word-capitalized': False, 'punct': '.', 'prv-word-is-one-char': False, 'prevword': 'nov'}, True), 
....
||<

Each element is a pair: a dictionary of features and a label. The features are 1) whether the next word starts with a capital letter (next-word-capitalized), 2) the punctuation token itself (punct), 3) whether the previous word is one character long (prv-word-is-one-char), and 4) the previous word in lower case (prevword). The label 5) is True when the token's index is in boundaries, that is, when the punctuation really ends a sentence. Only tokens that are '.', '?' or '!' are considered at all.
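A small example may help to see how the (features, label) pairs come about. This is a sketch on an invented token list; toy_tokens, toy_boundaries and toy_features are names made up for this illustration, and punct_features is repeated so that the snippet runs on its own.

```python
def punct_features(tokens, i):
    # Same feature extractor as in the session, repeated for self-containment.
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prv-word-is-one-char': len(tokens[i-1]) == 1}

# Invented example: the first '.' follows an abbreviation,
# the second and third end a sentence.
toy_tokens = ['Mr', '.', 'Smith', 'left', '.', 'He', 'ran', '.']
toy_boundaries = {4, 7}

# The range stops before the last token because punct_features reads tokens[i+1].
toy_features = [(punct_features(toy_tokens, i), (i in toy_boundaries))
                for i in range(1, len(toy_tokens) - 1)
                if toy_tokens[i] in '.?!']

for feats, label in toy_features:
    print(feats, label)
```

Both periods are followed by a capitalized word, so next-word-capitalized alone cannot separate them; here it is prevword ('mr' versus 'left') that distinguishes the abbreviation from the real boundary.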

>>> def segment_sentence(words):
...     start = 0
...     sents = []
...     for i, word in enumerate(words):
...         # Skip the last token: punct_features looks at words[i+1]
...         if (word in '.?!' and i < len(words) - 1 and
...                 classifier.classify(punct_features(words, i))):
...             sents.append(words[start:i+1])
...             start = i+1
...     if start < len(words):
...         sents.append(words[start:])
...     return sents
...
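To try the segmenter without training on the corpus, the sketch below swaps in a hand-written rule for the Naive Bayes classifier. RuleClassifier is a hypothetical stand-in, not part of NLTK, and this variant of segment_sentence takes the classifier as an argument and skips the final token so that punct_features never reads past the end of the list; both choices are assumptions made for this illustration.

```python
class RuleClassifier:
    """Hypothetical stand-in for the trained classifier (not an NLTK class)."""
    def classify(self, feats):
        # Crude rule: a boundary if the next word is capitalized and the
        # previous word is not one of a few known abbreviations.
        return (feats['next-word-capitalized']
                and feats['prevword'] not in {'mr', 'dr'})

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prv-word-is-one-char': len(tokens[i-1]) == 1}

def segment_sentence(words, classifier):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # The last token is skipped here because punct_features needs words[i+1];
        # any leftover tokens are collected after the loop.
        if (word in ('.', '?', '!') and i < len(words) - 1
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])
    return sents

words = ['Mr', '.', 'Smith', 'left', '.', 'He', 'ran', '.']
result = segment_sentence(words, RuleClassifier())
for sent in result:
    print(sent)
```

With the trained classifier in place of RuleClassifier, the call looks the same; only the object passed as the second argument changes.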