Further Examples of Supervised Classification (6.2)
Sentence Segmentation (6.2.1)
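>>> import nltk
>>> sents = nltk.corpus.treebank_raw.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sents:
...     tokens.extend(sent)          # flatten every sentence into one token list
...     offset += len(sent)          # offset now points just past this sentence
...     boundaries.add(offset-1)     # index of the last token of this sentence
...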
Let's check the values.
>>> tokens[:10]
['.', 'START', '.', 'START', 'Pierre', 'Vinken', ',', '61', 'years', 'old']
>>> boundaries
set([8192, 1, 76379, 24580, 66902, 49158, 57357, 32782, 32783, 87384, 20, 49173, .... 49127, 73710, 24559, 65521, 90098, 57331, 39593, 16376, 95604, 94251, 40959])
tokens is a flat list of word tokens, as we have already seen. boundaries is a set of integers: each one is the index in tokens of the last token of a sentence (an index, not the token or character itself).
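To see how the offsets line up, here is a minimal sketch on two made-up sentences. The toy_sents list is hypothetical and only stands in for treebank_raw.sents(), so the snippet runs without the corpus.

```python
# Hypothetical toy data standing in for nltk.corpus.treebank_raw.sents()
toy_sents = [['Hello', 'world', '.'], ['How', 'are', 'you', '?']]

toy_tokens = []
toy_boundaries = set()
offset = 0
for sent in toy_sents:
    toy_tokens.extend(sent)          # flatten all sentences into one token list
    offset += len(sent)              # offset now points just past this sentence
    toy_boundaries.add(offset - 1)   # index of the sentence's last token

print(toy_tokens)
print(sorted(toy_boundaries))
# Each boundary index points at a sentence-final token
print([toy_tokens[i] for i in sorted(toy_boundaries)])
```

The boundaries come out as 2 and 6, which are the positions of '.' and '?' in the flattened list.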
>>> def punct_features(tokens, i):
...     return {'next-word-capitalized': tokens[i+1][0].isupper(),
...             'prevword': tokens[i-1].lower(),
...             'punct': tokens[i],
...             'prv-word-is-one-char': len(tokens[i-1]) == 1}
...
>>> features = [(punct_features(tokens, i), (i in boundaries))
...             for i in range(1, len(tokens)-1)
...             if tokens[i] in '.?!']
>>> size = int(len(features) * 0.1)
>>> train_set, test_set = features[size:], features[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.917741935483871
What's inside features?
>>> features
[({'next-word-capitalized': True, 'punct': '.', 'prv-word-is-one-char': False, 'prevword': 'start'}, False), ({'next-word-capitalized': False, 'punct': '.', 'prv-word-is-one-char': False, 'prevword': 'nov'}, True), ....
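Each entry is a pair: a dictionary of features, and a True/False label saying whether that punctuation token is a sentence boundary (i in boundaries). The dictionary holds four features: 1) whether the next word starts with a capital letter (next-word-capitalized), 2) the previous word, lowercased (prevword), 3) the punctuation token itself (punct), and 4) whether the previous word is one character long (prv-word-is-one-char). Only tokens that are '.', '?' or '!' get an entry, because of the filter at the end of the list comprehension.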
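As a small illustration of why more than one feature is needed, the sketch below runs the same punct_features function over a made-up token list. The toy list and toy_boundaries set are hypothetical. Both periods are followed by a capitalized word, so only the previous word separates the abbreviation from the real sentence end.

```python
def punct_features(tokens, i):
    # Same feature extractor as in the session above
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prv-word-is-one-char': len(tokens[i-1]) == 1}

# Hypothetical toy input: the first '.' belongs to "Mr.", the second ends a sentence
toy = ['Mr', '.', 'Smith', 'left', '.', 'He', 'returned', '.']
toy_boundaries = {4, 7}

# The range stops before the last token because punct_features reads tokens[i+1]
toy_features = [(punct_features(toy, i), i in toy_boundaries)
                for i in range(1, len(toy) - 1)
                if toy[i] in '.?!']

for feats, label in toy_features:
    print(feats, label)
```

Note that the final '.' gets no entry: the range excludes the last index, since there is no next token to inspect there.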
>>> def segment_sentence(words):
...     start = 0
...     sents = []
...     for i, word in enumerate(words):
...         # punct_features reads words[i+1], so skip a mark at the very end
...         if (word in '.?!' and i < len(words) - 1
...                 and classifier.classify(punct_features(words, i))):
...             sents.append(words[start:i+1])
...             start = i+1
...     if start < len(words):
...         sents.append(words[start:])  # remaining words form the last sentence
...     return sents
...
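To try the segmenter without training on the corpus, the sketch below swaps in a hypothetical rule-based stand-in for the trained classifier. AbbrevRuleClassifier is not part of NLTK; it only imitates the classify(features) call. The loop also skips a punctuation mark at the very end of the input, where there is no next token for punct_features to read.

```python
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prv-word-is-one-char': len(tokens[i-1]) == 1}

class AbbrevRuleClassifier:
    """Hypothetical stand-in for the trained nltk.NaiveBayesClassifier."""
    def classify(self, features):
        # Call it a boundary when the next word is capitalized
        # and the previous word is not a known abbreviation
        return (features['next-word-capitalized']
                and features['prevword'] not in {'mr', 'mrs', 'dr'})

classifier = AbbrevRuleClassifier()

def segment_sentence(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # Skip the final token: punct_features needs words[i+1]
        if (word in '.?!' and i < len(words) - 1
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])   # whatever remains is the last sentence
    return sents

words = ['Mr', '.', 'Smith', 'left', '.', 'He', 'returned', '.']
result = segment_sentence(words)
print(result)
# prints [['Mr', '.', 'Smith', 'left', '.'], ['He', 'returned', '.']]
```

With the real Naive Bayes classifier the decision comes from the learned feature weights instead of a fixed abbreviation list, so the split points may differ.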