Recognizing Textual Entailment (6.2.3)
Save the following source code as rte_features.py.
import nltk

def rte_features(rtepair):
    # Build a feature dictionary for one RTE text/hypothesis pair.
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    # Words and named entities shared by the text and the hypothesis
    features['word_overlap'] = len(extractor.overlap('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    # Words and named entities found only in the hypothesis
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features
Next, work through the sample from the textbook step by step to see what each call returns.
>>> import nltk
>>> import rte_features
>>> rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
>>> extractor = nltk.RTEFeatureExtractor(rtepair)
>>> print extractor.text_words
set(['Russia', 'Organisation', 'Shanghai', 'Asia', 'four', 'at', 'operation', 'SCO', 'Iran', 'Soviet', 'Davudi', 'fight', 'China', 'association', 'fledgling', 'terrorism', 'was', 'that', 'republics', 'Co', 'representing', 'former', 'Parviz', 'central', 'meeting', 'together', 'binds'])
This step extracts the set of words in the text. 'Stop words' such as "a" and "and" have already been excluded.
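To make the stop-word filtering concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the STOPWORDS list, the simple regex tokenizer, the lower-casing, and the helper name content_words are all assumptions made for illustration. The example sentence is assumed to be the hypothesis of this pair.

```python
import re

# Assumed stop list, for illustration only; NLTK's own list may differ.
STOPWORDS = {"a", "the", "it", "they", "of", "in", "to", "is",
             "have", "are", "were", "and", "very"}

def content_words(sentence, stopwords=STOPWORDS):
    """Return the set of tokens in `sentence` that are not stop words."""
    tokens = re.findall(r"\w+", sentence)  # simplified tokenizer (assumption)
    return {tok for tok in tokens if tok.lower() not in stopwords}

# Assumed hypothesis sentence for this pair.
hyp_words = content_words("China is a member of SCO.")
print(sorted(hyp_words))  # ['China', 'SCO', 'member']
```

With this stop list, "is", "a" and "of" are dropped and only the three content words remain.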
>>> print extractor.hyp_words
set(['member', 'SCO', 'China'])
This applies the same logic to the hypothesis instead of the text. I guess "is", "a" and "of" are in the list of stop words, since they are missing from the result.
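>>> print extractor.overlap('word')
set([])
An empty set is returned. This call extracts the words, excluding named entities, that appear in both the text and the hypothesis. Here the only tokens shared by the two sets are 'SCO' and 'China', and both count as named entities.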
>>> print extractor.overlap('ne')
set(['SCO', 'China'])
This one extracts the named entities ('ne') that appear in both the text and the hypothesis.
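The two overlap calls can be reproduced with ordinary set operations. The sketch below is a simplified stand-in, not the NLTK source: it assumes a token counts as a named entity when it is capitalised or all upper case, and the names is_named_entity and overlap are hypothetical helpers. The word sets are copied from the session output above.

```python
def is_named_entity(token):
    # Assumption: capitalised or all-caps tokens are named entities.
    return token.istitle() or token.isupper()

def overlap(text_words, hyp_words, kind):
    """Tokens found in both sets, filtered to 'ne' or plain 'word'."""
    shared = text_words & hyp_words
    if kind == "ne":
        return {w for w in shared if is_named_entity(w)}
    return {w for w in shared if not is_named_entity(w)}

text_words = {
    "Russia", "Organisation", "Shanghai", "Asia", "four", "at",
    "operation", "SCO", "Iran", "Soviet", "Davudi", "fight", "China",
    "association", "fledgling", "terrorism", "was", "that", "republics",
    "Co", "representing", "former", "Parviz", "central", "meeting",
    "together", "binds",
}
hyp_words = {"member", "SCO", "China"}

word_overlap = overlap(text_words, hyp_words, "word")  # empty set
ne_overlap = overlap(text_words, hyp_words, "ne")      # SCO and China
print(sorted(ne_overlap))
```

Under that capitalisation assumption, the intersection {'SCO', 'China'} falls entirely on the named-entity side, which is why the 'word' overlap is empty.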
>>> print extractor.hyp_extra('word')
set(['member'])
The last call extracts the words (excluding named entities) that appear only in the hypothesis.
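Putting the pieces together, the four counts that rte_features would produce for this pair can be worked out by hand. The following is a rough sketch under the same assumption (capitalised or all-caps tokens are named entities); toy_rte_features and looks_like_ne are hypothetical names, not part of NLTK.

```python
def looks_like_ne(token):
    # Assumption: capitalised or all-caps tokens are named entities.
    return token.istitle() or token.isupper()

def toy_rte_features(text_words, hyp_words):
    """Count shared and hypothesis-only tokens, split by token type."""
    shared = text_words & hyp_words   # in both text and hypothesis
    extra = hyp_words - text_words    # only in the hypothesis
    return {
        "word_overlap": sum(1 for w in shared if not looks_like_ne(w)),
        "word_hyp_extra": sum(1 for w in extra if not looks_like_ne(w)),
        "ne_overlap": sum(1 for w in shared if looks_like_ne(w)),
        "ne_hyp_extra": sum(1 for w in extra if looks_like_ne(w)),
    }

# Word sets copied from the session output above.
text_words = {
    "Russia", "Organisation", "Shanghai", "Asia", "four", "at",
    "operation", "SCO", "Iran", "Soviet", "Davudi", "fight", "China",
    "association", "fledgling", "terrorism", "was", "that", "republics",
    "Co", "representing", "former", "Parviz", "central", "meeting",
    "together", "binds",
}
hyp_words = {"member", "SCO", "China"}

features = toy_rte_features(text_words, hyp_words)
print(features["word_hyp_extra"], features["ne_overlap"])  # 1 2
```

The only hypothesis-only token is 'member', so word_hyp_extra is 1, and no named entity appears in the hypothesis alone, so ne_hyp_extra is 0.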