Recognizing Textual Entailment (6.2.3)

Save the following source code as rte_features.py.

import nltk

def rte_features(rtepair):
    # Build the word and named-entity sets for one text/hypothesis pair.
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    # Each feature is a count of tokens; 'ne' stands for named entities.
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

Next, analyze the sample from the textbook. I need to understand it step by step.

>>> import nltk
>>> import rte_features
>>> rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
>>> extractor = nltk.RTEFeatureExtractor(rtepair)
>>> print extractor.text_words
set(['Russia', 'Organisation', 'Shanghai', 'Asia', 'four', 'at', 'operation', 'SCO', 'Iran', 'Soviet', 'Davudi', 'fight', 'China', 'association', 'fledgling', 'terrorism', 'was', 'that', 'republics', 'Co', 'representing', 'former', 'Parviz', 'central', 'meeting', 'together', 'binds'])

This step extracts the set of words in the text. Stop words such as "a" and "and" have already been excluded.

>>> print extractor.hyp_words
set(['member', 'SCO', 'China'])

This applies the same logic to the hypothesis instead of the text. I guess "is", "a" and "of" are in the stop word list.
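The stop word filtering can be sketched in a few lines of plain Python. This is only a simplified illustration, not NLTK's actual implementation: the function name content_words, the stop word list, the crude regular-expression tokenizer and the wording of the hypothesis sentence are all assumptions made for the example.

```python
import re

# Hypothetical stop word list for illustration; NLTK's own list may differ.
STOP_WORDS = {"a", "the", "of", "in", "to", "is", "are", "and"}

def content_words(sentence, stop_words=STOP_WORDS):
    """Return the distinct word tokens in `sentence`, minus stop words."""
    tokens = re.findall(r"\w+", sentence)  # crude tokenizer: runs of word characters
    return set(tokens) - set(stop_words)   # set difference drops the stop words

# Assumed wording of the hypothesis, used only as sample input.
hyp_words = content_words("China is a member of SCO.")
print(sorted(hyp_words))  # ['China', 'SCO', 'member']
```

Because the result is a set, repeated words collapse to one entry and the order of the original sentence is lost, which matches the set output shown in the session.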

>>> print extractor.overlap('word')
set([])

An empty set is returned. This call extracts the words (excluding named entities) that appear in both the text and the hypothesis.

>>> print extractor.overlap('ne')
set(['SCO', 'China'])

This one extracts the named entities ('ne') that appear in both the text and the hypothesis.

>>> print extractor.hyp_extra('word')
set(['member'])

The last one extracts the words (excluding named entities) that appear only in the hypothesis.
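To check my understanding of the four calls, here is a rough stand-in written with ordinary set operations. It is a sketch under assumptions, not the NLTK source: SketchExtractor and is_named_entity are hypothetical names, and I am assuming that a token counts as a named entity when it is capitalised or all upper case. The word sets are copied from the session above, with the text set shortened to a few members.

```python
def is_named_entity(token):
    # Assumed heuristic: capitalised or all-caps tokens count as named entities.
    return token.istitle() or token.isupper()

class SketchExtractor(object):
    """Hypothetical stand-in that mimics the overlap/hyp_extra interface."""

    def __init__(self, text_words, hyp_words):
        self._overlap = hyp_words & text_words    # in both text and hypothesis
        self._hyp_extra = hyp_words - text_words  # in the hypothesis only

    @staticmethod
    def _split(tokens, toktype):
        # Separate named entities from ordinary words, then pick one group.
        ne = set(t for t in tokens if is_named_entity(t))
        if toktype == 'ne':
            return ne
        if toktype == 'word':
            return tokens - ne
        raise ValueError("toktype must be 'ne' or 'word'")

    def overlap(self, toktype):
        return self._split(self._overlap, toktype)

    def hyp_extra(self, toktype):
        return self._split(self._hyp_extra, toktype)

# Shortened copy of the text word set, plus the full hypothesis word set.
text_words = {'Russia', 'SCO', 'Iran', 'China', 'fight', 'terrorism', 'meeting'}
hyp_words = {'member', 'SCO', 'China'}

extractor = SketchExtractor(text_words, hyp_words)
features = {
    'word_overlap': len(extractor.overlap('word')),      # 0
    'word_hyp_extra': len(extractor.hyp_extra('word')),  # 1
    'ne_overlap': len(extractor.overlap('ne')),          # 2
    'ne_hyp_extra': len(extractor.hyp_extra('ne')),      # 0
}
print(sorted(extractor.overlap('ne')))  # ['China', 'SCO']
```

Under these assumptions the sketch gives the same sets as the session: both shared tokens are named entities, so the word overlap is empty, and 'member' is the only token found in the hypothesis alone.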