WordNet and Hypernyms
WordNet is a lexical database, a kind of dictionary.
Japanese: http://ja.wikipedia.org/wiki/WordNet
English: http://en.wikipedia.org/wiki/WordNet
NLTK has a simple interface to WordNet. A synset is a group of words with similar meaning. A word may belong to a single synset, or to several synsets when it has more than one meaning.
How to import WordNet:
>>> from nltk.corpus import wordnet
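As an example, take the first synset for "cookbook". (These sessions were run on Python 2.7 with an older NLTK release, where name, definition, examples and pos are plain attributes; in NLTK 3 and later they are methods, so write syn.name(), syn.definition() and so on.)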
>>> syn = wordnet.synsets('cookbook')[0]
>>> syn.name
'cookbook.n.01'
>>> syn.definition
'a book of recipes and cooking directions'
This is a little unclear, especially the meaning of "[0]", so I tried some other words.
>>> wordnet.synsets('sugar')
[Synset('sugar.n.01'), Synset('carbohydrate.n.01'), Synset('boodle.n.01'), Synset('sugar.v.01')]
>>> wordnet.synsets('salt')
[Synset('salt.n.01'), Synset('salt.n.02'), Synset('strategic_arms_limitation_talks.n.01'), Synset('salt.n.04'), Synset('salt.v.01'), Synset('salt.v.02'), Synset('salt.v.03'), Synset('salt.v.04'), Synset('salt.s.01')]
The word "sugar" appears in 4 synsets: sugar.n.01, carbohydrate.n.01, boodle.n.01 and sugar.v.01. For "salt" there are 9 synsets. In Python, [0] selects the first element of a list, so wordnet.synsets('cookbook')[0] is the first of the synsets that contain 'cookbook'.
>>> wordnet.synsets('cookbook')
[Synset('cookbook.n.01')]
Anyway, only one synset exists for this word. Let's move on. Synsets have some example sentences.
>>> wordnet.synsets('cooking')[0].examples
['cooking can be a great art', 'people are needed who have experience in cookery', 'he left the preparation of meals to his wife']
As with the "sugar" and "salt" examples, "cooking" belongs to several synsets and [0] picks the first one.
>>> wordnet.synsets('cooking')
[Synset('cooking.n.01'), Synset('cook.v.01'), Synset('cook.v.02'), Synset('cook.v.03'), Synset('fudge.v.01'), Synset('cook.v.05')]
>>> wordnet.synsets('cooking')[0]
Synset('cooking.n.01')
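The indexing used so far can be reproduced with ordinary Python lists. The following is only a sketch: the synset names are copied as plain strings from the outputs above, and the table synsets_by_word and the helper first_synset are made-up names for illustration, not part of NLTK.

```python
# Synset names copied from the outputs above (plain strings, not NLTK objects).
synsets_by_word = {
    'cookbook': ['cookbook.n.01'],
    'sugar': ['sugar.n.01', 'carbohydrate.n.01', 'boodle.n.01', 'sugar.v.01'],
    'cooking': ['cooking.n.01', 'cook.v.01', 'cook.v.02', 'cook.v.03',
                'fudge.v.01', 'cook.v.05'],
}


def first_synset(word):
    """Return the first synset name for word, or None if the word is unknown."""
    names = synsets_by_word.get(word, [])
    # names[0] would raise IndexError on an empty list, so check first.
    return names[0] if names else None


for word in ('cookbook', 'sugar', 'cooking', 'unknownword'):
    count = len(synsets_by_word.get(word, []))
    print(word, count, first_synset(word))
```

Checking for an empty list before using [0] matters in practice, because a lookup for a word with no synsets returns an empty list.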
A hypernym is a word that is more generic than a given word. This command displays the hypernym(s) of the word "cookbook".
>>> syn.hypernyms()
[Synset('reference_book.n.01')]
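A hypernym link can be pictured as a child-to-parent table. The sketch below is a toy model in plain Python, not the NLTK API: the three links are taken from outputs shown in this post, and the names hypernym_of and is_a are invented for illustration. It also assumes one parent per entry, which real WordNet does not guarantee.

```python
# Toy child -> parent table; links copied from outputs in this post,
# not looked up in WordNet itself.
hypernym_of = {
    'cookbook.n.01': 'reference_book.n.01',
    'reference_book.n.01': 'book.n.01',
    'book.n.01': 'publication.n.01',
}


def is_a(name, ancestor):
    """Return True if ancestor is reached by following hypernym links upward."""
    current = name
    while current in hypernym_of:
        current = hypernym_of[current]
        if current == ancestor:
            return True
    return False


print(hypernym_of['cookbook.n.01'])         # direct hypernym
print(is_a('cookbook.n.01', 'book.n.01'))   # a cookbook is a kind of book
print(is_a('book.n.01', 'cookbook.n.01'))   # but not the other way round
```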
The next command displays the hyponyms of that hypernym. A hyponym is a word that is more specific than a given word.
>>> syn.hypernyms()[0].hyponyms()
[Synset('encyclopedia.n.01'), Synset('directory.n.01'), Synset('source_book.n.01'), Synset('handbook.n.01'), Synset('instruction_book.n.01'), Synset('cookbook.n.01'), Synset('annual.n.02'), Synset('atlas.n.02'), Synset('wordbook.n.01')]
"cookbook" has "reference_book" as its hypernym, so this example lists the hyponyms of "reference_book". Naturally "cookbook" is among them; the others can be regarded as 'neighbors' of "cookbook".
root_hypernyms() returns the synset(s) at the top level of the hierarchy. In this case that is "entity".
>>> syn.root_hypernyms()
[Synset('entity.n.01')]
This command shows the path from the top level down to the bottom level ("cookbook").
>>> syn.hypernym_paths()
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('creation.n.02'), Synset('product.n.02'), Synset('work.n.02'), Synset('publication.n.01'), Synset('book.n.01'), Synset('reference_book.n.01'), Synset('cookbook.n.01')]]
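To see how such a path can be built, here is a small sketch that climbs parent links until no parent is left and then reverses the result. It is a toy model, not how NLTK implements hypernym_paths(): the names are copied from the output above, parent_of and path_from_root are made-up names, and it assumes a single parent per synset. Real synsets can have more than one hypernym, which is why NLTK returns a list of paths.

```python
# The single path printed above, as plain strings.
path = [
    'entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
    'artifact.n.01', 'creation.n.02', 'product.n.02', 'work.n.02',
    'publication.n.01', 'book.n.01', 'reference_book.n.01', 'cookbook.n.01',
]

# Pair each element with its successor to get a child -> parent table.
parent_of = {child: parent for parent, child in zip(path, path[1:])}


def path_from_root(name):
    """Climb parent links to the root, then return the chain root-first."""
    chain = [name]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return list(reversed(chain))


result = path_from_root('cookbook.n.01')
print(result[0])      # the root
print(len(result))    # number of levels, root and "cookbook" included
print(result == path)
```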
The next command answers the question of what the "n" after the word and the period (.) means.
>>> syn.pos
'n'
Now it is clear. That's part-of-speech (POS).
n: Noun
a: Adjective
r: Adverb
v: Verb
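A synset name can therefore be read as word, POS letter and sense number. The sketch below splits a name with plain string handling; POS_NAMES and describe are illustrative names rather than NLTK functions, and any letter outside the four listed above is simply reported as 'unknown'.

```python
# The four POS letters listed above.
POS_NAMES = {'n': 'Noun', 'a': 'Adjective', 'r': 'Adverb', 'v': 'Verb'}


def describe(name):
    """Split a name like 'cookbook.n.01' into (word, POS name, sense number)."""
    # Split from the right so a word that itself contains a dot stays intact.
    word, pos, number = name.rsplit('.', 2)
    return word, POS_NAMES.get(pos, 'unknown'), int(number)


print(describe('cookbook.n.01'))
print(describe('fudge.v.01'))
```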
Using len, we can count how many synsets a query returns.
>>> len(wordnet.synsets('great'))
7
>>> len(wordnet.synsets('great', pos='n'))
1
>>> len(wordnet.synsets('great', pos='a'))
6
>>> wordnet.synsets('great')
[Synset('great.n.01'), Synset('great.s.01'), Synset('great.s.02'), Synset('great.s.03'), Synset('bang-up.s.01'), Synset('capital.s.03'), Synset('big.s.13')]
In this example the list contains only 1 noun and 6 adjectives. However, I got an interesting error when I tried the following.
>>> len(wordnet.synsets('great', pos='s'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1200, in synsets
    for form in self._morphy(lemma, p)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1380, in _morphy
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]
KeyError: 's'
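As you can see, 6 elements are marked 's'. My first guess was that 's' stands for synonym, but in WordNet it marks an "adjective satellite", an adjective grouped around a head adjective. That explains why these 6 synsets come back for pos='a'. The KeyError itself comes from this NLTK version's MORPHOLOGICAL_SUBSTITUTIONS table, which has no entry for 's'.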
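The counts can be double-checked by tallying the POS letter inside each name. This is a sketch in plain Python with the names copied from the wordnet.synsets('great') output above; great_names and counts are illustrative names.

```python
from collections import Counter

# Names copied from the wordnet.synsets('great') output above.
great_names = [
    'great.n.01', 'great.s.01', 'great.s.02', 'great.s.03',
    'bang-up.s.01', 'capital.s.03', 'big.s.13',
]

# The POS letter is the middle field of word.pos.number.
counts = Counter(name.rsplit('.', 2)[1] for name in great_names)

for pos, count in sorted(counts.items()):
    print(pos, count)
```

The tally shows one 'n' and six 's', matching the 1 and 6 returned by len for pos='n' and pos='a'.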