Similarity of words
Let's start with a sample that calculates the similarity of two words, 'cookbook' and 'instruction_book'.
>>> from nltk.corpus import wordnet
>>> cb = wordnet.synset('cookbook.n.01')
>>> ib = wordnet.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)
0.9166666666666666
The calculation logic is not obvious to me, but the closer the number is to 1, the more similar the two words are. I think a score of about 0.92 means they are very similar. Another example gets the distance between two words.
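For reference, Wu-Palmer similarity is usually defined as 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest hypernym the two synsets share. The sketch below only does that arithmetic. The depths are my assumption (values that happen to reproduce the score above), not something read out of WordNet, and NLTK's own implementation may differ in details.

```python
def wup(depth_lcs, depth_a, depth_b):
    """Wu-Palmer score from node depths (root counted as depth 1)."""
    return 2 * depth_lcs / (depth_a + depth_b)

# Assumed depths: the shared hypernym at depth 11 and both synsets
# one level below it, at depth 12.
score = wup(11, 12, 12)
print(round(score, 4))  # 0.9167
```

Read this way, the score gets closer to 1 when the shared hypernym sits deep in the hierarchy, close to both words.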
>>> ref = cb.hypernyms()[0]
>>> cb.shortest_path_distance(ref)
1
>>> cb.shortest_path_distance(ib)
2
The hypernym of cb ('cookbook.n.01') is assigned to ref; its value should be 'reference_book.n.01'. The first shortest_path_distance() call gets the shortest distance from cb to ref, which is 1 because ref is a direct hypernym of cb. The second one is between cb and ib ('instruction_book.n.01'). A distance of 2 is natural if cb and ib share the same hypernym: one step up from cb to ref, then one step down to ib.
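To make the "one step up, one step down" idea concrete, here is a minimal sketch of a shortest-path search over a tiny hand-built hierarchy. The HYPERNYMS table and the path_distance() helper are hypothetical names for this illustration only; this is not how NLTK stores WordNet or implements shortest_path_distance().

```python
from collections import deque

# Hypothetical mini-hierarchy: each word maps to its hypernyms (parents).
HYPERNYMS = {
    "cookbook": ["reference_book"],
    "instruction_book": ["reference_book"],
    "reference_book": ["book"],
    "book": [],
}

def path_distance(a, b):
    """Fewest hypernym/hyponym edges between a and b, or None if unconnected."""
    # Build an undirected view so we can walk both up and down.
    neighbours = {name: set(parents) for name, parents in HYPERNYMS.items()}
    for child, parents in HYPERNYMS.items():
        for parent in parents:
            neighbours[parent].add(child)
    # Plain breadth-first search starting from a.
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

print(path_distance("cookbook", "reference_book"))    # 1
print(path_distance("cookbook", "instruction_book"))  # 2
```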
Next, try another word. How about the similarity between cb (cookbook) and 'dog'?
>>> dog = wordnet.synsets('dog')[0]
>>> dog.wup_similarity(cb)
0.38095238095238093
The value drops to about 0.38, so they are less similar. Are there any common hypernyms?
>>> dog.common_hypernyms(cb)
[Synset('object.n.01'), Synset('whole.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01')]
There are some common hypernyms, but all of them are very abstract words.
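One way to picture this: write each word's hypernym chain from the root down and keep the part they share. The sketch below does that with simplified, hand-written chains. Only the first four names come from the output above; the remaining entries are illustrative stand-ins, not the real WordNet chains. (NLTK also provides a lowest_common_hypernyms() method on synsets, which I believe returns only the most specific shared hypernym.)

```python
# Simplified, hand-written hypernym chains, root first. Only the first four
# names are taken from the output above; the rest are illustrative stand-ins.
DOG_PATH = ["entity.n.01", "physical_entity.n.01", "object.n.01", "whole.n.02",
            "living_thing", "animal", "dog"]
COOKBOOK_PATH = ["entity.n.01", "physical_entity.n.01", "object.n.01", "whole.n.02",
                 "artifact", "book", "cookbook"]

def common_prefix(path_a, path_b):
    """Hypernyms shared by two root-first chains, most general first."""
    shared = []
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared.append(a)
    return shared

shared = common_prefix(DOG_PATH, COOKBOOK_PATH)
print(shared[-1])   # whole.n.02, the most specific shared hypernym
print(len(shared))  # 4
```

The chains split right after 'whole.n.02', which is why nothing more concrete than that shows up in the list.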
How about verbs? This example compares cook and bake.
>>> cook = wordnet.synset('cook.v.01')
>>> bake = wordnet.synset('bake.v.02')
>>> cook.wup_similarity(bake)
0.6666666666666666
>>> bake = wordnet.synset('bake.v.01')
Is 0.67 similar enough? (Note: in the textbook the result was 0.75. I guess WordNet has been updated since then.)
Another method, Leacock-Chodorow (LCH) similarity, is introduced, shown here next to path_similarity() for comparison.
>>> cb.path_similarity(ib)
0.3333333333333333
>>> cb.path_similarity(dog)
0.07142857142857142
>>> cb.lch_similarity(ib)
2.538973871058276
>>> cb.lch_similarity(dog)
0.9985288301111273
I guess that the higher the number, the higher the similarity. Compared with path_similarity(), though, it is not clear from the number alone whether it means high similarity or not, because the score is not limited to the 0 to 1 range. In that respect, path_similarity() is more user-friendly, especially for newbies like me.
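As a rough aid for reading these numbers, both measures can be written in terms of the shortest path distance d: path similarity as 1 / (d + 1), and LCH as -log((d + 1) / (2 * D)), where D is the maximum depth of the hierarchy. The sketch below is plain arithmetic, not NLTK code. D = 19 for nouns and the distance of 13 between cookbook and dog are my assumptions, chosen because they reproduce the scores above.

```python
import math

def path_similarity(distance):
    """Path similarity from a shortest path distance (number of edges)."""
    return 1 / (distance + 1)

def lch_similarity(distance, max_depth):
    """Leacock-Chodorow score from a path distance and the taxonomy depth."""
    return -math.log((distance + 1) / (2 * max_depth))

MAX_NOUN_DEPTH = 19  # assumption: value that reproduces the scores above

# cookbook vs instruction_book: distance 2, as measured earlier
print(round(path_similarity(2), 4))                 # 0.3333
print(round(lch_similarity(2, MAX_NOUN_DEPTH), 4))  # 2.539

# Upper bound under this assumption: identical synsets (distance 0)
print(round(lch_similarity(0, MAX_NOUN_DEPTH), 4))  # 3.6376
```

Under these assumptions the LCH score tops out at about 3.64, so 2.54 is fairly high and 1.0 is fairly low, which is harder to see at a glance than with a 0 to 1 scale.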