Similarity of words - Deutschina's Tech Diary

Let's start with a sample that calculates the similarity of two words, 'cookbook' and 'instruction_book'.

>>> from nltk.corpus import wordnet
>>> cb = wordnet.synset('cookbook.n.01')
>>> ib = wordnet.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)
0.9166666666666666

The calculation logic is unclear to me, but the closer the number is to 1, the more similar the two synsets are. At about 91.7%, I think we can say they are very similar. Another example is to get the distance between two words.

>>> ref = cb.hypernyms()[0]
>>> cb.shortest_path_distance(ref)
1
>>> cb.shortest_path_distance(ib)
2

The hypernym of cb ('cookbook.n.01') is assigned to ref. Its value should be 'reference_book.n.01'. The first shortest_path_distance() call gets the shortest distance from cb to ref, which is 1 because ref is a direct hypernym of cb. The second call is between cb and ib ('instruction_book.n.01'). A distance of 2 is quite natural if cb and ib share the same hypernym: one step up from cb to ref, then one step down to ib.
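To make the idea of "distance" concrete, here is a minimal sketch that counts edges on a tiny hand-written taxonomy. The HYPERNYM dictionary and the shortest_path_distance() function below are hypothetical stand-ins written for illustration; the structure is not read from WordNet, and this is not necessarily how NLTK computes the value internally.

```python
from collections import deque

# Hypothetical toy taxonomy: each node maps to its direct hypernym (parent).
# The names mirror the example above, but the structure is hand-written,
# not read from WordNet.
HYPERNYM = {
    "cookbook": "reference_book",
    "instruction_book": "reference_book",
    "reference_book": "book",
    "book": "publication",
}


def shortest_path_distance(a, b, hypernym=HYPERNYM):
    """Count the edges on the shortest path between a and b, or return None."""
    # Build an undirected neighbour map so the search can walk up and down.
    neighbours = {}
    for child, parent in hypernym.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)

    # Plain breadth-first search: the first time b is reached, the path is shortest.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two nodes


print(shortest_path_distance("cookbook", "reference_book"))    # 1
print(shortest_path_distance("cookbook", "instruction_book"))  # 2
```

In this toy tree, two siblings are always 2 edges apart, which matches the result above.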

Now let's try another word. How about the similarity between cb (cookbook) and 'dog'?

>>> dog = wordnet.synsets('dog')[0]
>>> dog.wup_similarity(cb)
0.38095238095238093

The value drops to 38%, so they are less similar. Are there any common hypernyms?

>>> dog.common_hypernyms(cb)
[Synset('object.n.01'), Synset('whole.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

There are some common hypernyms, but all of them are very abstract words.
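The abstract common hypernyms go a long way toward explaining the low score for 'dog'. Wu-Palmer similarity is usually defined as 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest hypernym the two synsets share. A shallow (abstract) lcs therefore gives a small number. The sketch below applies that formula to a hypothetical toy taxonomy; TOY, depth(), ancestors() and wup() are names made up for this illustration, and NLTK's own implementation has to handle multiple hypernym paths, which this sketch ignores.

```python
# Hypothetical toy taxonomy (child -> direct hypernym); not real WordNet data.
TOY = {
    "thing": "entity",
    "book": "thing",
    "animal": "thing",
    "cookbook": "book",
    "manual": "book",
    "dog": "animal",
}


def ancestors(node, hypernym):
    """Return node followed by all of its hypernyms up to the root, nearest first."""
    chain = [node]
    while node in hypernym:
        node = hypernym[node]
        chain.append(node)
    return chain


def depth(node, hypernym):
    """Number of nodes on the path from the root down to node (the root has depth 1)."""
    return len(ancestors(node, hypernym))


def wup(a, b, hypernym):
    """Wu-Palmer style score on a single-rooted tree."""
    chain_b = set(ancestors(b, hypernym))
    # The first ancestor of a that is also an ancestor of b is the deepest shared one.
    # This assumes a single root, so a shared ancestor always exists.
    lcs = next(n for n in ancestors(a, hypernym) if n in chain_b)
    return 2 * depth(lcs, hypernym) / (depth(a, hypernym) + depth(b, hypernym))


print(wup("cookbook", "manual", TOY))  # 0.75 (lcs is "book", depth 3)
print(wup("cookbook", "dog", TOY))     # 0.5  (lcs is "thing", depth 2)
```

The siblings share a deep, specific hypernym and score high, while 'cookbook' and 'dog' only meet near the root and score low.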

How about verbs? This example compares cook and bake.

>>> cook = wordnet.synset('cook.v.01')
>>> bake = wordnet.synset('bake.v.02')
>>> cook.wup_similarity(bake)
0.6666666666666666
>>> bake = wordnet.synset('bake.v.01')

Is 66.6666% similar enough? (Note: In the textbook the result was 0.75. I guess WordNet has been updated since then.)

Another method, Leacock-Chodorow (LCH) similarity, is introduced. Here it is compared with path_similarity().

>>> cb.path_similarity(ib)
0.3333333333333333
>>> cb.path_similarity(dog)
0.07142857142857142
>>> cb.lch_similarity(ib)
2.538973871058276
>>> cb.lch_similarity(dog)
0.9985288301111273

I guess that the higher the number, the higher the similarity. Compared with path_similarity(), though, it is not clear whether a given number means high similarity or not, because the result is not limited to the range 0 to 1. In that respect, path_similarity() is more user friendly, especially for newbies like me.
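For what it's worth, both scores can be reproduced from the path distance alone. Path similarity is commonly defined as 1 / (distance + 1), and LCH as -log((distance + 1) / (2 * D)), where D is the maximum depth of the taxonomy. The sketch below is an illustration under assumptions, not NLTK's code: the distance of 13 for 'dog' is inferred from the 1/14 result above, and D = 19 is simply the value that reproduces the outputs above, so treat both as inferences.

```python
import math

# Assumption: maximum depth of the noun taxonomy, inferred from the outputs above.
NOUN_MAX_DEPTH = 19


def path_similarity(distance):
    """Path similarity from a shortest-path distance; always between 0 and 1."""
    return 1.0 / (distance + 1)


def lch_similarity(distance, max_depth=NOUN_MAX_DEPTH):
    """Leacock-Chodorow style score; larger means more similar, with no fixed upper limit of 1."""
    return -math.log((distance + 1) / (2.0 * max_depth))


# cookbook vs. instruction_book: distance 2 (from shortest_path_distance above)
print(path_similarity(2))   # 0.3333333333333333
print(lch_similarity(2))    # about 2.539

# cookbook vs. dog: distance 13 is an inference from the 1/14 result
print(path_similarity(13))  # 0.07142857142857142
print(lch_similarity(13))   # about 0.9985

# Best possible LCH score under these assumptions (distance 0)
print(lch_similarity(0))    # about 3.64
```

Under these assumptions the LCH score tops out at about 3.64, which gives a rough yardstick: 2.54 is fairly close to the maximum, and 1.0 is far from it.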