Decision Trees (6.4)

Entropy and Information Gain (6.4.1)

Let's run the entropy calculation sample.

>>> import nltk
>>> from nltk_init import *
>>> import math
>>> def entoropy(labels):
...     freqdist = nltk.FreqDist(labels)
...     probs = [freqdist.freq(l) for l in nltk.FreqDist(labels)]
...     return -sum([p * math.log(p,2) for p in probs])
... 
>>> print entoropy(['male', 'male', 'male', 'male'])
-0.0
>>> print entoropy(['male', 'female', 'male', 'female'])
1.0
>>> print entoropy(['female', 'female', 'male', 'female'])
0.811278124459
>>> print entoropy(['female', 'female', 'female', 'female'])
-0.0
>>> 
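
For readers on Python 3, the same calculation can be sketched without NLTK. This is not the code from the session above: the name `entropy_py3` is hypothetical, and `collections.Counter` is assumed here as a stand-in for `nltk.FreqDist`.

```python
import math
from collections import Counter

def entropy_py3(labels):
    """Shannon entropy, in bits, of a list of labels."""
    counts = Counter(labels)                       # stand-in for nltk.FreqDist
    total = len(labels)
    probs = [n / total for n in counts.values()]   # relative frequency of each label
    return -sum(p * math.log2(p) for p in probs)

print(entropy_py3(['male', 'female', 'male', 'female']))    # 1.0
print(entropy_py3(['female', 'female', 'male', 'female']))
```

The second call should agree with the 0.811278124459 printed above, up to the number of digits shown.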

Let's step through the function in the debugger to understand it better.

>>> import pdb
>>> pdb.run("print entoropy(['male', 'male', 'male', 'male'])")
> <string>(1)<module>()->None
(Pdb) s
--Call--
> <stdin>(1)entoropy()
(Pdb) s
> <stdin>(2)entoropy()
(Pdb) s
--Call--
> /Library/Python/2.7/site-packages/nltk/probability.py(85)__init__()
-> def __init__(self, samples=None):
(Pdb) r
--Return--
> /Library/Python/2.7/site-packages/nltk/probability.py(105)__init__()->None
-> self.update(samples)
(Pdb) r
> <stdin>(3)entoropy()
(Pdb) p probs
*** NameError: NameError("name 'probs' is not defined",)
(Pdb) p [freqdist.freq(l) for l in nltk.FreqDist(labels)]
[1.0]

This is the relative frequency of 'male'. Every label in the list is 'male', so the value is 1.0.

(Pdb) s
--Call--
> /Library/Python/2.7/site-packages/nltk/probability.py(85)__init__()
-> def __init__(self, samples=None):
(Pdb) r
--Return--
> /Library/Python/2.7/site-packages/nltk/probability.py(105)__init__()->None
-> self.update(samples)
(Pdb) r
> <stdin>(3)entoropy()
(Pdb) p math.log(1.0,2)
0.0
(Pdb) q
>>> 
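
The session shows `math.log(1.0,2)` returning 0.0, while the function printed -0.0. A minimal sketch of where the minus sign likely comes from, assuming standard IEEE 754 floats (the variable names are illustrative only):

```python
import math

term = 1.0 * math.log(1.0, 2)   # the only term when every label is the same
total = sum([term])             # positive zero
result = -total                 # negating positive zero gives negative zero

print(result)                        # -0.0
print(result == 0.0)                 # True
print(math.copysign(1.0, result))    # -1.0, so the sign bit is set
print(result + 0.0)                  # 0.0
```

Negative zero compares equal to zero, so the -0.0 is only a display quirk. Adding 0.0 is one way to normalize it if the sign is distracting.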

log 1 is zero in any base, so the sum is 0.0 and negating it prints as -0.0. That case is easy to follow. What about the others? We can check them by computing by hand the individual terms that entoropy() sums before it returns.

>>> 0.5 * math.log(0.5,2)
-0.5
>>> 0.25 * math.log(0.25,2)
-0.5
>>> 0.75 * math.log(0.75,2)
-0.31127812445913283

If the probabilities of 'male' and 'female' are both 0.5, the return value is

-((-0.5)+(-0.5)) = 1.0 

If the probabilities are 0.75 and 0.25 (as in the third example, where 'female' is 0.75 and 'male' is 0.25),

-((-0.5)+(-0.31127812445913283)) = 0.811278124459

Both values match the output of the function above, so the calculation checks out.
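
The hand calculation above can also be scripted. The following is a small sketch, with a hypothetical helper `entropy_from_probs` that takes the probabilities directly and returns both the individual terms and the negated sum:

```python
import math

def entropy_from_probs(probs):
    """Return (terms, entropy): each p * log2(p) term and the negated sum."""
    terms = [p * math.log(p, 2) for p in probs]
    return terms, -sum(terms)

for probs in ([0.5, 0.5], [0.75, 0.25]):
    terms, h = entropy_from_probs(probs)
    print(probs, terms, h)
```

Printing the terms alongside the total makes it easy to compare against the values computed by hand.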