Decision Trees (6.4)
Entropy and Information Gain (6.4.1)
Let's try running a sample entropy calculation.
>||
>>> import nltk
>>> from nltk_init import *
>>> import math
>>> def entoropy(labels):
...     freqdist = nltk.FreqDist(labels)
...     probs = [freqdist.freq(l) for l in nltk.FreqDist(labels)]
...     return -sum([p * math.log(p,2) for p in probs])
... 
>>> print entoropy(['male', 'male', 'male', 'male'])
-0.0
>>> print entoropy(['male', 'female', 'male', 'female'])
1.0
>>> print entoropy(['female', 'female', 'male', 'female'])
0.811278124459
>>> print entoropy(['female', 'female', 'female', 'female'])
-0.0
>>>
||<
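For reference, the same values can be computed without NLTK, which makes the formula H = -sum(p * log2(p)) easier to see. This is only an illustrative sketch (the name plain_entropy is made up), assuming the same Python 2.7 environment as the session above; collections.Counter plays the role of FreqDist here:

>||
import math
from collections import Counter

def plain_entropy(labels):
    # Relative frequency of each distinct label (what FreqDist.freq() returns)
    n = float(len(labels))
    probs = [count / n for count in Counter(labels).values()]
    # H = -sum(p * log2(p)) over all labels
    return -sum(p * math.log(p, 2) for p in probs)

print plain_entropy(['male', 'female', 'male', 'female'])    # 1.0
print plain_entropy(['female', 'female', 'male', 'female'])  # 0.811278124459
||<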
Let's step through it in the debugger to understand it better.
>||
>>> import pdb
>>> pdb.run("print entoropy(['male', 'male', 'male', 'male'])")
> <string>(1)<module>()->None
(Pdb) s
--Call--
> <stdin>(1)entoropy()
(Pdb) s
> <stdin>(2)entoropy()
(Pdb) s
--Call--
> /Library/Python/2.7/site-packages/nltk/probability.py(85)__init__()
-> def __init__(self, samples=None):
(Pdb) r
--Return--
> /Library/Python/2.7/site-packages/nltk/probability.py(105)__init__()->None
-> self.update(samples)
(Pdb) r
> <stdin>(3)entoropy()
(Pdb) p probs
*** NameError: NameError("name 'probs' is not defined",)
(Pdb) p [freqdist.freq(l) for l in nltk.FreqDist(labels)]
[1.0]
||<
This is the relative frequency of 'male', so the value is 1.0.
>||
(Pdb) s
--Call--
> /Library/Python/2.7/site-packages/nltk/probability.py(85)__init__()
-> def __init__(self, samples=None):
(Pdb) r
--Return--
> /Library/Python/2.7/site-packages/nltk/probability.py(105)__init__()->None
-> self.update(samples)
(Pdb) r
> <stdin>(3)entoropy()
(Pdb) p math.log(1.0,2)
0.0
(Pdb) q
>>>
||<
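The same values can also be checked directly in the interpreter, without the debugger. A minimal sketch, assuming the same Python 2.7 / NLTK setup as the session above:

>||
import math
import nltk

labels = ['male', 'male', 'male', 'male']
freqdist = nltk.FreqDist(labels)

# freq() returns the relative frequency of a sample; with only 'male'
# in the list it is 1.0, matching the [1.0] printed under pdb.
print freqdist.freq('male')    # 1.0

# log2(1.0) is 0.0, so every p * log(p, 2) term is 0.0 and the
# negated sum comes out as -0.0.
print math.log(1.0, 2)         # 0.0
||<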
log(1) is always zero, so the sum is 0.0 and negating it gives -0.0. That case makes sense. What about the others? We can simulate the values returned by entoropy() by evaluating each term by hand.
>||
>>> 0.5 * math.log(0.5,2)
-0.5
>>> 0.25 * math.log(0.25,2)
-0.5
>>> 0.75 * math.log(0.75,2)
-0.31127812445913283
||<

If the probabilities of 'male' and 'female' are 0.5 and 0.5, the return value would be

>||
-((-0.5)+(-0.5)) = 1.0
||<
If the probabilities of 'male' and 'female' are 0.75 and 0.25,

>||
-((-0.5)+(-0.31127812445913283)) = 0.811278124459
||<
Sounds reasonable.
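The two hand calculations above can also be run as a single snippet. This is only an illustrative recap of the arithmetic, under the same Python 2.7 assumption, and not part of the original session:

>||
import math

# 'male' and 'female' at 0.5 / 0.5
print -((0.5 * math.log(0.5, 2)) + (0.5 * math.log(0.5, 2)))      # 1.0

# 'male' and 'female' at 0.75 / 0.25
print -((0.75 * math.log(0.75, 2)) + (0.25 * math.log(0.25, 2)))  # 0.811278124459...
||<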