Define Dictionary (5.3.3-5.3.4)

Defining a dictionary:

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos
{'furiously': 'ADV', 'sleep': 'V', 'ideas': 'N', 'colorless': 'ADJ'}
>>> pos2 = dict(colorless='ADJ', sleep='V', ideas='N', furiously='ADV')
>>> pos2
{'furiously': 'ADV', 'sleep': 'V', 'ideas': 'N', 'colorless': 'ADJ'}
>>> pos3 = {['ideas', 'blogs', 'adventures']: 'N'}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

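The last one failed because dictionary keys must be hashable, which in practice means immutable types such as strings, numbers or tuples. A list is mutable, so it cannot be used as a key.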
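A minimal sketch of two ways around the error, assuming the intent was either one entry keyed by the whole group of words or one entry per word. The names pos3, pos4 and list_key_ok are illustrative and not from the book.

```python
# A tuple is immutable, so it can serve as a dictionary key where a list cannot.
pos3 = {('ideas', 'blogs', 'adventures'): 'N'}
print(pos3[('ideas', 'blogs', 'adventures')])

# If each word should be looked up on its own, build one entry per word instead.
pos4 = {word: 'N' for word in ['ideas', 'blogs', 'adventures']}
print(pos4['blogs'])

# Confirm that a list really is rejected as a key.
try:
    {['ideas']: 'N'}
    list_key_ok = True
except TypeError:
    list_key_ok = False
print(list_key_ok)
```

The tuple version needs the whole tuple for a lookup, so the per-word version is probably closer to what a tag dictionary wants.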

defaultdict is one way to avoid an error (KeyError) when a key that does not exist is looked up.

>>> frequency = nltk.defaultdict(int)
>>> frequency['colorless'] = 4
>>> frequency['ideas']
0
>>> pos = nltk.defaultdict(list)
>>> pos['sleep']
[]

The first example uses int as the default factory and the second uses list. If the specified key does not exist in the dictionary, the default value (0 for int, [] for list) is returned and stored under that key.
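The same behaviour can be sketched with the standard library's collections.defaultdict. This assumes nltk.defaultdict in the session above refers to that same class; the counts variable and the small token list are made up for illustration.

```python
from collections import defaultdict

frequency = defaultdict(int)
frequency['colorless'] = 4
print(frequency['ideas'])      # missing key, so int() supplies the default
print(sorted(frequency))       # 'ideas' has now been stored as a key

pos = defaultdict(list)
pos['sleep'].append('V')       # the empty list is created on first access
print(pos['sleep'])

# Typical use: count tokens without checking for the key first.
counts = defaultdict(int)
for token in ['the', 'cat', 'the']:
    counts[token] += 1
print(dict(counts))
```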

>>> pos = nltk.defaultdict(lambda: 'N')
>>> pos['colorless'] = 'ADJ'
>>> pos['blog']
'N'
>>> pos.items()
[('blog', 'N'), ('colorless', 'ADJ')]

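In this example the default value is 'N', supplied by a lambda that takes no arguments. Note that simply looking up the missing key 'blog' adds a new entry for it, which is why 'blog' appears in pos.items().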
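To make the side effect visible, here is a small sketch using collections.defaultdict (assumed to be equivalent to nltk.defaultdict). It contrasts indexing with [] against the in operator and get(), which do not create entries. The variable names in_before, got, value and in_after are illustrative.

```python
from collections import defaultdict

pos = defaultdict(lambda: 'N')
pos['colorless'] = 'ADJ'

# Neither of these calls the default factory, so no entry is created.
in_before = 'blog' in pos
got = pos.get('blog')

# Indexing with [] does call the factory and stores the result.
value = pos['blog']
in_after = 'blog' in pos

print(in_before, got, value, in_after)
print(sorted(pos.items()))
```

If a lookup should not grow the dictionary, testing with in or calling get() first is one option.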

The next example replaces every word that is not among the 1000 most frequent words with 'UNK', a placeholder token for unknown (out-of-vocabulary) words.

>>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> vocab = nltk.FreqDist(alice)
>>> v1000 = list(vocab)[:1000]
>>> mapping = nltk.defaultdict(lambda: 'UNK')
>>> for v in v1000:
...     mapping[v] = v
>>> alice2 = [mapping[v] for v in alice]
>>> alice2[:100]
['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'UNK', 'UNK', 'UNK', 'UNK', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit', '-', 'UNK', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'UNK', 'into', 'the', 'book', 'her', 'sister', 'was', 'UNK', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'UNK', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'", 'So', 'she', 'was', 'UNK', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',']
>>> len(set(alice2))
1001
>>> len(alice2)

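The value of len(set(alice2)) should be 1001: the top 1000 words plus 'UNK'. len(alice2) is the same as len(alice), because every token is mapped to exactly one token.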
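The same idea can be tried without downloading the corpus. The sketch below is an assumption-laden stand-in: it uses a made-up token list instead of Alice, collections.Counter instead of nltk.FreqDist, and ranks words explicitly with most_common() rather than relying on the iteration order of the frequency distribution. The names tokens, top_n, top_words and tokens2 are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy token list standing in for the Alice corpus.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
top_n = 2  # plays the role of 1000 in the text

# Rank words by frequency and keep only the top_n most common ones.
counts = Counter(tokens)
top_words = [word for word, _ in counts.most_common(top_n)]

# Known words map to themselves; everything else falls back to 'UNK'.
mapping = defaultdict(lambda: 'UNK')
for word in top_words:
    mapping[word] = word

tokens2 = [mapping[word] for word in tokens]
print(tokens2)
print(len(set(tokens2)), len(tokens2))
```

Here the vocabulary size after mapping is top_n + 1 (the kept words plus 'UNK'), and the token count is unchanged, mirroring the 1001 result above.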