Compounded keys and values (5.3.6-) - Deutschina's Tech Diary

>>> import nltk
>>> from nltk.corpus import brown
>>> pos = nltk.defaultdict(lambda: nltk.defaultdict(int))
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> for ((w1, t1), (w2, t2)) in nltk.ibigrams(brown_news_tagged):
...     pos[(t1, w2)][t2] += 1
... 
>>> pos[('DET', 'right')]
defaultdict(<type 'int'>, {'ADV': 3, 'ADJ': 9, 'N': 4})
>>>

The key of pos is the tuple (t1, w2). Because the loop walks over bigrams of the tagged corpus, t1 is the tag (POS) of the previous word and w2 is the current word. So pos[('DET', 'right')] returns how often each POS is assigned to the word 'right' when the previous word is a determiner (DET, such as "a" or "the"). According to the result, ADJ is the most frequent.
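To see the compound key at work without loading the Brown corpus, here is a minimal sketch on a hand-made tagged sentence. The toy data and its tags are made up for illustration, and collections.defaultdict with zip() stand in for nltk.defaultdict and nltk.ibigrams.

```python
from collections import defaultdict

# Hypothetical toy data: (word, tag) pairs, not taken from the Brown corpus.
tagged = [
    ('the', 'DET'), ('right', 'ADJ'), ('answer', 'N'),
    ('the', 'DET'), ('right', 'N'), ('to', 'TO'), ('vote', 'V'),
    ('the', 'DET'), ('right', 'ADJ'), ('hand', 'N'),
]

# Outer key: (tag of previous word, current word).
# Inner key: tag of the current word; inner value: how often it was seen.
pos = defaultdict(lambda: defaultdict(int))
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    pos[(t1, w2)][t2] += 1

right_after_det = dict(pos[('DET', 'right')])
print(right_after_det)  # {'ADJ': 2, 'N': 1}
```

In this toy data 'right' follows a determiner three times, twice as ADJ and once as N, which is the same kind of answer the Brown example gives.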

The book introduces the next example as a time-consuming process (although it took less than a second on my machine).

>>> counts = nltk.defaultdict(int)
>>> for word in nltk.corpus.gutenberg.words('milton-paradise.txt'):
...     counts[word] += 1
... 
>>> [key for (key, value) in counts.items() if value == 32]
['brought', 'Him', 'virtue', 'Against', 'There', 'thine', 'King', 'mortal', 'every', 'been']

This appears to extract the words that occur exactly 32 times in the corpus.

This example uses two dictionaries. The first maps each word to its POS (word, pos); the second is its inverse and maps each POS to a word (pos, word).

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos2 = dict((value, key) for (key, value) in pos.items())
>>> pos2['N']
'ideas'

Can we invert counts the same way?

>>> counts2 = dict((value, key) for (key, value) in counts.items())
>>> [key for (value, key) in counts2.items() if value == 32]
['been']

Only one word ('been') is displayed. The reason is that several words share the same value (32): once the values become keys, each word overwrites the previous one, so only one of them survives.
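The overwriting is easy to reproduce on a small scale. The sketch below uses a made-up counts dictionary (the numbers are hypothetical, not real corpus counts) and assumes Python 3.7 or later, where dictionaries keep insertion order, so the last word inserted for a given count is the one that remains.

```python
# Hypothetical counts: three words share the value 32.
counts = {'brought': 32, 'virtue': 32, 'been': 32, 'heaven': 411}

# After inversion the count is the key, so each word with the same
# count replaces the one stored before it.
counts2 = dict((value, key) for (key, value) in counts.items())

print(len(counts), len(counts2))  # 4 2
print(counts2[32])                # been
```

Four entries go in and only two come out, which is why a plain dict is not enough when values are not unique.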

To avoid this problem, keep a list of words for each value and add to it with append().

>>> pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
>>> pos2 = nltk.defaultdict(list)
>>> for key, value in pos.items():
...     pos2[value].append(key)
... 
>>> pos2['ADV']
['peacefully', 'furiously']

An easier version uses nltk.Index:

>>> pos2 = nltk.Index((value, key) for (key, value) in pos.items())
>>> pos2['ADV']
['peacefully', 'furiously']
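As far as I can tell, nltk.Index behaves like a defaultdict(list) that is filled from (key, value) pairs. The following is a rough sketch of that idea using only the standard library; build_index is a hypothetical helper name of my own, not part of NLTK.

```python
from collections import defaultdict

def build_index(pairs):
    """Group values by key (hypothetical stand-in for nltk.Index)."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV',
       'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'}

# Swap each (word, tag) pair so that the tag becomes the key.
pos2 = build_index((value, key) for (key, value) in pos.items())
print(sorted(pos2['ADV']))  # ['furiously', 'peacefully']
```

The output is sorted here because the order of words inside each list depends on the dictionary's iteration order, which differs between Python versions.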