Word collocations

According to the textbook, collocations are two or more words that tend to appear together frequently. This was also introduced in Chapter 1 of the O'Reilly text.

>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('grail.txt')]
>>> bcf = BigramCollocationFinder.from_words(words)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't')]

That looks a little complicated, so let's work out what happened.
The first three lines import the sample text and the required classes. Next, a list (words) of the words in grail.txt is generated, with each word converted to lower case. The list is passed to BigramCollocationFinder.from_words(). Finally, the top collocations are printed; the 4 asks for the top 4 collocations.
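The same pipeline can be sketched on a tiny hand-made word list instead of grail.txt, so it runs without downloading the webtext corpus (the toy sentence below is my own invention, not from the book):

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy word list standing in for webtext.words('grail.txt')
words = ['the', 'holy', 'grail', 'is', 'the', 'holy', 'grail',
         'of', 'the', 'holy', 'grail', 'quest']

finder = BigramCollocationFinder.from_words(words)

# nbest(measure, n) scores every bigram with the given association
# measure and returns the n highest-scoring pairs
top = finder.nbest(BigramAssocMeasures.likelihood_ratio, 2)
print(top)
```

Since ('holy', 'grail') appears three times in this toy list, it should rank among the top pairs.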

As described in the text, these results do not make much sense. The next sample tries to set a filter.

>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> filter_stops = lambda w: len(w) < 3 or w in stopset
>>> bcf.apply_word_filter(filter_stops)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[('black', 'knight'), ('head', 'knight'), ('holy', 'grail'), ('run', 'away')]

The results look much better than the previous ones. lambda was totally new to me. After some googling, I understand that lambda is a kind of function definition (a more casual version of def). So filter_stops is a function that returns True for any word shorter than 3 characters or included in stopset. It is then passed to apply_word_filter(), and matching words are excluded before the top 4 collocations are computed.
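To convince myself of the lambda/def equivalence, here is a small self-contained check. The tiny stopset below is hand-picked for illustration, standing in for stopwords.words('english'):

```python
# A small hand-picked stopset standing in for stopwords.words('english')
stopset = {'the', 'of', 'and', 'a', 'in'}

# The lambda form from the example...
filter_stops = lambda w: len(w) < 3 or w in stopset

# ...is just shorthand for this ordinary function definition
def filter_stops_def(w):
    return len(w) < 3 or w in stopset

# Both agree on every test word: short words and stopwords are
# filtered out (True), content words are kept (False)
for w in ['ni', 'the', 'knight', 'grail']:
    assert filter_stops(w) == filter_stops_def(w)

print(filter_stops('the'), filter_stops('knight'))
```

Note that filter_stops is a function, not a list: apply_word_filter() calls it on each word and discards those for which it returns True.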

In the next example, another tool, TrigramCollocationFinder, is used. It is really interesting to me that NLTK provides more than one tool for similar purposes.

>>> from nltk.collocations import TrigramCollocationFinder
>>> from nltk.metrics import TrigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('singles.txt')]
>>> tcf = TrigramCollocationFinder.from_words(words)
>>> tcf.apply_word_filter(filter_stops)
>>> tcf.apply_freq_filter(3)
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)
[('long', 'term', 'relationship')]

filter_stops comes from the previous example. apply_freq_filter() sets a frequency threshold: in this case, trigrams that appear fewer than 3 times are ignored. The result is a single tuple, even though 4 is still passed to nbest(). I still believe this parameter restricts the number of results returned. I tried the following, but the results are all the same; probably only one combination meets the criteria.

>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 1)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 2)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 3)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 5)
[('long', 'term', 'relationship')]
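This behavior can be reproduced on a toy corpus where exactly one trigram survives the frequency filter, which suggests that nbest() returns at most n results, not exactly n (the sentence below is made up for illustration, not taken from singles.txt):

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

# Toy corpus in which only one trigram occurs 3 or more times
words = ('long term relationship wanted , long term relationship needed , '
         'long term relationship found').split()

tcf = TrigramCollocationFinder.from_words(words)
tcf.apply_freq_filter(3)  # drop trigrams seen fewer than 3 times

# With only one surviving trigram, any n >= 1 yields the same
# single-item list: nbest returns min(n, candidates) results
results = [tcf.nbest(TrigramAssocMeasures.likelihood_ratio, n)
           for n in (1, 4, 10)]
print(results)
```

So the parameter is indeed an upper bound on how many collocations are displayed; the filters simply left only one candidate.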