Word collocations
According to the textbook, collocations are two or more words that tend to appear frequently together. This was also introduced in Chapter 1 of the O'Reilly text.
>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('grail.txt')]
>>> bcf = BigramCollocationFinder.from_words(words)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't')]
A little bit complicated... Let's work out what happened.
The first 3 lines import the sample text and the methods. Then a list (words) of the words in grail.txt is generated, with each word converted to lower case. The list is passed to BigramCollocationFinder.from_words(), and finally the top collocations are output. The 4 presumably stands for the top 4 collocations.
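To see what the finder is doing underneath, here is a rough sketch that counts bigrams by raw frequency on a toy word list. Note this is only an illustration: the real nbest() above ranks by the likelihood-ratio score, not plain counts, and the toy sentence here is my own.

```python
from collections import Counter

# Toy word list standing in for webtext.words('grail.txt')
words = "the holy grail the holy grail the black knight".split()

# Pair each word with its successor to form bigrams
bigrams = list(zip(words, words[1:]))

# Count how often each bigram occurs and keep the most frequent ones
counts = Counter(bigrams)
top = [bg for bg, _ in counts.most_common(2)]
print(top)
```

BigramCollocationFinder.from_words() builds similar frequency tables, and the association measure then scores each bigram before nbest() picks the highest-scoring ones.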
As described in the text, the result does not make sense. The next sample tries to set a filter.
>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> filter_stops = lambda w: len(w) < 3 or w in stopset
>>> bcf.apply_word_filter(filter_stops)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[('black', 'knight'), ('head', 'knight'), ('holy', 'grail'), ('run', 'away')]
The results look much better than the previous ones. lambda is totally new to me. After googling, I understand that lambda is a kind of function definition (a casual, anonymous version of def). So filter_stops is not a list but a function: it returns True for a word whose length is less than 3 or which is included in stopset. apply_word_filter() then uses filter_stops as a filter, and words it matches are excluded before picking the top 4 collocations.
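The lambda can be tried on its own, without NLTK, to confirm how the filter behaves. This is a small self-contained sketch; the stopset and word list here are made up for illustration, not taken from the corpus.

```python
# A tiny stand-in for stopwords.words('english')
stopset = {'the', 'a', 'of', 'in'}

# Same shape as the filter in the text: True means "drop this word"
filter_stops = lambda w: len(w) < 3 or w in stopset

words = ['the', 'black', 'knight', 'of', 'ni', 'runs']

# apply_word_filter() excludes words for which the filter returns True
kept = [w for w in words if not filter_stops(w)]
print(kept)  # ['black', 'knight', 'runs']
```

'the' and 'of' are dropped as stopwords, 'ni' is dropped for being shorter than 3 characters, and the content words survive.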
In this example, another tool, TrigramCollocationFinder, is used. It is really interesting to me that more than one tool is provided for similar purposes in NLTK.
>>> from nltk.collocations import TrigramCollocationFinder
>>> from nltk.metrics import TrigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('singles.txt')]
>>> tcf = TrigramCollocationFinder.from_words(words)
>>> tcf.apply_word_filter(filter_stops)
>>> tcf.apply_freq_filter(3)
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)
[('long', 'term', 'relationship')]
filter_stops comes from the previous example. apply_freq_filter() sets a frequency threshold: in this case, the finder ignores any trigram that appears fewer than 3 times. The result is only one set even though 4 is still passed as a parameter to nbest(). I still believe this parameter restricts the number of results displayed. I tried the following, and the results are all the same, so probably only one combination meets the criteria.
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 1)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 2)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 3)
[('long', 'term', 'relationship')]
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 5)
[('long', 'term', 'relationship')]
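The behavior above matches how ranking-then-slicing works in Python: taking a slice past the end of a list simply returns the whole list. Here is a sketch of that idea with a hypothetical nbest_like() helper and a made-up score; the real nbest() computes likelihood-ratio scores internally, so this only illustrates why n larger than the number of surviving candidates makes no difference.

```python
# Only one trigram survived the word and frequency filters;
# 42.0 is a made-up score for illustration.
scored = {('long', 'term', 'relationship'): 42.0}

def nbest_like(scored, n):
    """Rank candidates by score (highest first) and keep the top n."""
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[:n]  # slicing past the end just returns everything

print(nbest_like(scored, 4))  # [('long', 'term', 'relationship')]
print(nbest_like(scored, 1))  # [('long', 'term', 'relationship')]
```

With a single candidate, any n >= 1 yields the same one-element list, which is exactly what the REPL session above shows.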