Performance limitation (5.5.7-5.5.8)

>>> cfd = nltk.ConditionalFreqDist(
...             ((x[1], y[1], z[0]), z[1])
...             for sent in brown_tagged_sents
...             for x, y, z in nltk.trigrams(sent))
>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0
>>> from __future__ import division
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0.049297702068029296

An ambiguous context is one in which more than one tag occurs, i.e. the two preceding tags plus the current word do not determine the current word's tag uniquely. The final expression computes the fraction of tokens that occur in such ambiguous contexts: about 5% (0.04929...). Note that under Python 2 the first attempt returns 0 because `/` performs integer division; `from __future__ import division` switches it to true division.
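The same computation can be sketched in pure Python without the Brown corpus, to make the definition of an ambiguous context concrete. The toy tagged sentences below are hypothetical stand-ins for `brown_tagged_sents`; "old man" is tagged once as JJ+NN and once as JJ+VB, so the context (DT, JJ, "man") is ambiguous:

```python
from collections import defaultdict

# Toy tagged sentences (hypothetical data standing in for brown_tagged_sents).
tagged_sents = [
    [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("old", "JJ"), ("man", "VB"), ("the", "DT"), ("boats", "NNS")],
]

# Condition = (tag of x, tag of y, word of z); observed value = tag of z.
cfd = defaultdict(lambda: defaultdict(int))
for sent in tagged_sents:
    for x, y, z in zip(sent, sent[1:], sent[2:]):  # trigrams
        cfd[(x[1], y[1], z[0])][z[1]] += 1

# A context is ambiguous if more than one tag was observed under it.
ambiguous = [c for c in cfd if len(cfd[c]) > 1]
total = sum(n for dist in cfd.values() for n in dist.values())
ambiguous_mass = sum(n for c in ambiguous for n in cfd[c].values())
print(ambiguous_mass / total)  # -> 0.4 on this toy data
```

Here 2 of the 5 trigram tokens fall under the one ambiguous context, giving 0.4; on the Brown corpus the same ratio comes out near 0.05.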

>>> test_tags = [tag for sent in brown.sents(categories='editorial')
...             for (word, tag) in t2.tag(sent)]
>>> gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
>>> print nltk.ConfusionMatrix(gold_tags, test_tags)
....

The confusion matrix compares the gold-standard tags with the tags assigned by t2. However, the output is very wide, likely more than 800 characters per row, which makes it hard to display.
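When the full matrix is too wide, one workaround is to list only the most frequent confusions instead of printing the whole grid. A minimal pure-Python sketch, using short hypothetical tag lists standing in for `gold_tags` and `test_tags`:

```python
from collections import Counter

# Hypothetical tag sequences standing in for gold_tags and test_tags above.
gold_tags = ["NN", "VB", "DT", "NN", "JJ", "NN"]
test_tags = ["NN", "NN", "DT", "NN", "JJ", "VB"]

# Tally only the disagreements: (gold, predicted) -> frequency.
confusions = Counter(
    (g, t) for g, t in zip(gold_tags, test_tags) if g != t
)
for (gold, pred), n in confusions.most_common(10):
    print("%s tagged as %s: %d" % (gold, pred, n))
```

Alternatively, `nltk.ConfusionMatrix` can render a trimmed chart via its `pretty_format` method (e.g. `cm.pretty_format(sort_by_count=True, truncate=9)` to keep only the most frequent tags), though the exact keyword support may vary by NLTK version.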