Performance limitation (5.5.7-5.5.8)
>>> cfd = nltk.ConditionalFreqDist(
...            ((x[1], y[1], z[0]), z[1])
...            for sent in brown_tagged_sents
...            for x, y, z in nltk.trigrams(sent))
>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0
>>> from __future__ import division
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0.049297702068029296
ambiguous_contexts collects the contexts (the two preceding tags plus the current word) in which more than one tag occurs, i.e. where the context alone cannot decide the tag. The final expression computes the share of tokens that fall in such ambiguous contexts: about 5% (0.04929...). Note that the first attempt prints 0 because of Python 2's integer division; importing division from __future__ gives the real-valued result.
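As a self-contained sketch of the same computation (runnable without nltk or the Brown corpus), the hand-built toy corpus and the `trigrams` helper below are assumptions for illustration; they mimic `nltk.ConditionalFreqDist` and `nltk.trigrams` with plain collections:

```python
from collections import defaultdict, Counter

# Toy tagged corpus: lists of (word, tag) pairs.
# In the real experiment these come from brown_tagged_sents.
sents = [
    [('the', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('old', 'JJ'), ('man', 'VB'), ('the', 'DT'), ('boat', 'NN')],
]

def trigrams(seq):
    """Yield consecutive triples, like nltk.trigrams."""
    return zip(seq, seq[1:], seq[2:])

# Condition = (tag two back, previous tag, current word); event = current tag.
cfd = defaultdict(Counter)
for sent in sents:
    for x, y, z in trigrams(sent):
        cfd[(x[1], y[1], z[0])][z[1]] += 1

# Contexts that license more than one tag are ambiguous.
ambiguous = [c for c in cfd if len(cfd[c]) > 1]
total = sum(sum(dist.values()) for dist in cfd.values())
share = sum(sum(cfd[c].values()) for c in ambiguous) / total
print(ambiguous, share)
```

Here the context ('DT', 'JJ', 'man') is ambiguous ("man" is tagged NN in one sentence and VB in the other), so 2 of the 5 trigram tokens, a share of 0.4, fall in ambiguous contexts; on the full Brown corpus the book's figure is about 0.05.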
>>> test_tags = [tag for sent in brown.sents(categories='editorial')
...              for (word, tag) in t2.tag(sent)]
>>> gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
>>> print nltk.ConfusionMatrix(gold_tags, test_tags)
...
This confusion matrix compares the gold-standard tags against the tags assigned by the t2 tagger. However, with the full Brown tagset the output is extremely wide, likely more than 800 columns per row, so it is hard to display in a terminal.
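One way around the width problem is to build the matrix yourself and print only the most frequent tags. The sketch below uses plain collections rather than `nltk.ConfusionMatrix`, and the tiny `gold_tags`/`test_tags` lists are made-up stand-ins for the real lists above:

```python
from collections import Counter

# Made-up stand-ins for the gold and predicted tag sequences.
gold_tags = ['NN', 'VB', 'NN', 'DT', 'JJ', 'NN', 'VB']
test_tags = ['NN', 'NN', 'NN', 'DT', 'JJ', 'VB', 'VB']

# Count (gold, predicted) pairs: cell (g, p) of the confusion matrix.
pairs = Counter(zip(gold_tags, test_tags))

# Restrict the display to the k most frequent gold tags to keep rows narrow.
k = 3
top = [t for t, _ in Counter(gold_tags).most_common(k)]

header = '     ' + ' '.join('%4s' % t for t in top)
rows = [header]
for g in top:
    rows.append('%4s ' % g + ' '.join('%4d' % pairs[(g, p)] for p in top))
print('\n'.join(rows))
```

Off-diagonal cells show which tag pairs the tagger confuses; trimming to the top k tags keeps the row width at roughly 5*(k+1) characters instead of 800+.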