Entries from 2013-06-01 (1 month)

Performance limitation (5.5.7-5.5.8)

>>> cfd = nltk.ConditionalFreqDist(
...     ((x[1], y[1], z[0]), z[1])
...     for sent in brown_tagged_sents
...     for x, y, z in nltk.trigrams(sent))
>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
>>> sum(cfd[c].N() for …
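The excerpt is cut off; presumably it finishes the book's calculation, dividing the counts in ambiguous contexts by the total. A self-contained sketch of that measurement:

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
# Condition: (tag two back, previous tag, current word); sample: current tag.
cfd = nltk.ConditionalFreqDist(
    ((x[1], y[1], z[0]), z[1])
    for sent in brown_tagged_sents
    for x, y, z in nltk.trigrams(sent))
# A context is ambiguous if more than one tag was observed for it.
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
print(float(sum(cfd[c].N() for c in ambiguous_contexts)) / cfd.N())
# The book reports roughly 0.05: about 1 word in 20 sits in an ambiguous context.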

Combine Taggers (5.5.4-5.5.5)

When a tagger cannot assign a tag, it is possible to fall back to a more general tagger by using the backoff option.

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backof…
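A self-contained sketch of the full backoff chain, with train_sents and test_sents built as in the earlier posts (the accuracy figure is hedged from the book):

import nltk
from nltk.corpus import brown

tagged = brown.tagged_sents(categories='news')
size = int(len(tagged) * 0.9)
train_sents, test_sents = tagged[:size], tagged[size:]

t0 = nltk.DefaultTagger('NN')                     # last resort: everything is NN
t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # per-word most likely tag
t2 = nltk.BigramTagger(train_sents, backoff=t1)   # previous tag as extra context
print(t2.evaluate(test_sents))                    # the book reports about 0.84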

Generic N-gram tagger (5.3.3)

A unigram tagger assigns each word the tag it is most "probably" used with. Its limitation is that it looks at each single word in isolation. An n-gram tagger also checks the tags of neighboring words.

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_…
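To see why the backoff in the next post matters, here is a sketch of a bigram tagger trained alone; unseen contexts yield None and accuracy collapses (figure hedged from the book):

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

bigram_tagger = nltk.BigramTagger(train_sents)
# Any unseen (previous tag, word) context gets None, and every tag after
# a None is also None, so a standalone bigram tagger scores very low.
print(bigram_tagger.evaluate(test_sents))  # the book reports about 0.10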

Separating Training and Test data (5.5.2)

What a busy week! Today's topic is also a short one.

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagg…
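The point of the split shows up when the same tagger is scored on both halves; a minimal sketch (numbers hedged from the book):

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

unigram_tagger = nltk.UnigramTagger(train_sents)
print(unigram_tagger.evaluate(train_sents))  # optimistic: data already seen
print(unigram_tagger.evaluate(test_sents))   # honest estimate, about 0.81 in the book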

Unigram tagger (5.5.1)

Today's article is short as I am too busy today! Unigram tagger:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramT…
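A sketch of where the session presumably goes next in the book: train on the tagged sentences and tag one of the raw ones.

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

# Train on the whole tagged corpus, then tag a raw sentence from it.
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))
print(unigram_tagger.evaluate(brown_tagged_sents))  # about 0.93 in the book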

Lookup tagger (5.4.3-5.4.4)

Lookup tagger:

>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.keys()[:100]
>>> likely_tags = dict((word, cfd[word].max()) for word i…
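The fd.keys()[:100] idiom relies on NLTK 2, where FreqDist keys were frequency-ordered; a sketch of the same lookup tagger in NLTK 3 terms:

import nltk
from nltk.corpus import brown

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# most_common() replaces the old frequency-ordered fd.keys()[:100].
most_freq_words = [word for word, _ in fd.most_common(100)]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
print(baseline_tagger.evaluate(brown.tagged_sents(categories='news')))
# The book reports roughly 0.58 with the NN backoff.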

Automatic tagging (5.4-5.4.2)

Start preparation:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

Check which tag is used most frequently. It's 'NN'.

>>> tags = [tag for…
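A self-contained sketch of the step this excerpt is heading for: find the most frequent tag, then build a DefaultTagger from it.

import nltk
from nltk.corpus import brown

# The single most frequent tag in the news category is 'NN'.
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
most_common_tag = nltk.FreqDist(tags).max()

# A default tagger assigns that one tag to every token.
default_tagger = nltk.DefaultTagger(most_common_tag)
print(default_tagger.tag(nltk.word_tokenize('I like green eggs and ham')))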

Compounded keys and values (5.3.6-)

>>> pos = nltk.defaultdict(lambda: nltk.defaultdict(int))
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> for ((w1, t1), (w2, t2)) in nltk.ibigrams(brown_news_tagged):
...     pos[(t1, w2)][t2] += 1
...
>>…
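Both nltk.ibigrams() and simplify_tags=True are NLTK 2 spellings; a sketch of the same counting with the NLTK 3 equivalents (nltk.bigrams, tagset='universal'):

import nltk
from collections import defaultdict
from nltk.corpus import brown

# For each (previous tag, word) pair, count which tag the word receives.
pos = defaultdict(lambda: defaultdict(int))
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
for ((w1, t1), (w2, t2)) in nltk.bigrams(brown_news_tagged):
    pos[(t1, w2)][t2] += 1

# Tags observed for 'right' when the previous word was a determiner:
print(dict(pos[('DET', 'right')]))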

Incrementing dictionary values (5.3.5)

Count occurrences per tag.

>>> counts = nltk.defaultdict(int)
>>> from nltk.corpus import brown
>>> for (word, tag) in brown.tagged_words(categories='news'):
...     counts[tag] += 1
...
>>> counts['N']
0

The result was different from the textbook…
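The 0 is presumably because the raw Brown tagset was counted: it has 'NN', 'NNS', and so on, while the bare 'N' only exists in the simplified tag set. A sketch that shows both:

import nltk
from collections import defaultdict
from nltk.corpus import brown

counts = defaultdict(int)
for (word, tag) in brown.tagged_words(categories='news'):
    counts[tag] += 1

print(counts['N'])   # 0: 'N' is not a raw Brown tag
print(counts['NN'])  # large: the raw singular-noun tag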

Define Dictionary (5.3.3-5.3.4)

Defining a dictionary:

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos
{'furiously': 'ADV', 'sleep': 'V', 'ideas': 'N', 'colorless': 'ADJ'}
>>> pos2 = dict(colorless='ADJ', sleep='V', ideas='N', furio…

Mapping attributes to words by Python (5.3-5.3.2)

Creating a Python dictionary:

>>> pos = []
>>> pos
[]
>>> pos['colorless'] = 'ADJ'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

This is a typical mistake. We sho…
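The cut-off fix is presumably the obvious one: start from an empty dict rather than an empty list. A minimal sketch:

# A dict, unlike a list, accepts strings as keys.
pos = {}
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
print(pos['colorless'])  # 'ADJ'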

Non-simplified tags (5.2.7-)

Analysing nouns in further detail. This program shows the top 5 words for each tag type.

>>> def findtags(tag_prefix, tagged_text):
...     cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix))
...     retu…
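A runnable version of findtags; the return line here is an NLTK 3 rendering of the book's cfd[tag].keys()[:5]:

import nltk
from nltk.corpus import brown

def findtags(tag_prefix, tagged_text):
    # Top 5 words for every tag starting with tag_prefix.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, [w for w, _ in cfd[tag].most_common(5)])
                for tag in cfd.conditions())

tagdict = findtags('NN', brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])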

Simplified Tag set (5.2.3-5.2.6)

Using the simplified tag set:

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'DET…
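simplify_tags=True is NLTK 2 only; in NLTK 3 the nearest equivalent is tagset='universal', with different tag names (e.g. 'NOUN' instead of 'N'). A sketch:

import nltk
from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())  # tags ordered by frequency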

Picking up combinations of Pinyin and Chinese characters

It's time to check combinations of Pinyin and Chinese characters (Hanzi). First, I created a new function named split_per_hanzi().

def split_per_hanzi(word_list):
    # ['你好','ni3 hao3'] --> [['ni3', '你'],['hao3', '好']]
    s_h…
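The body is cut off above; a hypothetical implementation matching the docstring comment (pairing the n-th Pinyin syllable with the n-th character) could be:

def split_per_hanzi(word_list):
    # ['你好', 'ni3 hao3'] --> [['ni3', '你'], ['hao3', '好']]
    # Assumes unicode strings (automatic in Python 3, u'' literals in Python 2).
    hanzi, pinyin = word_list
    # Pair the n-th syllable with the n-th character by position.
    return [[syl, ch] for syl, ch in zip(pinyin.split(), hanzi)]

print(split_per_hanzi(['你好', 'ni3 hao3']))  # [['ni3', '你'], ['hao3', '好']]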

Corpus with tags (5.2-5.2.2)

A tagged token is expressed as a tuple.

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

A longer sample:

>>> sent = '''
... The/AT grand/JJ jury/NN comment…
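The truncated session presumably continues as in the book, splitting the string and converting each token; a sketch:

import nltk

sent = 'The/AT grand/JJ jury/NN said/VBD'
tagged = [nltk.tag.str2tuple(t) for t in sent.split()]
print(tagged)  # [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('said', 'VBD')]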

Use Tagger (5.1)

Moving forward to Chapter 5, although the exercises of Chapter 4 are still remaining.

>>> text = nltk.word_tokenize("And now for somthing completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('somthing', 'VBG')…
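Note the typo 'somthing' in the input, which is presumably why the tagger guesses 'VBG' here; with the intended spelling the book gets 'NN':

import nltk

text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
# The book shows: [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
#                 ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]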

Exercise: Chapter 4 (1-2)

1. Just showing the help documents.

>>> help(str)
>>> help(list)
>>> help(tuple)

2. Just compare the two help documents.

Tuple:

Help on class tuple in module __builtin__:

class tuple(object)
 |  tuple() -> empty tuple
 |  tuple(iterable) -> tup…

Analysing Chinese words 2

I had overlooked ConditionalFreqDist since the last article.

>>> ccfd = nltk.ConditionalFreqDist((c, v) for (c, v, tone) in ping_elements)
>>> ccfd.conditions()
['', 'b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh',…
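A self-contained sketch of the same idea, with a hypothetical ping_elements of (initial, final, tone) triples:

import nltk

# Hypothetical sample, as if parsed from syllables like 'ni3' and 'hao3'.
ping_elements = [('n', 'i', '3'), ('h', 'ao', '3'), ('', 'ai', '4')]

ccfd = nltk.ConditionalFreqDist((c, v) for (c, v, tone) in ping_elements)
print(sorted(ccfd.conditions()))  # the initials: ['', 'h', 'n']
print(dict(ccfd['n']))            # finals observed after initial 'n': {'i': 1}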

CSV for unicode, Numpy (4.8.3-4.8.4)

Handling a CSV file: I have a good CSV file that was generated from my Chinese word learning database. There are 5,800 words inside.

>>> import csv, codecs
>>> import_file = codecs.open("/Users/xxx/Documents/workspace/NLTK Learning/text file…
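In Python 2 the csv module cannot consume unicode streams directly; a common workaround (the file name here is hypothetical) is to parse bytes and decode each field afterwards:

import csv

# Read bytes with csv, then decode each field to unicode.
with open('words.csv', 'rb') as f:
    rows = [[field.decode('utf-8') for field in row]
            for row in csv.reader(f)]
print(rows[0])  # first record, now as unicode strings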

NetworkX (4.8.2)

When I tried, http://networkx.lanl.gov/ was not available. I used http://networkx.github.io to read the documents instead. For installation, I did as follows; sudo was required because of an authorization problem.

$ sudo pip install networkx
Passw…
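A minimal sketch (not from the post) to confirm the install works:

import networkx as nx

# Build a tiny graph and query it.
g = nx.Graph()
g.add_edge('dog', 'canine')
g.add_edge('canine', 'carnivore')
print(g.number_of_nodes(), g.number_of_edges())  # 3 2
print(sorted(g.neighbors('canine')))             # ['carnivore', 'dog']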

Matplotlib (4.8.1)

This is not my first time seeing Matplotlib.

colors = 'rgbcmyk'  # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    import pylab
    ind = pylab.arang…
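A smaller sketch of the same grouped bar chart using matplotlib.pyplot directly (pylab is discouraged these days); the counts are made up for illustration:

import numpy as np
import matplotlib.pyplot as plt

categories = ['news', 'religion']
words = ['can', 'could']
counts = {'news': [93, 86], 'religion': [82, 59]}  # made-up sample counts

ind = np.arange(len(words))
width = 0.35
for i, cat in enumerate(categories):
    # One group of bars per category, offset so the groups sit side by side.
    plt.bar(ind + i * width, counts[cat], width, label=cat)
plt.xticks(ind + width / 2, words)
plt.ylabel('Frequency')
plt.legend()
plt.show()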

Accessing Chinese word database

As I already mentioned previously, I have a Chinese word database which was created when I was learning Chinese. This database includes 5,000+ words, mainly picked from the HSK Level 6 vocabulary. First I wrote some code to process Pingyi…

Algorithm design 2 (4.7.2-4.7.3)

Improve search speed by building an index.

def raw(file):
    contents = open(file).read()
    contents = re.sub(r'<.*?>', ' ', contents)
    contents = re.sub('\s+', ' ', contents)
    return contents

def snippet(doc, term):  # buggy
    text = ' ' * 30 + raw(…
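The index idea in a self-contained sketch: scan every file once up front so each query becomes a single dictionary lookup (the file names here are hypothetical):

import re
import nltk

def raw(file):
    # Strip HTML tags and collapse whitespace.
    contents = open(file).read()
    contents = re.sub(r'<.*?>', ' ', contents)
    contents = re.sub(r'\s+', ' ', contents)
    return contents

files = ['doc1.html', 'doc2.html']  # hypothetical document list
# nltk.Index maps each word to every file containing it.
index = nltk.Index((w, f) for f in files for w in raw(f).split())
print(index['language'])  # files containing 'language'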

Algorithm Design (4.7.1)

Recursion: first using a for loop.

>>> def factorial1(n):
...     result = 1
...     for i in range(n):
...         result *= (i + 1)
...     return result
...
>>> factorial1(3)
6
>>> factorial1(8)
40320
>>> factorial1(10)
3628800

The same logic can be realized with recurs…
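The truncated part presumably shows the book's recursive version; a sketch:

def factorial2(n):
    # n! = n * (n-1)!, with 1 as the base case.
    if n == 1:
        return 1
    return n * factorial2(n - 1)

print(factorial2(10))  # 3628800, matching factorial1 above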

Structure of Python module (4.6)

It took some time to find the source code on my laptop when I faced strange behavior, because I didn't know this attribute.

>>> nltk.metrics.distance.__file__
'/Library/Python/2.7/site-packages/nltk/metrics/distance.pyc'

To get help of the…
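The cut-off sentence presumably introduces help(); a minimal sketch of both tricks together:

import nltk

# __file__ reveals where the module lives on disk (.py or compiled .pyc).
print(nltk.metrics.distance.__file__)
# help() renders the module's docstrings in the interactive pager.
help(nltk.metrics.distance)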

Japanese WordNet (12.1.5)

There are several Japanese thesauri, but many of them are not free of charge. Instead of them, Japanese WordNet is available here: http://nlpwww.nict.go.jp/wn-ja/ First, I downloaded the file "wnjpn-all.tab" and put it under "nltk_data/wnjpn".…
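A hypothetical sketch of loading the tab file, assuming rows of the form synset-id <TAB> lemma <TAB> source as in the wn-ja distribution:

import codecs
from collections import defaultdict

# Map each synset id to its Japanese lemmas.
jpn_wn = defaultdict(list)
with codecs.open('nltk_data/wnjpn/wnjpn-all.tab', 'r', 'utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            jpn_wn[fields[0]].append(fields[1])

print(len(jpn_wn))  # number of synsets that have Japanese entries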

Accumulative Functions (4.5.3)

Accumulative functions, from Chapter 4.5.3 of the whale book. filter() takes two parameters: the first is a function and the second is sequential data. The return value will be the elements of the sequential data for which True is returned by the…
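A sketch along the lines of the book's example (list() added so it also runs on Python 3, where filter() returns an iterator):

def is_content_word(word):
    # True for words that are not short function words or punctuation.
    return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
print(list(filter(is_content_word, sent)))
# ['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']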