Entries from 2013-06-01 (1 month)

Performance limitation (5.5.7-5.5.8)

>>> cfd = nltk.ConditionalFreqDist(
...     ((x[1], y[1], z[0]), z[1])
...     for sent in brown_tagged_sents
...     for x, y, z in nltk.trigrams(sent))
>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
>>> sum(cfd[c].N() for …
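The excerpt is cut off; presumably it finishes the book's calculation, dividing the counts in ambiguous contexts by the total. A self-contained sketch of that measurement:

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
# Condition: (tag two back, previous tag, current word); sample: current tag.
cfd = nltk.ConditionalFreqDist(
    ((x[1], y[1], z[0]), z[1])
    for sent in brown_tagged_sents
    for x, y, z in nltk.trigrams(sent))
# A context is ambiguous if more than one tag was observed for it.
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
print(float(sum(cfd[c].N() for c in ambiguous_contexts)) / cfd.N())
# The book reports roughly 0.05: about 1 word in 20 sits in an ambiguous context.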

Combine Taggers (5.5.4-5.5.5)

When a tagger cannot assign a tag, it is possible to fall back to a more general tagger by using the backoff option.

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backof…
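A self-contained sketch of the full backoff chain, with train_sents and test_sents built as in the earlier posts (the accuracy figure is hedged from the book):

import nltk
from nltk.corpus import brown

tagged = brown.tagged_sents(categories='news')
size = int(len(tagged) * 0.9)
train_sents, test_sents = tagged[:size], tagged[size:]

t0 = nltk.DefaultTagger('NN')                     # last resort: everything is NN
t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # per-word most likely tag
t2 = nltk.BigramTagger(train_sents, backoff=t1)   # previous tag as extra context
print(t2.evaluate(test_sents))                    # the book reports about 0.84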

Generic N-gram tagger (5.3.3)

A unigram tagger assigns each word the tag it is most "probably" used with. Its limitation is that it looks at each single word in isolation. An n-gram tagger also checks the tags of neighboring words.

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_…
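To see why the backoff in the next post matters, here is a sketch of a bigram tagger trained alone; unseen contexts yield None and accuracy collapses (figure hedged from the book):

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

bigram_tagger = nltk.BigramTagger(train_sents)
# Any unseen (previous tag, word) context gets None, and every tag after
# a None is also None, so a standalone bigram tagger scores very low.
print(bigram_tagger.evaluate(test_sents))  # the book reports about 0.10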

Separating Training and Test data (5.5.2)

What a busy week! Today's topic is also a short one.

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagg…
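The point of the split shows up when the same tagger is scored on both halves; a minimal sketch (numbers hedged from the book):

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

unigram_tagger = nltk.UnigramTagger(train_sents)
print(unigram_tagger.evaluate(train_sents))  # optimistic: data already seen
print(unigram_tagger.evaluate(test_sents))   # honest estimate, about 0.81 in the book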

Unigram tagger (5.5.1)

Today's article is short as I am too busy today! Unigram tagger:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramT…
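A sketch of where the session presumably goes next in the book: train on the tagged sentences and tag one of the raw ones.

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

# Train on the whole tagged corpus, then tag a raw sentence from it.
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))
print(unigram_tagger.evaluate(brown_tagged_sents))  # about 0.93 in the book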

Lookup tagger (5.4.3-5.4.4)

Lookup tagger:

>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.keys()[:100]
>>> likely_tags = dict((word, cfd[word].max()) for word i…
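The fd.keys()[:100] idiom relies on NLTK 2, where FreqDist keys were frequency-ordered; a sketch of the same lookup tagger in NLTK 3 terms:

import nltk
from nltk.corpus import brown

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# most_common() replaces the old frequency-ordered fd.keys()[:100].
most_freq_words = [word for word, _ in fd.most_common(100)]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
print(baseline_tagger.evaluate(brown.tagged_sents(categories='news')))
# The book reports roughly 0.58 with the NN backoff.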

Automatic tagging (5.4-5.4.2)

Start preparation:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

Check which tag is used most frequently. It's 'NN'.

>>> tags = [tag for…
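A self-contained sketch of the step this excerpt is heading for: find the most frequent tag, then build a DefaultTagger from it.

import nltk
from nltk.corpus import brown

# The single most frequent tag in the news category is 'NN'.
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
most_common_tag = nltk.FreqDist(tags).max()

# A default tagger assigns that one tag to every token.
default_tagger = nltk.DefaultTagger(most_common_tag)
print(default_tagger.tag(nltk.word_tokenize('I like green eggs and ham')))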

Compounded keys and values (5.3.6-)

>>> pos = nltk.defaultdict(lambda: nltk.defaultdict(int))
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> for ((w1, t1), (w2, t2)) in nltk.ibigrams(brown_news_tagged):
...     pos[(t1, w2)][t2] += 1
...
>>…
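Both nltk.ibigrams() and simplify_tags=True are NLTK 2 spellings; a sketch of the same counting with the NLTK 3 equivalents (nltk.bigrams, tagset='universal'):

import nltk
from collections import defaultdict
from nltk.corpus import brown

# For each (previous tag, word) pair, count which tag the word receives.
pos = defaultdict(lambda: defaultdict(int))
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
for ((w1, t1), (w2, t2)) in nltk.bigrams(brown_news_tagged):
    pos[(t1, w2)][t2] += 1

# Tags observed for 'right' when the previous word was a determiner:
print(dict(pos[('DET', 'right')]))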

Incrementing dictionary values (5.3.5)

Count occurrences per tag.

>>> counts = nltk.defaultdict(int)
>>> from nltk.corpus import brown
>>> for (word, tag) in brown.tagged_words(categories='news'):
...     counts[tag] += 1
...
>>> counts['N']
0

The result was different from the textbook…
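The 0 is presumably because the raw Brown tagset was counted: it has 'NN', 'NNS', and so on, while the bare 'N' only exists in the simplified tag set. A sketch that shows both:

import nltk
from collections import defaultdict
from nltk.corpus import brown

counts = defaultdict(int)
for (word, tag) in brown.tagged_words(categories='news'):
    counts[tag] += 1

print(counts['N'])   # 0: 'N' is not a raw Brown tag
print(counts['NN'])  # large: the raw singular-noun tag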

Define Dictionary (5.3.3-5.3.4)

Defining a dictionary:

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos
{'furiously': 'ADV', 'sleep': 'V', 'ideas': 'N', 'colorless': 'ADJ'}
>>> pos2 = dict(colorless='ADJ', sleep='V', ideas='N', furio…

Mapping attributes to words by Python (5.3-5.3.2)

Creating a Python dictionary:

>>> pos = []
>>> pos
[]
>>> pos['colorless'] = 'ADJ'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

This is a typical mistake. We sho…
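The cut-off fix is presumably the obvious one: start from an empty dict rather than an empty list. A minimal sketch:

# A dict, unlike a list, accepts strings as keys.
pos = {}
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
print(pos['colorless'])  # 'ADJ'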

Non-simplified tags (5.2.7-)

Analysing nouns in further detail. This program shows the top 5 words for each tag type.

>>> def findtags(tag_prefix, tagged_text):
...     cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix))
...     retu…
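A runnable version of findtags; the return line here is an NLTK 3 rendering of the book's cfd[tag].keys()[:5]:

import nltk
from nltk.corpus import brown

def findtags(tag_prefix, tagged_text):
    # Top 5 words for every tag starting with tag_prefix.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, [w for w, _ in cfd[tag].most_common(5)])
                for tag in cfd.conditions())

tagdict = findtags('NN', brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])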

Simplified Tag set (5.2.3-5.2.6)

Using the simplified tag set:

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'DET…
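simplify_tags=True is NLTK 2 only; in NLTK 3 the nearest equivalent is tagset='universal', with different tag names (e.g. 'NOUN' instead of 'N'). A sketch:

import nltk
from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())  # tags ordered by frequency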

Picking up combinations of Pinyin and Chinese characters

It's time to check combinations of Pinyin and Chinese characters (Hanzi). First, I created a new function named split_per_hanzi().

def split_per_hanzi(word_list):
    # ['你好','ni3 hao3'] --> [['ni3', '你'],['hao3', '好']]
    s_h…
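The body is cut off above; a hypothetical implementation matching the docstring comment (pairing the n-th Pinyin syllable with the n-th character) could be:

def split_per_hanzi(word_list):
    # ['你好', 'ni3 hao3'] --> [['ni3', '你'], ['hao3', '好']]
    # Assumes unicode strings (automatic in Python 3, u'' literals in Python 2).
    hanzi, pinyin = word_list
    # Pair the n-th syllable with the n-th character by position.
    return [[syl, ch] for syl, ch in zip(pinyin.split(), hanzi)]

print(split_per_hanzi(['你好', 'ni3 hao3']))  # [['ni3', '你'], ['hao3', '好']]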

Corpus with tags (5.2-5.2.2)

A tagged token is expressed as a tuple.

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

A longer sample:

>>> sent = '''
... The/AT grand/JJ jury/NN comment…
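The truncated session presumably continues as in the book, splitting the string and converting each token; a sketch:

import nltk

sent = 'The/AT grand/JJ jury/NN said/VBD'
tagged = [nltk.tag.str2tuple(t) for t in sent.split()]
print(tagged)  # [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('said', 'VBD')]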

Use Tagger (5.1)

Moving forward to Chapter 5, although the exercises of Chapter 4 are still remaining.

>>> text = nltk.word_tokenize("And now for somthing completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('somthing', 'VBG')…
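Note the typo 'somthing' in the input, which is presumably why the tagger guesses 'VBG' here; with the intended spelling the book gets 'NN':

import nltk

text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
# The book shows: [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
#                 ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]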

Exercise: Chapter 4 (1-2)

1. Just showing the help documents.

>>> help(str)
>>> help(list)
>>> help(tuple)

2. Just compare the two help documents.

Tuple:

Help on class tuple in module __builtin__:

class tuple(object)
 |  tuple() -> empty tuple
 |  tuple(iterable) -> tup…

Analysing Chinese words 2

I had overlooked ConditionalFreqDist since the last article.

>>> ccfd = nltk.ConditionalFreqDist((c, v) for (c, v, tone) in ping_elements)
>>> ccfd.conditions()
['', 'b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh',…
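A self-contained sketch of the same idea, with a hypothetical ping_elements of (initial, final, tone) triples:

import nltk

# Hypothetical sample, as if parsed from syllables like 'ni3' and 'hao3'.
ping_elements = [('n', 'i', '3'), ('h', 'ao', '3'), ('', 'ai', '4')]

ccfd = nltk.ConditionalFreqDist((c, v) for (c, v, tone) in ping_elements)
print(sorted(ccfd.conditions()))  # the initials: ['', 'h', 'n']
print(dict(ccfd['n']))            # finals observed after initial 'n': {'i': 1}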

CSV for unicode, Numpy (4.8.3-4.8.4)

Handling a CSV file: I have a good CSV file that was generated from my Chinese word learning database. There are 5,800 words inside.

>>> import csv, codecs
>>> import_file = codecs.open("/Users/xxx/Documents/workspace/NLTK Learning/text file…
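In Python 2 the csv module cannot consume unicode streams directly; a common workaround (the file name here is hypothetical) is to parse bytes and decode each field afterwards:

import csv

# Read bytes with csv, then decode each field to unicode.
with open('words.csv', 'rb') as f:
    rows = [[field.decode('utf-8') for field in row]
            for row in csv.reader(f)]
print(rows[0])  # first record, now as unicode strings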

NetworkX (4.8.2)

When I tried, http://networkx.lanl.gov/ was not available. I used http://networkx.github.io to read the documents instead. For installation, I did as follows; sudo was required because of an authorization problem.

$ sudo pip install networkx
Passw…
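A minimal sketch (not from the post) to confirm the install works:

import networkx as nx

# Build a tiny graph and query it.
g = nx.Graph()
g.add_edge('dog', 'canine')
g.add_edge('canine', 'carnivore')
print(g.number_of_nodes(), g.number_of_edges())  # 3 2
print(sorted(g.neighbors('canine')))             # ['carnivore', 'dog']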

Matplotlib (4.8.1)

This is not my first time seeing Matplotlib.

colors = 'rgbcmyk'  # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    import pylab
    ind = pylab.arang…
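A smaller sketch of the same grouped bar chart using matplotlib.pyplot directly (pylab is discouraged these days); the counts are made up for illustration:

import numpy as np
import matplotlib.pyplot as plt

categories = ['news', 'religion']
words = ['can', 'could']
counts = {'news': [93, 86], 'religion': [82, 59]}  # made-up sample counts

ind = np.arange(len(words))
width = 0.35
for i, cat in enumerate(categories):
    # One group of bars per category, offset so the groups sit side by side.
    plt.bar(ind + i * width, counts[cat], width, label=cat)
plt.xticks(ind + width / 2, words)
plt.ylabel('Frequency')
plt.legend()
plt.show()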

Accessing Chinese word database

As I already mentioned previously, I have a Chinese word database which was created when I was learning Chinese. This database includes 5,000+ words, mainly picked from the HSK Level 6 vocabulary. First I wrote some code to process Pingyi…

Algorithm design 2 (4.7.2-4.7.3)

Improve search speed by building an index.

def raw(file):
    contents = open(file).read()
    contents = re.sub(r'<.*?>', ' ', contents)
    contents = re.sub('\s+', ' ', contents)
    return contents

def snippet(doc, term):  # buggy
    text = ' ' * 30 + raw(…
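The index idea in a self-contained sketch: scan every file once up front so each query becomes a single dictionary lookup (the file names here are hypothetical):

import re
import nltk

def raw(file):
    # Strip HTML tags and collapse whitespace.
    contents = open(file).read()
    contents = re.sub(r'<.*?>', ' ', contents)
    contents = re.sub(r'\s+', ' ', contents)
    return contents

files = ['doc1.html', 'doc2.html']  # hypothetical document list
# nltk.Index maps each word to every file containing it.
index = nltk.Index((w, f) for f in files for w in raw(f).split())
print(index['language'])  # files containing 'language'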

Algorithm Design (4.7.1)

Recursion: first using a for loop.

>>> def factorial1(n):
...     result = 1
...     for i in range(n):
...         result *= (i + 1)
...     return result
...
>>> factorial1(3)
6
>>> factorial1(8)
40320
>>> factorial1(10)
3628800

The same logic can be realized with recurs…
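The truncated part presumably shows the book's recursive version; a sketch:

def factorial2(n):
    # n! = n * (n-1)!, with 1 as the base case.
    if n == 1:
        return 1
    return n * factorial2(n - 1)

print(factorial2(10))  # 3628800, matching factorial1 above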

Structure of Python module (4.6)

It took some time to find the source code on my laptop when I faced strange behavior, because I didn't know this attribute.

>>> nltk.metrics.distance.__file__
'/Library/Python/2.7/site-packages/nltk/metrics/distance.pyc'

To get help of the…
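The cut-off sentence presumably introduces help(); a minimal sketch of both tricks together:

import nltk

# __file__ reveals where the module lives on disk (.py or compiled .pyc).
print(nltk.metrics.distance.__file__)
# help() renders the module's docstrings in the interactive pager.
help(nltk.metrics.distance)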

Japanese WordNet (12.1.5)

There are several Japanese thesauri, but many of them are not free of charge. Instead of them, Japanese WordNet is available here: http://nlpwww.nict.go.jp/wn-ja/ First, I downloaded the file "wnjpn-all.tab" and put it under "nltk_data/wnjpn".…
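A hypothetical sketch of loading the tab file, assuming rows of the form synset-id <TAB> lemma <TAB> source as in the wn-ja distribution:

import codecs
from collections import defaultdict

# Map each synset id to its Japanese lemmas.
jpn_wn = defaultdict(list)
with codecs.open('nltk_data/wnjpn/wnjpn-all.tab', 'r', 'utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            jpn_wn[fields[0]].append(fields[1])

print(len(jpn_wn))  # number of synsets that have Japanese entries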

Accumulative Functions (4.5.3)

Accumulative functions, from Chapter 4.5.3 of the whale book. filter() takes two parameters: the first is a function and the second is sequential data. The return value will be the elements of the sequential data for which True is returned by the…
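A sketch along the lines of the book's example (list() added so it also runs on Python 3, where filter() returns an iterator):

def is_content_word(word):
    # True for words that are not short function words or punctuation.
    return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
print(list(filter(is_content_word, sent)))
# ['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']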