NLTK

Lookup tagger (5.4.3-5.4.4)

Lookup tagger: >>> fd = nltk.FreqDist(brown.words(categories='news')) >>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news')) >>> most_freq_words = fd.keys()[:100] >>> likely_tags = dict((word, cfd[word].max()) for word i…
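The excerpt cuts off mid-expression. As a minimal sketch of how it continues, following the book's approach of seeding a UnigramTagger with a model dictionary (NLTK 2 / Python 2, same variable names as above):

>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.tag(brown.sents(categories='news')[3])   # words outside the model come back as None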

Automatic tagging (5.4-5.4.2)

Start with preparation: >>> from nltk.corpus import brown >>> brown_tagged_sents = brown.tagged_sents(categories='news') >>> brown_sents = brown.sents(categories='news') Then check which tag is used most frequently; it's 'NN'. >>> tags = [tag for…
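The excerpt breaks off, so here is a short sketch of the default tagger this section builds, which simply tags everything as 'NN' (assuming the variables prepared above):

>>> raw = 'I do not like green eggs and ham'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)                    # every token gets 'NN'
>>> default_tagger.evaluate(brown_tagged_sents)   # low accuracy, roughly 0.13 per the book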

Compound keys and values (5.3.6-)

>>> pos = nltk.defaultdict(lambda: nltk.defaultdict(int)) >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True) >>> for ((w1, t1), (w2, t2)) in nltk.ibigrams(brown_news_tagged): ... pos[(t1, w2)][t2] += 1 ... >>…
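After the loop, the nested defaultdict can be queried with a compound (tag, word) key; for example (the particular key here is just an illustration):

>>> pos[('DET', 'right')].items()   # (tag, count) pairs seen for 'right' after a determiner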

Incrementing dictionary values (5.3.5)

Count the number of words per tag. >>> counts = nltk.defaultdict(int) >>> from nltk.corpus import brown >>> for (word, tag) in brown.tagged_words(categories='news'): ... counts[tag] += 1 ... >>> counts['N'] 0 The result was different from the textbook…

Defining a Dictionary (5.3.3-5.3.4)

Defining a dictionary: >>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'} >>> pos {'furiously': 'ADV', 'sleep': 'V', 'ideas': 'N', 'colorless': 'ADJ'} >>> pos2 = dict(colorless='ADJ', sleep='V', ideas='N', furio…

Mapping attributes to words with Python (5.3-5.3.2)

Creating a Python dictionary: >>> pos = [] >>> pos [] >>> pos['colorless'] = 'ADJ' Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: list indices must be integers, not str This is a typical mistake. We sho…
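The excerpt is cut off before the fix, but the point is simple: initialize with {} (or dict()), not [], because only mappings accept string keys.

>>> pos = {}                  # an empty dict, not a list
>>> pos['colorless'] = 'ADJ'  # string keys now work
>>> pos
{'colorless': 'ADJ'}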

Non-simplified tags (5.2.7-)

Analysing nouns in further detail. This program shows the top 5 words of each type. >>> def findtags(tag_prefix, tagged_text): ... cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix)) ... retu…
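The definition is truncated; here is a sketch of the complete version as I understand it from the book (NLTK 2 idiom, where cfd[tag].keys() returns words sorted by decreasing frequency):

>>> def findtags(tag_prefix, tagged_text):
...     cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
...                                    if tag.startswith(tag_prefix))
...     return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
...
>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]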

Simplified Tag set (5.2.3-5.2.6)

Using the simplified tag set: >>> from nltk.corpus import brown >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True) >>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged) >>> tag_fd.keys() ['N', 'DET…

Picking up combinations of Pinyin and Chinese characters

It's time to check combinations of Pinyin and Chinese characters (Hanzi). First, I created a new function named split_per_hanzi(). def split_per_hanzi(word_list): # ['你好','ni3 hao3'] --> [['ni3', '你'],['hao3', '好']] s_h…
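The function body is truncated, so here is a minimal reconstruction matching the spec in the comment, assuming exactly one Pinyin syllable per Hanzi:

def split_per_hanzi(word_list):
    # ['你好', 'ni3 hao3'] --> [['ni3', '你'], ['hao3', '好']]
    hanzi, pinyin = word_list[0], word_list[1]
    # hanzi must be a unicode string so iteration yields one character per Hanzi
    return [[p, h] for (p, h) in zip(pinyin.split(), hanzi)]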

Corpus with tags (5.2-5.2.2)

A tagged token is expressed as a tuple. >>> tagged_token = nltk.tag.str2tuple('fly/NN') >>> tagged_token ('fly', 'NN') >>> tagged_token[0] 'fly' >>> tagged_token[1] 'NN' A longer sample is here. >>> sent = ''' ... The/AT grand/JJ jury/NN comment…
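The longer sample is cut off; tagged strings like the one above can be converted wholesale, as in the book:

>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ...]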

Using a Tagger (5.1)

Moving forward to Chapter 5, although the exercises of Chapter 4 still remain. >>> text = nltk.word_tokenize("And now for somthing completely different") >>> nltk.pos_tag(text) [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('somthing', 'VBG')…
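(Note that 'somthing' is a typo I left in the input, which is probably why it was tagged VBG rather than NN.) To look up what a tag means, NLTK ships tagset documentation:

>>> nltk.help.upenn_tagset('VBG')   # prints the tag's definition and examples
>>> nltk.help.upenn_tagset('NN.*')  # a regexp selects a whole family of tags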

Exercise: Chapter 4 (1-2)

1. Just showing the help documents. >>> help(str) >>> help(list) >>> help(tuple) 2. Just comparing the two help documents. Tuple: Help on class tuple in module __builtin__: class tuple(object) | tuple() -> empty tuple | tuple(iterable) -> tup…

Analysing Chinese words 2

I had missed ConditionalFreqDist in the last article. >>> ccfd = nltk.ConditionalFreqDist((c,v) for (c, v, tone) in ping_elements) >>> ccfd.conditions() ['', 'b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh',…
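A ConditionalFreqDist built this way can be queried per initial or tabulated; a small usage sketch, assuming ping_elements holds (initial, final, tone) triples as above:

>>> ccfd['zh'].max()                                # most frequent final after the initial 'zh'
>>> ccfd.tabulate(conditions=['b', 'p', 'm', 'f'])  # small table of final counts per initial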

CSV for Unicode, NumPy (4.8.3-4.8.4)

Handling a CSV file: I have a good CSV file that was generated from my Chinese word learning database. There are 5,800 words inside. >>> import csv, codecs >>> import_file = codecs.open("/Users/xxx/Documents/workspace/NLTK Learning/text file…
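One caveat worth recording: Python 2's csv module operates on bytes, so mixing it with codecs.open can misbehave. A common workaround is to parse raw bytes and decode each cell afterwards (the file name below is just a placeholder):

import csv

def utf8_csv_rows(path):
    # csv in Python 2 works on bytes; decode each cell after parsing
    with open(path, 'rb') as f:
        for row in csv.reader(f):
            yield [cell.decode('utf-8') for cell in row]

for row in utf8_csv_rows('words.csv'):
    print u' / '.join(row)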

NetworkX (4.8.2)

When I tried, http://networkx.lanl.gov/ was not available. I used http://networkx.github.io to read the documents instead. For installation, I did as follows; sudo was required because of an authorization problem. $ sudo pip install networkx Passw…
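To confirm the installation works, a minimal smoke test (unrelated to the book's WordNet graph example):

>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_edge('dog', 'canine')
>>> g.add_edge('canine', 'carnivore')
>>> g.number_of_nodes()
3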

Matplotlib (4.8.1)

This is not my first time seeing Matplotlib. colors = 'rgbcmyk' #red, green, blue, cyan, magenta, yellow, black def bar_chart(categories, words, counts): "Plot a bar chart showing counts for each word by category" import pylab ind = pylab.arang…
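The function is truncated; here is the complete version as best I can reconstruct it from the book, where counts maps each category to a list of per-word counts:

colors = 'rgbcmyk'  # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    import pylab
    ind = pylab.arange(len(words))
    width = 1.0 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = pylab.bar(ind + c * width, counts[categories[c]], width,
                         color=colors[c % len(colors)])
        bar_groups.append(bars)
    pylab.xticks(ind + width, words)
    pylab.legend([b[0] for b in bar_groups], categories, loc='upper left')
    pylab.ylabel('Frequency')
    pylab.show()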

Accessing Chinese word database

As I mentioned previously, I have a Chinese word database which I created when I was learning Chinese. This database includes 5,000+ words, mainly picked from the HSK Level 6 vocabulary. First I wrote some code to process Pinyi…

Algorithm design 2 (4.7.2-4.7.3)

Improving search speed by building an index. def raw(file): contents = open(file).read() contents = re.sub(r'<.*?>', ' ', contents) contents = re.sub('\s+', ' ', contents) return contents def snippet(doc, term): #buggy text = ' ' * 30 + raw(…
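The idea of the speed-up is an inverted index: scan every document once up front, then each query becomes a dictionary lookup instead of a rescan of all files. A minimal sketch, reusing the raw() helper above (the book itself uses nltk.Index in a similar way):

from collections import defaultdict

def build_index(files):
    # map each term to the set of files containing it
    index = defaultdict(set)
    for fname in files:
        for term in raw(fname).split():
            index[term].add(fname)
    return index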

Algorithm Design (4.7.1)

Recursion: first using a for loop. >>> def factorial1(n): ... result = 1 ... for i in range(n): ... result *= (i + 1) ... return result ... >>> factorial1(3) 6 >>> factorial1(8) 40320 >>> factorial1(10) 3628800 The same logic can be realized with recurs…
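The recursive version the excerpt cuts off, as in the book, defines n! in terms of (n-1)!:

>>> def factorial2(n):
...     if n == 1:
...         return 1
...     else:
...         return n * factorial2(n - 1)
...
>>> factorial2(8)
40320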

Structure of a Python module (4.6)

It took some time to find the source code on my laptop when I faced strange behavior, because I didn't know this command. >>> nltk.metrics.distance.__file__ '/Library/Python/2.7/site-packages/nltk/metrics/distance.pyc' To get help on the…

Japanese WordNet (12.1.5)

There are several Japanese thesauri, but many of them are not free of charge. Instead, the Japanese WordNet is available here: http://nlpwww.nict.go.jp/wn-ja/ First, I downloaded the file "wnjpn-all.tab" and put it under "nltk_data/wnjpn".…
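A rough loading sketch for the tab file; the exact column layout is an assumption here and should be checked against the downloaded file before relying on it:

import codecs
from collections import defaultdict

def load_wnjpn(path):
    # Assumed layout: synset-id <tab> Japanese lemma <tab> source
    synsets = defaultdict(list)
    for line in codecs.open(path, 'r', 'utf-8'):
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            synsets[fields[0]].append(fields[1])
    return synsets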

Accumulative Functions (4.5.3)

Accumulative functions, from Chapter 4.5.3 of the whale book. filter() takes two parameters: the first one takes another function and the second one is sequential data. The return value will be the elements of the sequential data for which True is returned by the…
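A small self-contained example of that behavior (my own illustration, Python 2, where filter() returns a list):

>>> sent = ['Take', 'care', 'of', 'the', 'sense', '.']
>>> def is_lexical(word):
...     return word.isalpha() and word.lower() not in ('of', 'the', 'and', 'a')
...
>>> filter(is_lexical, sent)
['Take', 'care', 'sense']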

NLTK handling with Cygwin

Chapter 12 of the Japanese edition of the whale book describes how to handle the Japanese language. It seems to go well in my Mac environment (Mountain Lion); however, I faced several character corruption issues in my Windows 7 environment. Us…

Utilize Functions (4.5-4.5.2)

In Python, a function itself can be passed as a parameter to another function. >>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the', ... 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.'] >>> def extract_property(prop): ... return [pro…
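The definition is truncated; completed as in the book, with the built-in len passed in as the property:

>>> def extract_property(prop):
...     return [prop(word) for word in sent]
...
>>> extract_property(len)
[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]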

Text processing with corpus (12.1.4)

Get the number of words and the total length of all words. >>> genpaku = ChasenCorpusReader('C:/Users/xxxxxxx/AppData/Roaming/nltk_data/jeita', 'g.*chasen', 'utf-8') >>> print len(genpaku.words()) 733016 >>> >>> print sum(len(w) for w in genpaku.words…

Check parameter type (4.4.4-4.4.6)

It is not necessary to declare variable types in Python. As a result, unexpected behavior can happen. >>> def tag(word): ... if word in ['a', 'the', 'all']: ... return 'det' ... else: ... return 'noun' ... >>> tag('the') 'det' >>> tag…
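The book's remedy is an assertion on the parameter type; a sketch in Python 2 (where strings are basestring):

>>> def tag(word):
...     assert isinstance(word, basestring), "argument to tag() must be a string"
...     if word in ['a', 'the', 'all']:
...         return 'det'
...     else:
...         return 'noun'
...
>>> tag(["the"])
Traceback (most recent call last):
  ...
AssertionError: argument to tag() must be a string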

Corpus with analyzed dependency structure (12.1.3)

Start by importing KNBC. Be careful, as there are some small mistakes in the textbook's sample. >>> from nltk.corpus.reader.knbc import * >>> from nltk.corpus.util import LazyCorpusLoader >>> root = nltk.data.find('corpora/knb…

Functions (4.4.1-4.4.3)

Learning about functions (Chapter 4.4). For reuse, create a function and save it in a file. import re def get_text(file): """Read text from a file, normalizing whitespace and stripping HTML markup.""" text = open(file).read() text = re.sub('\s+'…

Corpus with Tags (12.1.2)

Import ChaSen: >>> from chasen import * Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named chasen >>> from nltk.corpus.reader.chasen import * According to the textbook, the corpus was …

Coding style (4.3)

This chapter of the whale book covers the very basics of Python coding style. I will just pick up some interesting examples. Two pieces of code are included that produce the same result. >>> tokens = nltk.corpus.brown.words(categories='news'…
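The two versions, as I recall them from the book, compute the average token length: first a procedural loop, then the more idiomatic one-liner (Python 2, so the division is integer division):

>>> count = 0
>>> total = 0
>>> for token in tokens:
...     count += 1
...     total += len(token)
...
>>> print total / count
>>> # versus the more idiomatic version
>>> total = sum(len(t) for t in tokens)
>>> print total / len(tokens)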