NLTK

Lookup tagger (5.4.3-5.4.4)

Lookup tagger: >>> fd = nltk.FreqDist(brown.words(categories='news')) >>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news')) >>> most_freq_words = fd.keys()[:100] >>> likely_tags = dict((word, cfd[word].max()) for word i…
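The excerpt cuts off mid-expression. As a minimal sketch of how it continues, following the book's approach of seeding a UnigramTagger with a model dictionary (NLTK 2 / Python 2, same variable names as above):

>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.tag(brown.sents(categories='news')[3])   # words outside the model come back as None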

Automatic tagging (5.4-5.4.2)

Start with preparation: >>> from nltk.corpus import brown >>> brown_tagged_sents = brown.tagged_sents(categories='news') >>> brown_sents = brown.sents(categories='news') Then check which tag is used most frequently; it's 'NN'. >>> tags = [tag for…
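The excerpt breaks off, so here is a short sketch of the default tagger this section builds, which simply tags everything as 'NN' (assuming the variables prepared above):

>>> raw = 'I do not like green eggs and ham'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)                    # every token gets 'NN'
>>> default_tagger.evaluate(brown_tagged_sents)   # low accuracy, roughly 0.13 per the book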

Compound keys and values (5.3.6-)

>>> pos = nltk.defaultdict(lambda: nltk.defaultdict(int)) >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True) >>> for ((w1, t1), (w2, t2)) in nltk.ibigrams(brown_news_tagged): ... pos[(t1, w2)][t2] += 1 ... >>…
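After the loop, the nested defaultdict can be queried with a compound (tag, word) key; for example (the particular key here is just an illustration):

>>> pos[('DET', 'right')].items()   # (tag, count) pairs seen for 'right' after a determiner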

Incrementing dictionary values (5.3.5)

Count the number of words per tag. >>> counts = nltk.defaultdict(int) >>> from nltk.corpus import brown >>> for (word, tag) in brown.tagged_words(categories='news'): ... counts[tag] += 1 ... >>> counts['N'] 0 The result was different from the textbook…

Defining a Dictionary (5.3.3-5.3.4)

Defining a dictionary: >>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'} >>> pos {'furiously': 'ADV', 'sleep': 'V', 'ideas': 'N', 'colorless': 'ADJ'} >>> pos2 = dict(colorless='ADJ', sleep='V', ideas='N', furio…

Mapping attributes to words with Python (5.3-5.3.2)

Creating a Python dictionary: >>> pos = [] >>> pos [] >>> pos['colorless'] = 'ADJ' Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: list indices must be integers, not str This is a typical mistake. We sho…
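The excerpt is cut off before the fix, but the point is simple: initialize with {} (or dict()), not [], because only mappings accept string keys.

>>> pos = {}                  # an empty dict, not a list
>>> pos['colorless'] = 'ADJ'  # string keys now work
>>> pos
{'colorless': 'ADJ'}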

Non-simplified tags (5.2.7-)

Analysing nouns in further detail. This program shows the top 5 words of each type. >>> def findtags(tag_prefix, tagged_text): ... cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix)) ... retu…
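The definition is truncated; here is a sketch of the complete version as I understand it from the book (NLTK 2 idiom, where cfd[tag].keys() returns words sorted by decreasing frequency):

>>> def findtags(tag_prefix, tagged_text):
...     cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
...                                    if tag.startswith(tag_prefix))
...     return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
...
>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]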

Simplified Tag set (5.2.3-5.2.6)

Using the simplified tag set: >>> from nltk.corpus import brown >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True) >>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged) >>> tag_fd.keys() ['N', 'DET…

Picking up combinations of Pinyin and Chinese characters

It's time to check combinations of Pinyin and Chinese characters (Hanzi). First, I created a new function named split_per_hanzi(). def split_per_hanzi(word_list): # ['你好','ni3 hao3'] --> [['ni3', '你'],['hao3', '好']] s_h…
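The function body is truncated, so here is a minimal reconstruction matching the spec in the comment, assuming exactly one Pinyin syllable per Hanzi:

def split_per_hanzi(word_list):
    # ['你好', 'ni3 hao3'] --> [['ni3', '你'], ['hao3', '好']]
    hanzi, pinyin = word_list[0], word_list[1]
    # hanzi must be a unicode string so iteration yields one character per Hanzi
    return [[p, h] for (p, h) in zip(pinyin.split(), hanzi)]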

Corpus with tags (5.2-5.2.2)

A tagged token is expressed as a tuple. >>> tagged_token = nltk.tag.str2tuple('fly/NN') >>> tagged_token ('fly', 'NN') >>> tagged_token[0] 'fly' >>> tagged_token[1] 'NN' A longer sample is here. >>> sent = ''' ... The/AT grand/JJ jury/NN comment…
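The longer sample is cut off; tagged strings like the one above can be converted wholesale, as in the book:

>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ...]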

Using a Tagger (5.1)

Moving forward to Chapter 5, although the exercises of Chapter 4 still remain. >>> text = nltk.word_tokenize("And now for somthing completely different") >>> nltk.pos_tag(text) [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('somthing', 'VBG')…
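(Note that 'somthing' is a typo I left in the input, which is probably why it was tagged VBG rather than NN.) To look up what a tag means, NLTK ships tagset documentation:

>>> nltk.help.upenn_tagset('VBG')   # prints the tag's definition and examples
>>> nltk.help.upenn_tagset('NN.*')  # a regexp selects a whole family of tags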

Exercise: Chapter 4 (1-2)

1. Just showing the help documents. >>> help(str) >>> help(list) >>> help(tuple) 2. Just comparing the two help documents. Tuple: Help on class tuple in module __builtin__: class tuple(object) | tuple() -> empty tuple | tuple(iterable) -> tup…

Analysing Chinese words 2

I had missed ConditionalFreqDist in the last article. >>> ccfd = nltk.ConditionalFreqDist((c,v) for (c, v, tone) in ping_elements) >>> ccfd.conditions() ['', 'b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh',…
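A ConditionalFreqDist built this way can be queried per initial or tabulated; a small usage sketch, assuming ping_elements holds (initial, final, tone) triples as above:

>>> ccfd['zh'].max()                                # most frequent final after the initial 'zh'
>>> ccfd.tabulate(conditions=['b', 'p', 'm', 'f'])  # small table of final counts per initial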

CSV for Unicode, NumPy (4.8.3-4.8.4)

Handling a CSV file: I have a good CSV file that was generated from my Chinese word learning database. There are 5,800 words inside. >>> import csv, codecs >>> import_file = codecs.open("/Users/xxx/Documents/workspace/NLTK Learning/text file…
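One caveat worth recording: Python 2's csv module operates on bytes, so mixing it with codecs.open can misbehave. A common workaround is to parse raw bytes and decode each cell afterwards (the file name below is just a placeholder):

import csv

def utf8_csv_rows(path):
    # csv in Python 2 works on bytes; decode each cell after parsing
    with open(path, 'rb') as f:
        for row in csv.reader(f):
            yield [cell.decode('utf-8') for cell in row]

for row in utf8_csv_rows('words.csv'):
    print u' / '.join(row)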

NetworkX (4.8.2)

When I tried, http://networkx.lanl.gov/ was not available. I used http://networkx.github.io to read the documents instead. For installation, I did as follows; sudo was required because of an authorization problem. $ sudo pip install networkx Passw…
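To confirm the installation works, a minimal smoke test (unrelated to the book's WordNet graph example):

>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_edge('dog', 'canine')
>>> g.add_edge('canine', 'carnivore')
>>> g.number_of_nodes()
3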

Matplotlib (4.8.1)

This is not my first time seeing Matplotlib. colors = 'rgbcmyk' #red, green, blue, cyan, magenta, yellow, black def bar_chart(categories, words, counts): "Plot a bar chart showing counts for each word by category" import pylab ind = pylab.arang…
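The function is truncated; here is the complete version as best I can reconstruct it from the book, where counts maps each category to a list of per-word counts:

colors = 'rgbcmyk'  # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    import pylab
    ind = pylab.arange(len(words))
    width = 1.0 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = pylab.bar(ind + c * width, counts[categories[c]], width,
                         color=colors[c % len(colors)])
        bar_groups.append(bars)
    pylab.xticks(ind + width, words)
    pylab.legend([b[0] for b in bar_groups], categories, loc='upper left')
    pylab.ylabel('Frequency')
    pylab.show()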

Accessing Chinese word database

As I mentioned previously, I have a Chinese word database which I created when I was learning Chinese. This database includes 5,000+ words, mainly picked from the HSK Level 6 vocabulary. First I wrote some code to process Pinyi…

Algorithm design 2 (4.7.2-4.7.3)

Improving search speed by building an index. def raw(file): contents = open(file).read() contents = re.sub(r'<.*?>', ' ', contents) contents = re.sub('\s+', ' ', contents) return contents def snippet(doc, term): #buggy text = ' ' * 30 + raw(…
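The idea of the speed-up is an inverted index: scan every document once up front, then each query becomes a dictionary lookup instead of a rescan of all files. A minimal sketch, reusing the raw() helper above (the book itself uses nltk.Index in a similar way):

from collections import defaultdict

def build_index(files):
    # map each term to the set of files containing it
    index = defaultdict(set)
    for fname in files:
        for term in raw(fname).split():
            index[term].add(fname)
    return index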

Algorithm Design (4.7.1)

Recursion: first using a for loop. >>> def factorial1(n): ... result = 1 ... for i in range(n): ... result *= (i + 1) ... return result ... >>> factorial1(3) 6 >>> factorial1(8) 40320 >>> factorial1(10) 3628800 The same logic can be realized with recurs…
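The recursive version the excerpt cuts off, as in the book, defines n! in terms of (n-1)!:

>>> def factorial2(n):
...     if n == 1:
...         return 1
...     else:
...         return n * factorial2(n - 1)
...
>>> factorial2(8)
40320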

Structure of a Python module (4.6)

It took some time to find the source code on my laptop when I faced strange behavior, because I didn't know this command. >>> nltk.metrics.distance.__file__ '/Library/Python/2.7/site-packages/nltk/metrics/distance.pyc' To get help on the…

Japanese WordNet (12.1.5)

There are several Japanese thesauri, but many of them are not free of charge. Instead, the Japanese WordNet is available here: http://nlpwww.nict.go.jp/wn-ja/ First, I downloaded the file "wnjpn-all.tab" and put it under "nltk_data/wnjpn".…
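A rough loading sketch for the tab file; the exact column layout is an assumption here and should be checked against the downloaded file before relying on it:

import codecs
from collections import defaultdict

def load_wnjpn(path):
    # Assumed layout: synset-id <tab> Japanese lemma <tab> source
    synsets = defaultdict(list)
    for line in codecs.open(path, 'r', 'utf-8'):
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            synsets[fields[0]].append(fields[1])
    return synsets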

Accumulative Functions (4.5.3)

Accumulative functions, from Chapter 4.5.3 of the whale book. filter() takes two parameters: the first one takes another function and the second one is sequential data. The return value will be the elements of the sequential data for which True is returned by the…
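A small self-contained example of that behavior (my own illustration, Python 2, where filter() returns a list):

>>> sent = ['Take', 'care', 'of', 'the', 'sense', '.']
>>> def is_lexical(word):
...     return word.isalpha() and word.lower() not in ('of', 'the', 'and', 'a')
...
>>> filter(is_lexical, sent)
['Take', 'care', 'sense']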

NLTK handling with Cygwin

Chapter 12 of the Japanese edition of the whale book describes how to handle the Japanese language. It seems to go well in my Mac environment (Mountain Lion); however, I faced several character corruption issues in my Windows 7 environment. Us…

Utilize Functions (4.5-4.5.2)

In Python, a function itself can be passed as a parameter to another function. >>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the', ... 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.'] >>> def extract_property(prop): ... return [pro…
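The definition is truncated; completed as in the book, with the built-in len passed in as the property:

>>> def extract_property(prop):
...     return [prop(word) for word in sent]
...
>>> extract_property(len)
[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]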

Text processing with corpus (12.1.4)

Get the number of words and the total length of all words. >>> genpaku = ChasenCorpusReader('C:/Users/xxxxxxx/AppData/Roaming/nltk_data/jeita', 'g.*chasen', 'utf-8') >>> print len(genpaku.words()) 733016 >>> >>> print sum(len(w) for w in genpaku.words…

Check parameter type (4.4.4-4.4.6)

It is not necessary to declare variable types in Python. As a result, unexpected behavior can happen. >>> def tag(word): ... if word in ['a', 'the', 'all']: ... return 'det' ... else: ... return 'noun' ... >>> tag('the') 'det' >>> tag…
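The book's remedy is an assertion on the parameter type; a sketch in Python 2 (where strings are basestring):

>>> def tag(word):
...     assert isinstance(word, basestring), "argument to tag() must be a string"
...     if word in ['a', 'the', 'all']:
...         return 'det'
...     else:
...         return 'noun'
...
>>> tag(["the"])
Traceback (most recent call last):
  ...
AssertionError: argument to tag() must be a string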

Corpus with analyzed dependency structure (12.1.3)

Start by importing KNBC. Be careful, as there are some small mistakes in the textbook's sample. >>> from nltk.corpus.reader.knbc import * >>> from nltk.corpus.util import LazyCorpusLoader >>> root = nltk.data.find('corpora/knb…

Functions (4.4.1-4.4.3)

Learning about functions (Chapter 4.4). For reuse, create a function and save it in a file. import re def get_text(file): """Read text from a file, normalizing whitespace and stripping HTML markup.""" text = open(file).read() text = re.sub('\s+'…

Corpus with Tags (12.1.2)

Import ChaSen: >>> from chasen import * Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named chasen >>> from nltk.corpus.reader.chasen import * According to the textbook, the corpus was …

Coding style (4.3)

This chapter of the whale book covers the very basics of Python coding style. I will just pick up some interesting examples. Two pieces of code are included that produce the same result. >>> tokens = nltk.corpus.brown.words(categories='news'…
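The two versions, as I recall them from the book, compute the average token length: first a procedural loop, then the more idiomatic one-liner (Python 2, so the division is integer division):

>>> count = 0
>>> total = 0
>>> for token in tokens:
...     count += 1
...     total += len(token)
...
>>> print total / count
>>> # versus the more idiomatic version
>>> total = sum(len(t) for t in tokens)
>>> print total / len(tokens)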