Entries from 2013-06-01 (one month)

NLTK handling with Cygwin

Chapter 12 of the Japanese edition of the whalebook describes how to handle Japanese. It seems to work well in my Mac environment (Mountain Lion); however, I faced several character corruption issues in my Windows 7 environment. Us…

Utilize Functions (4.5-4.5.2)

In Python, a function can itself be passed as a parameter to another function. >>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', ... 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.'] >>> def extract_property(prop): ... return [pro…
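The snippet above is cut off, so here is a runnable sketch of the same idea, completing `extract_property` along the lines of the book's example (the completion of the body is my reconstruction):

```python
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and',
        'the', 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def extract_property(prop):
    # apply the function passed in as 'prop' to every word in sent
    return [prop(word) for word in sent]

def last_letter(word):
    return word[-1]

print(extract_property(len))          # word lengths
print(extract_property(last_letter))  # final character of each word
```

Both calls pass a function object (`len`, `last_letter`) as an ordinary argument.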

Text processing with corpus (12.1.4)

Get the number of words and the total length of all words. >>> genpaku = ChasenCorpusReader('C:/Users/xxxxxxx/AppData/Roaming/nltk_data/jeita', 'g.*chasen', 'utf-8') >>> print len(genpaku.words()) 733016 >>> >>> print sum(len(w) for w in genpaku.words…
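The JEITA/ChaSen corpus isn't bundled here, so as a sketch, the same counting idiom applied to a small hand-made word list (the list is my own stand-in, not corpus data):

```python
# stand-in for genpaku.words(): a tiny tokenized Japanese sentence
words = ['吾輩', 'は', '猫', 'で', 'ある', '。']

print(len(words))                  # number of tokens
print(sum(len(w) for w in words))  # total characters across all tokens
```

With the real corpus reader, `genpaku.words()` would take the place of `words`.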

Check parameter type (4.4.4-4.4.6)

It is not necessary to declare the types of variables in Python, so unexpected behavior can occur when a function receives an argument of the wrong type. >>> def tag(word): ... if word in ['a', 'the', 'all']: ... return 'det' ... else: ... return 'noun' ... >>> tag('the') 'det' >>> tag…
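One defensive fix, in the spirit of section 4.4.4, is to add an `assert` guard so a wrong argument type fails loudly instead of silently returning 'noun' (this version uses Python 3's `str`; the book's original used `basestring`):

```python
def tag(word):
    # fail fast if the caller passes something other than a string
    assert isinstance(word, str), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

print(tag('the'))  # det
print(tag('dog'))  # noun
# tag(3) would now raise AssertionError instead of returning 'noun'
```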

Corpus with analyzed dependency structure (12.1.3)

Start by importing KNBC. Be careful, as there are some small mistakes in the textbook's sample code. >>> from nltk.corpus.reader.knbc import * >>> from nltk.corpus.util import LazyCorpusLoader >>> root = nltk.data.find('corpora/knb…

Functions (4.4.1-4.4.3)

Learning about functions (Chapter 4.4). For reuse, create a function and save it in a file. import re def get_text(file): """Read text from a file, normalizing whitespace and stripping HTML markup.""" text = open(file).read() text = re.sub('\s+'…
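The listing above is truncated, so here is a completed version of `get_text` along those lines (the exact substitution order is my reconstruction of the book's example):

```python
import re

def get_text(path):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    text = open(path).read()
    text = re.sub(r'<.*?>', ' ', text)  # replace HTML tags with a space
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    return text
```

Saved in a module, the function can then be imported and reused from any script.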

Corpus with Tags (12.1.2)

Import ChaSen: >>> from chasen import * Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named chasen >>> from nltk.corpus.reader.chasen import * According to the textbook, the corpus was …

Coding style (4.3)

This chapter of the whale book covers the very basics of Python coding style. I'll just pick out some interesting examples. Two pieces of code are included that produce the same result. >>> tokens = nltk.corpus.brown.words(categories='news'…
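As a self-contained sketch of the "two ways, same result" point (using my own short word list in place of the Brown corpus tokens):

```python
# stand-in for nltk.corpus.brown.words(categories='news')
tokens = 'the quick brown fox'.split()

# procedural style: accumulate in an explicit loop
total = 0
for t in tokens:
    total += len(t)

# declarative style: a single generator expression
total2 = sum(len(t) for t in tokens)

print(total, total2)  # both give 16
```

The declarative form is shorter and harder to get wrong, which is the book's point.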

Japanese corpus (12.1.1)

>>> import nltk >>> from nltk.corpus.reader import * >>> from nltk.corpus.reader.util import * >>> from nltk.text import Text >>> >>> jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]') >>> jp_chartype_tokenizer = nltk.Regex…
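The sentence tokenizer's regex can be tried without NLTK at all, using the plain `re` module; the pattern below follows the book's tokenizer regex, and the sample text is my own:

```python
import re

# same pattern as the book's jp_sent_tokenizer: any run of characters
# that are not quotes/sentence-enders, followed by a sentence ender
jp_sent_pattern = '[^ 「」!?。]*[!?。]'

text = 'これは文です。これも文です。'
print(re.findall(jp_sent_pattern, text))
```

Each match is one sentence ending in 。, !, or ?, which is exactly what `RegexpTokenizer` would return for the same pattern.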

Combining Different Sequence Types (4.2.2-4.2.3)

Let's continue. >>> words = 'I turned off the spectroroute'.split() >>> wordlens = [(len(word), word) for word in words] >>> wordlens.sort() >>> ' '.join(w for(_, w) in wordlens) 'I off the turned spectroroute' >>> The first line is to spl…
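Put together as a script, this is the classic decorate-sort-undecorate idiom from the snippet above:

```python
words = 'I turned off the spectroroute'.split()

# decorate: pair each word with its length
wordlens = [(len(word), word) for word in words]
# sort: tuples compare by length first, then alphabetically
wordlens.sort()
# undecorate: discard the length, keep the word
result = ' '.join(w for (_, w) in wordlens)
print(result)  # 'I off the turned spectroroute'
```

The underscore in `(_, w)` is a convention for a value we deliberately ignore.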

How to handle Japanese with Python (12)

This is about chapter 12 of the whale book. This chapter is only available in the Japanese edition. I will still write in English, as this might be helpful for other double-byte character languages. Of course, I will continue with the other chapters (…

Sequence (4.2.1)

Chapter 4 of the whale book looks like a grammar review of Python. Go on to the next section, 4.2, using tuples. >>> t = 'walk', 'fem', 3 >>> t ('walk', 'fem', 3) >>> t[0] 'walk' >>> t[1:] ('fem', 3) >>> len(t) 3 >>> raw = 'I turned off the spectror…
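The tuple basics above as a runnable script, with one extra line (my addition) showing that tuples are immutable:

```python
t = 'walk', 'fem', 3  # parentheses are optional when building a tuple

print(t[0])   # indexing works like a list
print(t[1:])  # slicing yields another tuple
print(len(t))

# unlike lists, tuples cannot be modified in place
try:
    t[0] = 'run'
except TypeError:
    print('tuples are immutable')
```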