Entries from 2013-06-01 to 1 month
Chapter 12 of the Japanese edition of the whalebook describes how to handle Japanese. It seems to work well in my Mac environment (Mountain Lion); however, I faced several character-corruption issues in my Windows 7 environment. Us…
In Python, a function itself can be a parameter of another function. >>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the', ... 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.'] >>> def extract_property(prop): ... return [pro…
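The session above is cut off, so here is my sketch of the full pattern; the body of `extract_property` is a reconstruction of what the book likely has, not a verbatim quote:

```python
# Sketch of passing a function as an argument, reconstructing
# the truncated extract_property example (body is assumed).
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def extract_property(prop):
    # prop is itself a function; apply it to every word
    return [prop(word) for word in sent]

def last_letter(word):
    return word[-1]

lengths = extract_property(len)        # built-in function as argument
lasts = extract_property(last_letter)  # user-defined function as argument
```

Either a built-in like `len` or a user-defined function can be handed over, since functions are ordinary objects.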
Get the number of words and the total length of words. >>> genpaku = ChasenCorpusReader('C:/Users/xxxxxxx/AppData/Roaming/nltk_data/jeita', 'g.*chasen', 'utf-8') >>> print len(genpaku.words()) 733016 >>> >>> print sum(len(w) for w in genpaku.words…
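The same counting pattern works on any list of tokens; here is a minimal sketch that does not assume the JEITA/ChaSen corpus is installed (the word list is made up for illustration):

```python
# The counting pattern from above, on a plain token list instead of
# genpaku.words() (the corpus path above is machine-specific).
words = ['Take', 'care', 'of', 'the', 'sense']

n_words = len(words)                      # number of tokens
total_chars = sum(len(w) for w in words)  # total length of words
avg_len = total_chars / float(n_words)    # average word length
```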
It is not necessary to declare the type of a variable in Python. As a result, unexpected behavior might happen. >>> def tag(word): ... if word in ['a', 'the', 'all']: ... return 'det' ... else: ... return 'noun' ... >>> tag('the') 'det' >>> tag…
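To illustrate the point: `tag` happily accepts arguments that were never intended, because nothing checks the type. This sketch shows the pitfall and one defensive fix (the `isinstance` check is my addition, not the book's):

```python
def tag(word):
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

# Works as intended for strings...
det = tag('the')       # 'det'
noun = tag('knight')   # 'noun'

# ...but a list argument is silently accepted too: the `in` test
# simply fails, so any non-determiner object gets tagged 'noun'.
surprise = tag(["'Tis"])

def tag_checked(word):
    # Defensive variant: fail fast on non-string input
    assert isinstance(word, str), 'tag expects a string'
    return 'det' if word in ['a', 'the', 'all'] else 'noun'
```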
Start by importing KNBC. Be careful, as there are some small mistakes in the textbook's sample code. >>> from nltk.corpus.reader.knbc import * >>> from nltk.corpus.util import LazyCorpusLoader >>> root = nltk.data.find('corpora/knb…
Learning about Functions (Chapter 4.4). For reuse, create a function and save it in a file. import re def get_text(file): """Read text from a file, normalizing whitespace and stripping HTML markup.""" text = open(file).read() text = re.sub('\s+'…
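The regular expressions are cut off above; a plausible reconstruction of the whole function (my guess at the markup-stripping step, not necessarily the book's exact regexes) would be:

```python
import re

def get_text(file):
    """Read text from a file, normalizing whitespace and
    stripping HTML markup (reconstruction; regexes are assumed)."""
    text = open(file).read()
    text = re.sub(r'<.*?>', ' ', text)  # drop HTML tags (non-greedy)
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    return text
```

Saved in a file such as `mymodule.py`, the function can then be pulled in with `from mymodule import get_text`.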
Import ChaSen: >>> from chasen import * Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named chasen >>> from nltk.corpus.reader.chasen import * According to the textbook, the corpus was …
This chapter of the whale book covers the very basics of Python coding style. I will just pick up some interesting examples. Two pieces of code are included that produce the same result. >>> tokens = nltk.corpus.brown.words(categories='news'…
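Since the excerpt is truncated before the two versions appear, here is the general shape of that comparison on a made-up token list (not the book's actual snippet, which needs the Brown corpus downloaded):

```python
tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'dog']

# Version 1: procedural -- accumulate a count in a loop
count = 0
for token in tokens:
    if token.lower() == 'the':
        count = count + 1

# Version 2: declarative -- a generator expression, same result
count2 = sum(1 for token in tokens if token.lower() == 'the')
```

Both give the same answer; the second states *what* is counted rather than *how*, which is the style the chapter recommends.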
>>> import nltk >>> from nltk.corpus.reader import * >>> from nltk.corpus.reader.util import * >>> from nltk.text import Text >>> >>> jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]') >>> jp_chartype_tokenizer = nltk.Regex…
Let's continue. >>> words = 'I turned off the spectroroute'.split() >>> wordlens = [(len(word), word) for word in words] >>> wordlens.sort() >>> ' '.join(w for (_, w) in wordlens) 'I off the turned spectroroute' >>> The first line is to spl…
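The tuple trick above is the classic decorate-sort-undecorate pattern; in current Python the same ordering falls out of `sorted()` with `key=len` directly. A quick sketch of the equivalence:

```python
words = 'I turned off the spectroroute'.split()

# Decorate-sort-undecorate, as in the session above:
# pair each word with its length, sort the pairs, drop the lengths.
wordlens = sorted((len(word), word) for word in words)
by_tuple = ' '.join(w for (_, w) in wordlens)

# Equivalent modern form: sort directly by length via a key function
by_key = ' '.join(sorted(words, key=len))
```

Python's sort is stable, so ties between equal-length words keep their original order in the `key=len` version, while the tuple version breaks ties alphabetically; for this sentence both happen to agree.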
This is about chapter 12 of the whale book. This chapter is only available in the Japanese version. I will keep writing in English, as this might be helpful for other double-byte character languages. Of course, I will continue the other chapters (…
Chapter 4 of the whale book looks like a grammar review of Python. Go to the next section, 4.2: using tuples. >>> t = 'walk', 'fem', 3 >>> t ('walk', 'fem', 3) >>> t[0] 'walk' >>> t[1:] ('fem', 3) >>> len(t) 3 >>> raw = 'I turned off the spectror…
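A couple more tuple idioms worth noting here (my own additions, in the same spirit as the session above): parentheses are optional when building a tuple, and tuple unpacking lets you swap values without a temporary variable.

```python
t = 'walk', 'fem', 3       # parentheses are optional
first, rest = t[0], t[1:]  # indexing and slicing work as for lists

# Tuple unpacking swaps two values without a temporary
a, b = 1, 2
a, b = b, a
```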