Entries from 2013-06-01 (one month)

NLTK handling with Cygwin

Chapter 12 of the Japanese edition of the whalebook describes how to handle Japanese. It seems to work well in my Mac environment (Mountain Lion); however, I faced several character corruption issues in my Windows 7 environment. Us…

Utilize Functions (4.5-4.5.2)

In Python, a function can itself be passed as a parameter to another function. >>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', ... 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.'] >>> def extract_property(prop): ... return [pro…
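The snippet above is cut off, so here is a runnable sketch of the same idea, completing `extract_property` along the lines of the book's example (the completion of the body is my reconstruction):

```python
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and',
        'the', 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def extract_property(prop):
    # apply the function passed in as 'prop' to every word in sent
    return [prop(word) for word in sent]

def last_letter(word):
    return word[-1]

print(extract_property(len))          # word lengths
print(extract_property(last_letter))  # final character of each word
```

Both calls pass a function object (`len`, `last_letter`) as an ordinary argument.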

Text processing with corpus (12.1.4)

Get the number of words and the total length of all words. >>> genpaku = ChasenCorpusReader('C:/Users/xxxxxxx/AppData/Roaming/nltk_data/jeita', 'g.*chasen', 'utf-8') >>> print len(genpaku.words()) 733016 >>> >>> print sum(len(w) for w in genpaku.words…
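The JEITA/ChaSen corpus isn't bundled here, so as a sketch, the same counting idiom applied to a small hand-made word list (the list is my own stand-in, not corpus data):

```python
# stand-in for genpaku.words(): a tiny tokenized Japanese sentence
words = ['吾輩', 'は', '猫', 'で', 'ある', '。']

print(len(words))                  # number of tokens
print(sum(len(w) for w in words))  # total characters across all tokens
```

With the real corpus reader, `genpaku.words()` would take the place of `words`.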

Check parameter type (4.4.4-4.4.6)

It is not necessary to declare the types of variables in Python, so unexpected behavior can occur when a function receives an argument of the wrong type. >>> def tag(word): ... if word in ['a', 'the', 'all']: ... return 'det' ... else: ... return 'noun' ... >>> tag('the') 'det' >>> tag…
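One defensive fix, in the spirit of section 4.4.4, is to add an `assert` guard so a wrong argument type fails loudly instead of silently returning 'noun' (this version uses Python 3's `str`; the book's original used `basestring`):

```python
def tag(word):
    # fail fast if the caller passes something other than a string
    assert isinstance(word, str), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

print(tag('the'))  # det
print(tag('dog'))  # noun
# tag(3) would now raise AssertionError instead of returning 'noun'
```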

Corpus with analyzed dependency structure (12.1.3)

Start by importing KNBC. Be careful, as there are some small mistakes in the textbook's sample code. >>> from nltk.corpus.reader.knbc import * >>> from nltk.corpus.util import LazyCorpusLoader >>> root = nltk.data.find('corpora/knb…

Functions (4.4.1-4.4.3)

Learning about functions (Chapter 4.4). For reuse, create a function and save it in a file. import re def get_text(file): """Read text from a file, normalizing whitespace and stripping HTML markup.""" text = open(file).read() text = re.sub('\s+'…
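The listing above is truncated, so here is a completed version of `get_text` along those lines (the exact substitution order is my reconstruction of the book's example):

```python
import re

def get_text(path):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    text = open(path).read()
    text = re.sub(r'<.*?>', ' ', text)  # replace HTML tags with a space
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    return text
```

Saved in a module, the function can then be imported and reused from any script.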

Corpus with Tags (12.1.2)

Import ChaSen: >>> from chasen import * Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named chasen >>> from nltk.corpus.reader.chasen import * According to the textbook, the corpus was …

Coding style (4.3)

This chapter of the whale book covers the very basics of Python coding style. I'll just pick out some interesting examples. Two pieces of code are included that produce the same result. >>> tokens = nltk.corpus.brown.words(categories='news'…
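As a self-contained sketch of the "two ways, same result" point (using my own short word list in place of the Brown corpus tokens):

```python
# stand-in for nltk.corpus.brown.words(categories='news')
tokens = 'the quick brown fox'.split()

# procedural style: accumulate in an explicit loop
total = 0
for t in tokens:
    total += len(t)

# declarative style: a single generator expression
total2 = sum(len(t) for t in tokens)

print(total, total2)  # both give 16
```

The declarative form is shorter and harder to get wrong, which is the book's point.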

Japanese corpus (12.1.1)

>>> import nltk >>> from nltk.corpus.reader import * >>> from nltk.corpus.reader.util import * >>> from nltk.text import Text >>> >>> jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]') >>> jp_chartype_tokenizer = nltk.Regex…
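The sentence tokenizer's regex can be tried without NLTK at all, using the plain `re` module; the pattern below follows the book's tokenizer regex, and the sample text is my own:

```python
import re

# same pattern as the book's jp_sent_tokenizer: any run of characters
# that are not quotes/sentence-enders, followed by a sentence ender
jp_sent_pattern = '[^ 「」!?。]*[!?。]'

text = 'これは文です。これも文です。'
print(re.findall(jp_sent_pattern, text))
```

Each match is one sentence ending in 。, !, or ?, which is exactly what `RegexpTokenizer` would return for the same pattern.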

Combining Different Sequence Types (4.2.2-4.2.3)

Let's continue. >>> words = 'I turned off the spectroroute'.split() >>> wordlens = [(len(word), word) for word in words] >>> wordlens.sort() >>> ' '.join(w for(_, w) in wordlens) 'I off the turned spectroroute' >>> The first line is to spl…
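Put together as a script, this is the classic decorate-sort-undecorate idiom from the snippet above:

```python
words = 'I turned off the spectroroute'.split()

# decorate: pair each word with its length
wordlens = [(len(word), word) for word in words]
# sort: tuples compare by length first, then alphabetically
wordlens.sort()
# undecorate: discard the length, keep the word
result = ' '.join(w for (_, w) in wordlens)
print(result)  # 'I off the turned spectroroute'
```

The underscore in `(_, w)` is a convention for a value we deliberately ignore.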

How to handle Japanese with Python (12)

This is about chapter 12 of the whale book. This chapter is only available in the Japanese edition. I will still write in English, as this might be helpful for other double-byte character languages. Of course, I will continue with the other chapters (…

Sequence (4.2.1)

Chapter 4 of the whale book looks like a grammar review of Python. Go on to the next section, 4.2, using tuples. >>> t = 'walk', 'fem', 3 >>> t ('walk', 'fem', 3) >>> t[0] 'walk' >>> t[1:] ('fem', 3) >>> len(t) 3 >>> raw = 'I turned off the spectror…
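The tuple basics above as a runnable script, with one extra line (my addition) showing that tuples are immutable:

```python
t = 'walk', 'fem', 3  # parentheses are optional when building a tuple

print(t[0])   # indexing works like a list
print(t[1:])  # slicing yields another tuple
print(len(t))

# unlike lists, tuples cannot be modified in place
try:
    t[0] = 'run'
except TypeError:
    print('tuples are immutable')
```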