Entries from 2013-05-01 (1 month)

Writing structured programs (4.1)

As I could not see the end of the exercises in Chapter 3, I decided to move on to Chapter 4 of the whale book. Now, back to basics (chapter 4.1): >>> foo = "Monty" >>> bar = foo >>> foo = "Python" >>> bar 'Monty' >>> foo 'Python' I have see…
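Assignment copies a reference, not the object, so with a mutable value the aliasing shows. A minimal sketch of the point this section makes next (the 'Bodkin' example is the book's, if I remember it right):
>>> foo = ['Monty', 'Python']
>>> bar = foo          # bar refers to the same list object as foo
>>> foo[1] = 'Bodkin'  # mutate the shared list in place
>>> bar                # the change shows through both names
['Monty', 'Bodkin']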

Exercise: Chapter 3 (18-21)

18. >>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt') >>> words = nltk.word_tokenize(text) >>> list = sorted(set([w for w in words if re.search(r'^wh', w.lower())])) >>> for word in list: ... print word ... WHALE WHALE-FISHERY…
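One nit I would fix on a second pass: naming the result `list` shadows the built-in type. The same search without the shadowing, as a sketch:
>>> import nltk, re
>>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
>>> words = nltk.word_tokenize(text)
>>> wh_words = sorted(set(w for w in words if re.search(r'^wh', w.lower())))
>>> for word in wh_words:
...     print word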

Exercise: Chapter 3 (14-17)

14. Using words.sort(): >>> words = ["banana", "pineapple", "peach", "apple", "orange", "mango", "maron", "nuts"] >>> words ['banana', 'pineapple', 'peach', 'apple', 'orange', 'mango', 'maron', 'nuts'] >>> words.sort() >>> words ['apple', 'b…
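The contrast the exercise is after, in a minimal sketch: sorted() returns a new list, while sort() works in place and returns None.
>>> words = ['banana', 'apple', 'cherry']
>>> sorted(words)         # returns a new sorted list
['apple', 'banana', 'cherry']
>>> words                 # the original is unchanged
['banana', 'apple', 'cherry']
>>> print words.sort()    # sorts in place and returns None
None
>>> words
['apple', 'banana', 'cherry']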

Exercise: Chapter 3 (10-13)

10. The original one is: >>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper'] >>> result = [] >>> for word in sent: ... word_len = (word, len(word)) ... result.append(word_len) ... >>> result [('The', 3), ('dog', 3), ('gave', 4), (…
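The loop collapses into a list comprehension, which is presumably what the exercise is driving at:
>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = [(word, len(word)) for word in sent]   # same pairs, one line
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]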

Exercise: Chapter 3 (7-9)

7. >>> nltk.re_show(r'\b(a|an|the)\b', 'brian a then an the man') brian {a} then {an} {the} man Usage of '\b' is the key point, I think. 8. >>> import urllib >>> def cleantags(url): ... raw_contents = urllib.urlopen(url).read() ... return n…
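The excerpt cuts off mid-function; the return line is presumably nltk.clean_html, the tag-stripping helper that NLTK 2.x shipped. A sketch of the whole thing:
>>> import urllib, nltk
>>> def cleantags(url):
...     raw_contents = urllib.urlopen(url).read()
...     return nltk.clean_html(raw_contents)   # strip HTML tags (NLTK 2.x helper)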

Exercise: Chapter 3 (1 - 6)

1. >>> s = 'colorless' >>> print s[:4] + 'u' + s[4:] colourless 2. >>> 'dogs'[:-1] 'dog' >>> 'dishes'[:-2] 'dish' >>> 'running'[:-4] 'run' >>> 'nationality'[:-5] 'nation' >>> 'undo'[:-2] 'un' >>> 'undo'[2:] 'do' >>> 'preheat'[3:] 'heat' 3.…

From list to string (3.9)

Now in chapter 3.9 of the whale book. How to use join(): >>> silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.'] >>> ' '.join(silly) 'We called him Tortoise because he taught us .' >>> ';'.join(silly) 'We;calle…
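split() goes the other way; a quick round trip as a sketch:
>>> silly = ['We', 'called', 'him', 'Tortoise']
>>> joined = ' '.join(silly)
>>> joined
'We called him Tortoise'
>>> joined.split(' ')    # back to the original list
['We', 'called', 'him', 'Tortoise']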

Segmentation (3.8)

Continuing with segmentation in chapter 3.8 of the whale book. Some corpora, for example the Brown corpus, can be accessed sentence by sentence, like this (average words per sentence): >>> len(nltk.corpus.brown.words())/len(nltk.corpus.brown.sents()) 20.250994070456922 NLTK has a functi…
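The function the excerpt is about to name is, if I remember the section right, the Punkt sentence tokenizer. A sketch in the book's NLTK 2.x style:
>>> import nltk
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sents = sent_tokenizer.tokenize(text)   # list of sentence strings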

Tokenizing text with regular expressions (3.7)

Chapter 3.7 of the whale book. >>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful ... tone though), 'I won't have any pepper in my kitchen AT ALL. Soup does very ... well without--Maybe it's always pepper that …
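The section builds up to nltk.regexp_tokenize on that raw string; a sketch close to the book's pattern (from memory, so treat the details as approximate):
>>> pattern = r'''(?x)        # verbose regexp: whitespace and comments are ignored
...       ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
...     | \w+(-\w+)*          # words with optional internal hyphens
...     | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
...     | \.\.\.              # ellipsis
...     | [][.,;"'?():-_`]    # these are separate tokens
... '''
>>> nltk.regexp_tokenize(raw, pattern)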

Normalization of text (3.6)

Chapter 3.6 of the whale book. >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords ... is no basis for a system of government. Supreme executive power derives from ... a mandate from the masses, not from some farci…
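Section 3.6 then compares the off-the-shelf stemmers and the WordNet lemmatizer; a minimal sketch on a few tokens from that speech:
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> tokens = ['lying', 'ponds', 'distributing', 'government']
>>> [porter.stem(t) for t in tokens]        # rule-based, fairly conservative
>>> [lancaster.stem(t) for t in tokens]     # more aggressive stemming
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]      # only maps to real dictionary words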

Find stem / Searching tokenized text (3.5.3-3.5.4)

Let's continue with chapter 3.5.3 of the whale book. One of the book's stemming examples: >>> def stem(word): ... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']: ... if word.endswith(suffix): ... return word[:-len(suffix)] .…
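The book then rewrites this loop as a single non-greedy regular expression; roughly:
>>> import re
>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]   # one (stem, suffix) tuple
...     return stem
...
>>> stem('processing')
'process'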

Usage of regular expressions (3.5.1 - 3.5.2)

Chapter 3.5 of the whale book. Find all the vowels (a, e, i, o, u) in a word: >>> word = 'supercalifragilisticexpialidocious' >>> re.findall(r'[aeiou]', word) ['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u'] >>>…
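The section goes on to count sequences of two or more vowels across a corpus; a sketch (NLTK 2.x FreqDist, which sorts items by frequency):
>>> import re, nltk
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()[:10]   # most frequent vowel sequences first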

Regular expressions (3.4)

Chapter 3.4 in the whale book. Preparation: >>> import re >>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()] Find words which end with "ed": >>> [w for w in wordlist if re.search('ed$', w)] ['abaissed', 'abandoned…
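Another pattern from the same section that I found memorable, as a sketch: character classes matching the letters on phone keypad keys 4653 (the "textonyms" example):
>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']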

Unicode text processing (3.3)

In my case, I will be handling double-byte languages such as Japanese and Chinese, so Unicode handling will be mandatory. Chapter 3.3 of the whale book covers Unicode handling. >>> path = nltk.data.find('corpora/unicode_samples/poli…
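The book reads that file via the codecs module; a sketch, assuming the sample is the Latin-2 encoded polish-lat2.txt I remember from this section:
>>> import codecs
>>> f = codecs.open(path, encoding='latin2')   # decode Latin-2 bytes into unicode on read
>>> for line in f:
...     line = line.strip()
...     print line.encode('unicode_escape')    # show non-ASCII chars as escape sequences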

Slicing text (3.2.3-3.2.6)

Continuing from yesterday, as of chapter 3.2.3 of the whale book. Slicing can be used not only on lists but also on strings; this already came up in chapter 1. >>> print monty Monty Python >>> monty[0] 'M' >>> monty[3] 't' >>> monty[5] …

Lowest level text processing (3.2.1-3.2.2)

Now at chapter 3.2.1 of the whale book. This part goes back to the basics of text processing. We need to escape with a backslash ('\') or use double quotes (") when single quotes (') appear in the text. >>> monty = 'Monty Python' >>> monty 'Mo…
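The two workarounds side by side (the Flying Circus example is the book's):
>>> circus = "Monty Python's Flying Circus"    # double quotes around the whole string
>>> circus = 'Monty Python\'s Flying Circus'   # or escape the apostrophe with a backslash
>>> circus
"Monty Python's Flying Circus"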

Processing RSS feeds (3.1.4) / Reading local files (3.1.5) and more

Let's resume as of chapter 3.1.4 of the whale book. For RSS feed processing, the Universal Feed Parser is introduced. However, feedparser.org did not exist when I tried; I could only find this site: http://code.google.com/p/feedparser/ Although you can down…
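Once feedparser is installed, the book's usage is roughly this (the Language Log feed is the book's running example):
>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)   # number of posts in the feed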

Processing HTML (3.1.2) / Search engine (3.1.3)

Continuing as of chapter 3.1.2 of the whale book. >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm" >>> html = urlopen(url).read() >>> html[:60] '<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN' This displays the sour…
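The next step in the section strips the markup and tokenizes; with the NLTK 2.x API that is roughly:
>>> raw = nltk.clean_html(html)   # strip tags (NLTK 2.x helper; later versions defer to BeautifulSoup)
>>> tokens = nltk.word_tokenize(raw)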

Accessing text sources (3.1.1)

Now I start Chapter 3 of the whale book. Let's import "Crime and Punishment" from the Gutenberg ebook collection. >>> from __future__ import division >>> import nltk, re, pprint >>> from urllib import urlopen >>> url = "http://www.gutenberg.org/fil…
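Once fetched, the download is just a plain string; the section goes on to tokenize it into an NLTK Text, roughly:
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)   # wrap tokens for concordance(), collocations(), etc.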

Exercise: Chapter 2 (20-22)

20. >>> def word_freq(word, section): ... fdist = FreqDist([w for w in nltk.corpus.brown.words(categories=section)]) ... return fdist.__getitem__(word) ... >>> word_freq('love', 'romance') 32 >>> word_freq('city', 'government') 7 >>> word_…
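fdist.__getitem__(word) works, but plain indexing is the idiomatic spelling of the same call; a sketch:
>>> def word_freq(word, section):
...     fdist = nltk.FreqDist(nltk.corpus.brown.words(categories=section))
...     return fdist[word]   # equivalent to fdist.__getitem__(word)
...
>>> word_freq('love', 'romance')
32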

Exercise: Chapter 2 (16-19)

16. >>> for category in nltk.corpus.brown.categories(): ... token = len(nltk.corpus.brown.words(categories=category)) ... vocab = len(set(nltk.corpus.brown.words(categories=category))) ... divst = token / vocab ... print category, token, v…

Exercise: Chapter 2 (12-15)

12. >>> entries = nltk.corpus.cmudict.entries() >>> len(entries) 133737 >>> words = [word for word, pron in entries] >>> len(words) 133737 >>> len(set(words)) 123455 >>> from __future__ import division >>> 1 - (len(set(words)) / len(words)…

Installing NLTK (Mac)

I had another chance to set up NLTK on another Mac (Mountain Lion). Here I leave the logs... Kens-Macbook-Air-2010:~ ken$ easy_install pip Searching for pip Best match: pip 1.3.1 Processing pip-1.3.1-py2.7.egg pip 1.3.1 is already the active ver…

Exercise: Chapter 2 (8-11)

8. >>> names = nltk.corpus.names >>> names.fileids() ['female.txt', 'male.txt'] >>> cfd = nltk.ConditionalFreqDist( ... (fileid, name[0]) ... for fileid in names.fileids() ... for name in names.words(fileid)) >>> cfd.plot() There are more …

Exercise: Chapter 2 (1-7)

Although it took a long time, I have now reached the end of Chapter 2 of the whale book. 1. >>> words1 = ['green', 'yellow', 'red', 'white', 'black'] >>> words2 = ['pink', 'brown'] >>> words3 = words1 + words2 >>> words3 ['green', 'yellow', …