Entries from 2013-05-01 (1 month)

Writing structured programs (4.1)

As I could not see the end of the exercises in Chapter 3, I decided to move on to Chapter 4 of the whale book. Now, back to basics (chapter 4.1): >>> foo = "Monty" >>> bar = foo >>> foo = "Python" >>> bar 'Monty' >>> foo 'Python' I have see…
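Assignment copies a reference, not the object, so with a mutable value the aliasing shows. A minimal sketch of the point this section makes next (the 'Bodkin' example is the book's, if I remember it right):
>>> foo = ['Monty', 'Python']
>>> bar = foo          # bar refers to the same list object as foo
>>> foo[1] = 'Bodkin'  # mutate the shared list in place
>>> bar                # the change shows through both names
['Monty', 'Bodkin']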

Exercise: Chapter 3 (18-21)

18. >>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt') >>> words = nltk.word_tokenize(text) >>> list = sorted(set([w for w in words if re.search(r'^wh', w.lower())])) >>> for word in list: ... print word ... WHALE WHALE-FISHERY…
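One nit I would fix on a second pass: naming the result `list` shadows the built-in type. The same search without the shadowing, as a sketch:
>>> import nltk, re
>>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
>>> words = nltk.word_tokenize(text)
>>> wh_words = sorted(set(w for w in words if re.search(r'^wh', w.lower())))
>>> for word in wh_words:
...     print word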

Exercise: Chapter 3 (14-17)

14. Using words.sort(): >>> words = ["banana", "pineapple", "peach", "apple", "orange", "mango", "maron", "nuts"] >>> words ['banana', 'pineapple', 'peach', 'apple', 'orange', 'mango', 'maron', 'nuts'] >>> words.sort() >>> words ['apple', 'b…
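The contrast the exercise is after, in a minimal sketch: sorted() returns a new list, while sort() works in place and returns None.
>>> words = ['banana', 'apple', 'cherry']
>>> sorted(words)         # returns a new sorted list
['apple', 'banana', 'cherry']
>>> words                 # the original is unchanged
['banana', 'apple', 'cherry']
>>> print words.sort()    # sorts in place and returns None
None
>>> words
['apple', 'banana', 'cherry']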

Exercise: Chapter 3 (10-13)

10. The original one is: >>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper'] >>> result = [] >>> for word in sent: ... word_len = (word, len(word)) ... result.append(word_len) ... >>> result [('The', 3), ('dog', 3), ('gave', 4), (…
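The loop collapses into a list comprehension, which is presumably what the exercise is driving at:
>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = [(word, len(word)) for word in sent]   # same pairs, one line
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]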

Exercise: Chapter 3 (7-9)

7. >>> nltk.re_show(r'\b(a|an|the)\b', 'brian a then an the man') brian {a} then {an} {the} man Usage of '\b' is the key point, I think. 8. >>> import urllib >>> def cleantags(url): ... raw_contents = urllib.urlopen(url).read() ... return n…
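The excerpt cuts off mid-function; the return line is presumably nltk.clean_html, the tag-stripping helper that NLTK 2.x shipped. A sketch of the whole thing:
>>> import urllib, nltk
>>> def cleantags(url):
...     raw_contents = urllib.urlopen(url).read()
...     return nltk.clean_html(raw_contents)   # strip HTML tags (NLTK 2.x helper)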

Exercise: Chapter 3 (1 - 6)

1. >>> s = 'colorless' >>> print s[:4] + 'u' + s[4:] colourless 2. >>> 'dogs'[:-1] 'dog' >>> 'dishes'[:-2] 'dish' >>> 'running'[:-4] 'run' >>> 'nationality'[:-5] 'nation' >>> 'undo'[:-2] 'un' >>> 'undo'[2:] 'do' >>> 'preheat'[3:] 'heat' 3.…

From list to string (3.9)

Now in chapter 3.9 of the whale book. How to use join(): >>> silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.'] >>> ' '.join(silly) 'We called him Tortoise because he taught us .' >>> ';'.join(silly) 'We;calle…
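split() goes the other way; a quick round trip as a sketch:
>>> silly = ['We', 'called', 'him', 'Tortoise']
>>> joined = ' '.join(silly)
>>> joined
'We called him Tortoise'
>>> joined.split(' ')    # back to the original list
['We', 'called', 'him', 'Tortoise']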

Segmentation (3.8)

Continuing with segmentation in chapter 3.8 of the whale book. Some corpora, for example the Brown corpus, can be accessed sentence by sentence, like this (average words per sentence): >>> len(nltk.corpus.brown.words())/len(nltk.corpus.brown.sents()) 20.250994070456922 NLTK has a functi…
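The function the excerpt is about to name is, if I remember the section right, the Punkt sentence tokenizer. A sketch in the book's NLTK 2.x style:
>>> import nltk
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sents = sent_tokenizer.tokenize(text)   # list of sentence strings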

Tokenizing text with regular expressions (3.7)

Chapter 3.7 of the whale book. >>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful ... tone though), 'I won't have any pepper in my kitchen AT ALL. Soup does very ... well without--Maybe it's always pepper that …
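The section builds up to nltk.regexp_tokenize on that raw string; a sketch close to the book's pattern (from memory, so treat the details as approximate):
>>> pattern = r'''(?x)        # verbose regexp: whitespace and comments are ignored
...       ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
...     | \w+(-\w+)*          # words with optional internal hyphens
...     | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
...     | \.\.\.              # ellipsis
...     | [][.,;"'?():-_`]    # these are separate tokens
... '''
>>> nltk.regexp_tokenize(raw, pattern)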

Normalization of text (3.6)

Chapter 3.6 of the whale book. >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords ... is no basis for a system of government. Supreme executive power derives from ... a mandate from the masses, not from some farci…
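Section 3.6 then compares the off-the-shelf stemmers and the WordNet lemmatizer; a minimal sketch on a few tokens from that speech:
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> tokens = ['lying', 'ponds', 'distributing', 'government']
>>> [porter.stem(t) for t in tokens]        # rule-based, fairly conservative
>>> [lancaster.stem(t) for t in tokens]     # more aggressive stemming
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]      # only maps to real dictionary words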

Find stem / Searching tokenized text (3.5.3-3.5.4)

Let's continue with chapter 3.5.3 of the whale book. One of the book's stemming examples: >>> def stem(word): ... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']: ... if word.endswith(suffix): ... return word[:-len(suffix)] .…
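The book then rewrites this loop as a single non-greedy regular expression; roughly:
>>> import re
>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]   # one (stem, suffix) tuple
...     return stem
...
>>> stem('processing')
'process'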

Usage of regular expressions (3.5.1 - 3.5.2)

Chapter 3.5 of the whale book. Find all the vowels (a, e, i, o, u) in a word: >>> word = 'supercalifragilisticexpialidocious' >>> re.findall(r'[aeiou]', word) ['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u'] >>>…
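The section goes on to count sequences of two or more vowels across a corpus; a sketch (NLTK 2.x FreqDist, which sorts items by frequency):
>>> import re, nltk
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()[:10]   # most frequent vowel sequences first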

Regular expressions (3.4)

Chapter 3.4 in the whale book. Preparation: >>> import re >>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()] Find words which end with "ed": >>> [w for w in wordlist if re.search('ed$', w)] ['abaissed', 'abandoned…
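Another pattern from the same section that I found memorable, as a sketch: character classes matching the letters on phone keypad keys 4653 (the "textonyms" example):
>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']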

Unicode text processing (3.3)

In my case, I will be handling double-byte languages such as Japanese and Chinese, so Unicode handling will be mandatory. Chapter 3.3 of the whale book covers Unicode handling. >>> path = nltk.data.find('corpora/unicode_samples/poli…
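The book reads that file via the codecs module; a sketch, assuming the sample is the Latin-2 encoded polish-lat2.txt I remember from this section:
>>> import codecs
>>> f = codecs.open(path, encoding='latin2')   # decode Latin-2 bytes into unicode on read
>>> for line in f:
...     line = line.strip()
...     print line.encode('unicode_escape')    # show non-ASCII chars as escape sequences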

Slicing text (3.2.3-3.2.6)

Continuing from yesterday, as of chapter 3.2.3 of the whale book. Slicing can be used not only on lists but also on strings; this already came up in chapter 1. >>> print monty Monty Python >>> monty[0] 'M' >>> monty[3] 't' >>> monty[5] …

Lowest level text processing (3.2.1-3.2.2)

Now at chapter 3.2.1 of the whale book. This part goes back to the basics of text processing. We need to escape with a backslash ('\') or use double quotes (") when single quotes (') appear in the text. >>> monty = 'Monty Python' >>> monty 'Mo…
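The two workarounds side by side (the Flying Circus example is the book's):
>>> circus = "Monty Python's Flying Circus"    # double quotes around the whole string
>>> circus = 'Monty Python\'s Flying Circus'   # or escape the apostrophe with a backslash
>>> circus
"Monty Python's Flying Circus"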

Processing RSS feeds (3.1.4) / Reading local files (3.1.5) and more

Let's resume as of chapter 3.1.4 of the whale book. For RSS feed processing, the Universal Feed Parser is introduced. However, feedparser.org did not exist when I tried; I could only find this site: http://code.google.com/p/feedparser/ Although you can down…
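Once feedparser is installed, the book's usage is roughly this (the Language Log feed is the book's running example):
>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)   # number of posts in the feed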

Processing HTML (3.1.2) / Search engine (3.1.3)

Continuing as of chapter 3.1.2 of the whale book. >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm" >>> html = urlopen(url).read() >>> html[:60] '<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN' This displays the sour…
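The next step in the section strips the markup and tokenizes; with the NLTK 2.x API that is roughly:
>>> raw = nltk.clean_html(html)   # strip tags (NLTK 2.x helper; later versions defer to BeautifulSoup)
>>> tokens = nltk.word_tokenize(raw)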

Accessing text sources (3.1.1)

Now I start Chapter 3 of the whale book. Let's import "Crime and Punishment" from the Gutenberg ebook collection. >>> from __future__ import division >>> import nltk, re, pprint >>> from urllib import urlopen >>> url = "http://www.gutenberg.org/fil…
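Once fetched, the download is just a plain string; the section goes on to tokenize it into an NLTK Text, roughly:
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)   # wrap tokens for concordance(), collocations(), etc.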

Exercise: Chapter 2 (20-22)

20. >>> def word_freq(word, section): ... fdist = FreqDist([w for w in nltk.corpus.brown.words(categories=section)]) ... return fdist.__getitem__(word) ... >>> word_freq('love', 'romance') 32 >>> word_freq('city', 'government') 7 >>> word_…
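fdist.__getitem__(word) works, but plain indexing is the idiomatic spelling of the same call; a sketch:
>>> def word_freq(word, section):
...     fdist = nltk.FreqDist(nltk.corpus.brown.words(categories=section))
...     return fdist[word]   # equivalent to fdist.__getitem__(word)
...
>>> word_freq('love', 'romance')
32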

Exercise: Chapter 2 (16-19)

16. >>> for category in nltk.corpus.brown.categories(): ... token = len(nltk.corpus.brown.words(categories=category)) ... vocab = len(set(nltk.corpus.brown.words(categories=category))) ... divst = token / vocab ... print category, token, v…

Exercise: Chapter 2 (12-15)

12. >>> entries = nltk.corpus.cmudict.entries() >>> len(entries) 133737 >>> words = [word for word, pron in entries] >>> len(words) 133737 >>> len(set(words)) 123455 >>> from __future__ import division >>> 1 - (len(set(words)) / len(words)…

Installing NLTK (Mac)

I had another chance to set up NLTK on another Mac (Mountain Lion). Here I leave the logs... Kens-Macbook-Air-2010:~ ken$ easy_install pip Searching for pip Best match: pip 1.3.1 Processing pip-1.3.1-py2.7.egg pip 1.3.1 is already the active ver…

Exercise: Chapter 2 (8-11)

8. >>> names = nltk.corpus.names >>> names.fileids() ['female.txt', 'male.txt'] >>> cfd = nltk.ConditionalFreqDist( ... (fileid, name[0]) ... for fileid in names.fileids() ... for name in names.words(fileid)) >>> cfd.plot() There are more …

Exercise: Chapter 2 (1-7)

Although it took a long time, I have now reached the end of Chapter 2 of the whale book. 1. >>> words1 = ['green', 'yellow', 'red', 'white', 'black'] >>> words2 = ['pink', 'brown'] >>> words3 = words1 + words2 >>> words3 ['green', 'yellow', …