Algorithm design 2 (4.7.2-4.7.3) - Deutschina's Tech Diary

Improve search speed by building index.

def raw(file):
     contents = open(file).read()
     contents = re.sub(r'&lt;.*?>', ' ', contents)
     contents = re.sub('\s+', ' ', contents)
     return contents
 
def snippet(doc, term):     #buggy
     text = ' ' * 30 + raw(doc) + ' ' * 30
     pos = text.index(term)
     return text[pos-30:pos+30]

import nltk, re

files = nltk.corpus.movie_reviews.abspaths()
idx = nltk.Index((w, f) for f in files for w in raw(f).split())

This could be executed only once from the file, I entered command as following:

>>> query = ''
>>> while query != "quit":
...     query = raw_input("query> ")
...     if query in idx:
...             for doc in idx[query]:
...                     print snippet(doc, query)
...     else:
...             print "not found..."
... 

query> japan
s . evil ninjas in modern-day japan . . . and nobody wins . 
s . evil ninjas in modern-day japan . . . and nobody wins . 
ilm made outside his familial japan , bringing the yakuza tr
h gangster movies made * in * japan , there are whole gobs o
ot to erase " the data " from japan , a desire to clone and 
ew people can tolerate ) from japan that turn up really late
mated sci-fi movies come from japan , because " titan a . e 
the first ten minutes of this japanese film , you will never
eted , they will be sold to a japanese toy museum , and fina
tor of last year's best-known japanese animated film import 
tor of last year's best-known japanese animated film import 
tor of last year's best-known japanese animated film import 
tor of last year's best-known japanese animated film import 
rector : juzo itami country : japan 1997 cinematography : yo
cond machine off the coast of japan and has reserved the rid
 mind , please tell me . this japanese flick is supposed to 
amurai's honor code of feudal japan against the tradition bo
d chapter of wwii , life in a japanese prison camp . to end 
d chapter of wwii , life in a japanese prison camp . to end 
grammer ) -to a toy museum in japan . the gang is happy abou
grammer ) -to a toy museum in japan . the gang is happy abou
everything to a toy museum in japan for a huge profit . so ,
everything to a toy museum in japan for a huge profit . so ,
query>

BTW, I tried to quit, the results was like:

query> quit
s funded by her mother . lucy quit working professionally 10
erick . i disliked that movie quite a bit , but since " prac
t disaster . babe ruth didn't quit baseball after one season
o-be fiance . i think she can quit that job and get a more r
 and rose mcgowan should just quit acting . she has no chari
and get a day job . and don't quit it .                     
 kubrick , alas , should have quit while he was ahead . this
everyone involved should have quit while they were still ahe
l die . so what does joe do ? quit his job , of course ! ! w
red " implant . he's ready to quit the biz and get a portion
hat he always recorded , they quit and become disillusioned 
 admit that i ? ? ? ve become quite the " scream " fan . no 
 again , the fact that he has quit his job to feel what it's
school reunion . he has since quit his job as a travel journ
ells one of his friends , " i quit school because i didn't l
ms , cursing off the boss and quitting his job ( " today i q
e , the arrival of the now ubiquitous videocassette . burt r
in capitol city , that he has quit his job and hopes to open
before his death at age 67 to quit filmmaking once a homosex
 - joss's explanation that he quit the priesthood because of
 is a former prosecutor , and quit because of tensions betwe

This is to create dictionally.

>>> def preprocess(tagged_corpus):
...     words = set()
...     tags = set()
...     for sent in tagged_corpus:
...             for word, tag in sent:
...                     words.add(word)
...                     tags.add(tag)
...     vm = dict((w,i) for (i,w) in enumerate(words))
...     tm = dict((t,i) for (i,t) in enumerate(tags))
...     return [[(wm[w], tm[t]) for (w, t) in sent] for sent in tagged_corpus]

>>> from timeit import Timer
>>> vocab_size = 100000
>>> setup_list = "import random; vocab = range(%d)" % vocab_size
>>> setup_set = "import random; vocab = set(range(%d))" % vocab_size
>>> statement = "random.randint(0, %d) in vocab" % vocab_size * 2
>>> print Timer(statement, setup_list).timeit(1000)
Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py", line 195, in timeit
    timing = self.inner(it, self.timer)
  File "&lt;timeit-src>", line 6, in inner
NameError: global name 'vocabrandom' is not defined

I searched and the reason of the error was this line. Without (), the statement itself was doubled like "random.randint(0, %d) in vocabrandom.randint(0, %d) in vocab". After revised, I got a result.

>>> statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)
>>> print Timer(statement, setup_list).timeit(1000)
1.3606569767
>>> print Timer(statement, setup_set).timeit(1000)
0.00187087059021

Comparing 4 logics at last.

>>> def virahanka1(n):
...     if n == 0:
...             return [""]
...     elif n == 1:
...             return ["S"]
...     else:
...             s = ["S" + prosody for prosody in virahanka1(n-1)]
...             l = ["L" + prosody for prosody in virahanka1(n-2)]
...             return s + l
... 
>>> virahanka1(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>>
>>> def virahanka2(n):
...     lookup = [[""], ["S"]]
...     for i in range(n-1):
...             s = ["S" + prosody for prosody in lookup[i+1]]
...             l = ["L" + prosody for prosody in lookup[i]]
...             lookup.append(s + l)
...     return lookup[n]
... 
>>> virahanka2(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>>
>>> def virahanka3(n, lookup={0:[""], 1:["S"]}):
...     if n not in lookup:
...             s = ["S" + prosody for prosody in virahanka3(n-1)]
...             l = ["L" + prosody for prosody in virahanka3(n-2)]
...             lookup[n] = s + l
...     return lookup[n]
... 
>>> virahanka3(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>>
>>> from nltk import memoize
>>> @memoize
... def virahanka4(n):
...     if n == 0:
...             return [""]
...     elif n == 1:
...             return ["S"]
...     else:
...             s = ["S" + prosody for prosody in virahanka4(n-1)]
...             l = ["L" + prosody for prosody in virahanka4(n-2)]
...             return s + l
... 
>>> virahanka4(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>>

I needed to read through this chapter several times to understand. The key is to store and reuse previous results for the new calculation. Especially to calculate large data volume, this logic could be helpful.