Algorithm design 2 (4.7.2-4.7.3)
Improving search speed by building an index of the movie reviews corpus.
import nltk, re

def raw(file):
    contents = open(file).read()
    contents = re.sub(r'<.*?>', ' ', contents)
    contents = re.sub(r'\s+', ' ', contents)
    return contents

def snippet(doc, term):  # buggy
    text = ' ' * 30 + raw(doc) + ' ' * 30
    pos = text.index(term)
    return text[pos-30:pos+30]

files = nltk.corpus.movie_reviews.abspaths()
idx = nltk.Index((w, f) for f in files for w in raw(f).split())
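The nltk.Index call above builds an inverted index: a mapping from each word to the list of files containing it. A minimal sketch of the same idea using only the standard library (the two document names and their texts are invented for illustration):

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set(): record each document at most once per word
        for word in set(docs[doc_id].split()):
            index[word].append(doc_id)
    return index

# Hypothetical two-document corpus for illustration.
docs = {
    "review1.txt": "evil ninjas in modern-day japan",
    "review2.txt": "a toy museum in japan",
}
index = build_index(docs)
print(index["japan"])   # ['review1.txt', 'review2.txt']
print(index["ninjas"])  # ['review1.txt']
```

The point of the index is that a query now costs one dictionary lookup instead of re-scanning every file.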
The slow indexing step only needs to be executed once per file collection; after that, searches are fast. I entered queries as follows:
>>> query = ''
>>> while query != "quit":
...     query = raw_input("query> ")
...     if query in idx:
...         for doc in idx[query]:
...             print snippet(doc, query)
...     else:
...         print "not found..."
...
query> japan
s . evil ninjas in modern-day japan . . . and nobody wins .
s . evil ninjas in modern-day japan . . . and nobody wins .
ilm made outside his familial japan , bringing the yakuza tr
h gangster movies made * in * japan , there are whole gobs o
ot to erase " the data " from japan , a desire to clone and
ew people can tolerate ) from japan that turn up really late
mated sci-fi movies come from japan , because " titan a . e
the first ten minutes of this japanese film , you will never
eted , they will be sold to a japanese toy museum , and fina
tor of last year's best-known japanese animated film import
tor of last year's best-known japanese animated film import
tor of last year's best-known japanese animated film import
tor of last year's best-known japanese animated film import
rector : juzo itami country : japan 1997 cinematography : yo
cond machine off the coast of japan and has reserved the rid
mind , please tell me . this japanese flick is supposed to
amurai's honor code of feudal japan against the tradition bo
d chapter of wwii , life in a japanese prison camp . to end
d chapter of wwii , life in a japanese prison camp . to end
grammer ) -to a toy museum in japan . the gang is happy abou
grammer ) -to a toy museum in japan . the gang is happy abou
everything to a toy museum in japan for a huge profit . so ,
everything to a toy museum in japan for a huge profit . so ,
query>
Incidentally, when I typed "quit" to exit, the loop body still ran the search once before the while condition was re-checked, so I got results for "quit" — including false hits like "quite", "ubiquitous", and "quitting", because the buggy snippet() matches substrings rather than whole words:
query> quit
s funded by her mother . lucy quit working professionally 10
erick . i disliked that movie quite a bit , but since " prac
t disaster . babe ruth didn't quit baseball after one season
o-be fiance . i think she can quit that job and get a more r
and rose mcgowan should just quit acting . she has no chari
and get a day job . and don't quit it .
kubrick , alas , should have quit while he was ahead . this
everyone involved should have quit while they were still ahe
l die . so what does joe do ? quit his job , of course ! ! w
red " implant . he's ready to quit the biz and get a portion
hat he always recorded , they quit and become disillusioned
admit that i ? ? ? ve become quite the " scream " fan . no
again , the fact that he has quit his job to feel what it's
school reunion . he has since quit his job as a travel journ
ells one of his friends , " i quit school because i didn't l
ms , cursing off the boss and quitting his job ( " today i q
e , the arrival of the now ubiquitous videocassette . burt r
in capitol city , that he has quit his job and hopes to open
before his death at age 67 to quit filmmaking once a homosex
- joss's explanation that he quit the priesthood because of
is a former prosecutor , and quit because of tensions betwe
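Those stray hits ("quite", "ubiquitous", "quitting") come from snippet() using plain substring search. One way to fix it — my own variant, not the book's — is to match the term only at word boundaries with a regex:

```python
import re

def snippet_fixed(text, term, width=30):
    """Return a window of `width` characters on each side of `term`,
    matching `term` only as a whole word (\\b = word boundary)."""
    padded = " " * width + text + " " * width
    m = re.search(r"\b%s\b" % re.escape(term), padded)
    if m is None:
        return None
    return padded[m.start() - width : m.start() + width]

text = "the now ubiquitous videocassette made him quit his job"
print(snippet_fixed(text, "quit"))  # skips "ubiquitous", finds the word "quit"
print(snippet_fixed("she was quite happy", "quit"))  # None: no whole-word match
```

re.escape guards against query terms containing regex metacharacters; the padding keeps the slice in bounds when the match is near either end of the text.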
This next function creates dictionaries mapping words and tags to integer ids, so a tagged corpus can be stored compactly as pairs of numbers:
>>> def preprocess(tagged_corpus):
...     words = set()
...     tags = set()
...     for sent in tagged_corpus:
...         for word, tag in sent:
...             words.add(word)
...             tags.add(tag)
...     wm = dict((w, i) for (i, w) in enumerate(words))
...     tm = dict((t, i) for (i, t) in enumerate(tags))
...     return [[(wm[w], tm[t]) for (w, t) in sent] for sent in tagged_corpus]
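To see what preprocess produces, here is a sketch of the same function run on a tiny invented tagged corpus. I added sorted() so the ids are reproducible; in the original the ids depend on set iteration order:

```python
def preprocess(tagged_corpus):
    """Replace word and tag strings by integer ids."""
    words, tags = set(), set()
    for sent in tagged_corpus:
        for word, tag in sent:
            words.add(word)
            tags.add(tag)
    # sorted(): deterministic ids (not in the original, which used raw set order)
    wm = dict((w, i) for i, w in enumerate(sorted(words)))
    tm = dict((t, i) for i, t in enumerate(sorted(tags)))
    return [[(wm[w], tm[t]) for w, t in sent] for sent in tagged_corpus]

# Hypothetical two-sentence tagged corpus.
corpus = [[("the", "DT"), ("dog", "NN")], [("the", "DT"), ("cat", "NN")]]
print(preprocess(corpus))  # [[(2, 0), (1, 1)], [(2, 0), (0, 1)]]
```

Each word now occupies a small integer instead of a repeated string, which saves memory and makes later comparisons cheap.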
>>> from timeit import Timer
>>> vocab_size = 100000
>>> setup_list = "import random; vocab = range(%d)" % vocab_size
>>> setup_set = "import random; vocab = set(range(%d))" % vocab_size
>>> statement = "random.randint(0, %d) in vocab" % vocab_size * 2
>>> print Timer(statement, setup_list).timeit(1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py", line 195, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
NameError: global name 'vocabrandom' is not defined
After some searching, I found that the error came from the statement line. Without parentheses, the `* 2` applied to the formatted string rather than to vocab_size, so the statement itself was doubled into "random.randint(0, %d) in vocabrandom.randint(0, %d) in vocab". After revising it, I got a result.
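The rule behind this: % and * have equal precedence and group left to right, so the string formatting runs first and then the resulting string is repeated. A small demonstration with a stand-in template:

```python
# % and * have equal precedence and associate left to right:
doubled = "check %d: " % 1 * 2       # parsed as ("check %d: " % 1) * 2
print(repr(doubled))                 # 'check 1: check 1: '

correct = "check %d: " % (1 * 2)     # parentheses force the multiplication first
print(repr(correct))                 # 'check 2: '
```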
>>> statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)
>>> print Timer(statement, setup_list).timeit(1000)
1.3606569767
>>> print Timer(statement, setup_set).timeit(1000)
0.00187087059021
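The roughly 700x gap is expected: `in` on a list scans the elements one by one (O(n)), while a set is a hash table with average O(1) membership tests. A Python 3 version of the same benchmark, for anyone following along on a modern interpreter (absolute timings will vary by machine):

```python
from timeit import Timer

vocab_size = 100000
setup_list = "import random; vocab = list(range(%d))" % vocab_size
setup_set = "import random; vocab = set(range(%d))" % vocab_size
# Parenthesized, so the multiplication happens before the formatting:
statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)

t_list = Timer(statement, setup_list).timeit(1000)
t_set = Timer(statement, setup_set).timeit(1000)
print("list: %.4fs  set: %.4fs" % (t_list, t_set))  # set should be far faster
```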
Finally, comparing four implementations of the Virahanka meter problem (all the ways to fill n beats with short S and long L syllables).
>>> def virahanka1(n):
...     if n == 0:
...         return [""]
...     elif n == 1:
...         return ["S"]
...     else:
...         s = ["S" + prosody for prosody in virahanka1(n-1)]
...         l = ["L" + prosody for prosody in virahanka1(n-2)]
...         return s + l
...
>>> virahanka1(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> def virahanka2(n):
...     lookup = [[""], ["S"]]
...     for i in range(n-1):
...         s = ["S" + prosody for prosody in lookup[i+1]]
...         l = ["L" + prosody for prosody in lookup[i]]
...         lookup.append(s + l)
...     return lookup[n]
...
>>> virahanka2(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> def virahanka3(n, lookup={0:[""], 1:["S"]}):
...     if n not in lookup:
...         s = ["S" + prosody for prosody in virahanka3(n-1)]
...         l = ["L" + prosody for prosody in virahanka3(n-2)]
...         lookup[n] = s + l
...     return lookup[n]
...
>>> virahanka3(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> from nltk import memoize
>>> @memoize
... def virahanka4(n):
...     if n == 0:
...         return [""]
...     elif n == 1:
...         return ["S"]
...     else:
...         s = ["S" + prosody for prosody in virahanka4(n-1)]
...         l = ["L" + prosody for prosody in virahanka4(n-2)]
...         return s + l
...
>>> virahanka4(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
I needed to read through this chapter several times to understand it. The key idea is to store earlier results and reuse them in later calculations instead of recomputing them. Especially when processing large amounts of data, this technique (dynamic programming) can make a big difference.
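nltk's @memoize is one implementation of this idea; the standard library offers the same caching via functools.lru_cache. A sketch of virahanka4 rewritten with it:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every result, like nltk's memoize
def virahanka(n):
    """All meters of duration n built from short (S, 1 beat) and long (L, 2 beats)."""
    if n == 0:
        return [""]
    if n == 1:
        return ["S"]
    # Every meter starts with either S (n-1 beats remain) or L (n-2 beats remain).
    return (["S" + p for p in virahanka(n - 1)]
            + ["L" + p for p in virahanka(n - 2)])

print(virahanka(4))  # ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
```

One caveat of this design: lru_cache returns the cached list object itself, so callers must not mutate the result — otherwise the cache is corrupted for later calls. Returning tuples instead would avoid that.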