Finding word stems / Searching tokenized text (3.5.3-3.5.4)
Let's continue. Chapter 3.5.3 of the whale book.
Here is one example of stemming.
>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
...
>>> stem('various')
'var'
Strangely, I could not reproduce the textbook's results for the next example.
>>> import re
>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
['']
>>> re.findall(r'(ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
['ing', '']
>>> re.findall('^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
['processing']
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
[('processing', '')]
Finally I found the reason.
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]
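I had mistakenly typed an unnecessary '|' after "ment". That trailing '|' adds an empty alternative to the suffix group, so the greedy '.*' swallows the whole word and the suffix matches the empty string.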
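To make the effect of the stray '|' easier to see, here is a small sketch that builds the same pattern with and without it. The helper name split_suffix and the SUFFIXES constant are made up for this illustration; only the re module is used.

```python
import re

# Suffix alternatives from the examples above, without a trailing '|'.
SUFFIXES = 'ing|ly|ed|ious|ies|ive|es|s|ment'

def split_suffix(word, alternatives):
    """Return re.findall's (stem, suffix) pairs for a greedy stem group."""
    pattern = r'^(.*)(' + alternatives + r')$'
    return re.findall(pattern, word)

# A trailing '|' adds an empty alternative, so '.*' keeps the whole word.
print(split_suffix('processing', SUFFIXES + '|'))  # [('processing', '')]

# Without it, '.*' has to give characters back until a real suffix fits.
print(split_suffix('processing', SUFFIXES))        # [('process', 'ing')]
```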
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'processes')
[('process', 'es')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]
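The two changes in this step are the non-greedy '*?' and the '?' that makes the suffix optional. The sketch below compares each variant side by side; the variable names are only for illustration.

```python
import re

SUFFIXES = r'(ing|ly|ed|ious|ies|ive|es|s|ment)'

# Greedy '.*' grabs as much as it can, so only the final 's' is left over.
greedy = re.findall(r'^(.*)' + SUFFIXES + r'$', 'processes')

# Non-greedy '.*?' stops at the shortest stem that leaves a valid suffix.
lazy = re.findall(r'^(.*?)' + SUFFIXES + r'$', 'processes')

# With a required suffix, a word that has none does not match at all.
required = re.findall(r'^(.*?)' + SUFFIXES + r'$', 'language')

# Making the suffix group optional lets such a word through unchanged.
optional = re.findall(r'^(.*?)' + SUFFIXES + r'?$', 'language')

print(greedy)    # [('processe', 's')]
print(lazy)      # [('process', 'es')]
print(required)  # []
print(optional)  # [('language', '')]
```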
Next, wrap this in a function and test it on a longer text.
>>> import nltk
>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond',
'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of',
'government.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
The result is far from perfect. As the textbook points out, the 's' that belongs to "is" and "basis" gets stripped as well, leaving the meaningless words "i" and "basi". "lying" is also cut down to "ly". This function clearly needs further improvement.
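One possible direction, which is my own assumption and not from the textbook, is to refuse to strip a suffix when the remaining stem would be too short. The function name stem_min and the min_stem parameter are made up for this sketch. It rescues "is" and "lying" but still mangles "basis", so it is only a partial fix.

```python
import re

SUFFIXES = r'(?:ing|ly|ed|ious|ies|ive|es|s|ment)'

def stem_min(word, min_stem=3):
    """Strip a known suffix only if at least min_stem characters remain.

    min_stem is a hypothetical tuning knob, not something NLTK defines.
    """
    # '.{N,}?' is a non-greedy stem that must be at least N characters long.
    pattern = r'^(.{%d,}?)%s?$' % (min_stem, SUFFIXES)
    match = re.match(pattern, word)
    # Words shorter than min_stem do not match and are returned unchanged.
    return match.group(1) if match else word

for w in ['is', 'lying', 'ponds', 'basis']:
    print(w, '->', stem_min(w))
# is -> is
# lying -> lying
# ponds -> pond
# basis -> basi
```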
>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
I had no time to try nltk.re_show(p, s) and nltk.app.nemo()...
>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*><and><other><\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
>>> hobbies_learned.findall(r"<as><\w*><as><\w*>")
as accurately as possible; as well as the; as faithfully as possible;
.... obtainable; as well as the; as important as the; as long as the;
as satisfactory as those
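To get a feel for what the angle-bracket patterns above mean, here is a rough plain-Python approximation of token-level searching. It is a simplified sketch under my own assumptions, not NLTK's actual code: the name token_findall is made up, it returns lists instead of printing, and it handles at most one pair of capturing parentheses.

```python
import re

def token_findall(tokens, pattern):
    """Approximate token-level regex search over a list of tokens."""
    # Join tokens as "<tok1><tok2>..." so token boundaries are visible to re.
    raw = ''.join('<' + t + '>' for t in tokens)
    # Whitespace in the pattern is ignored.
    pattern = re.sub(r'\s', '', pattern)
    # Treat each <...> as one non-capturing unit so {3,} etc. apply to tokens.
    pattern = pattern.replace('<', '(?:<(?:').replace('>', ')>)')
    # Keep an unescaped '.' from running across a token boundary.
    pattern = re.sub(r'(?<!\\)\.', '[^>]', pattern)
    hits = re.findall(pattern, raw)
    # Strip the outer brackets and split each hit back into tokens.
    return [h[1:-1].split('><') for h in hits]

tokens = 'a wise man and a brave man met a dog'.split()
print(token_findall(tokens, '<a>(<.*>)<man>'))
# [['wise'], ['brave']]
print(token_findall(['lol', 'lol', 'lol', 'ok'], '<l.*>{3,}'))
# [['lol', 'lol', 'lol']]
```

As with moby.findall(r"<a>(<.*>)<man>"), the parentheses select which part of the match is reported, so only the word between "a" and "man" comes back.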