Find stem / Searching tokenized text (3.5.3-3.5.4)

Let's continue with Chapter 3.5.3 of the whale book.

Here is one example of stemming: strip the first suffix in the list that the word ends with.

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
... 
>>> stem('various')
'var'

Strangely, at first I could not get the same results as the textbook for the regular-expression version of this example.

>>> import re
>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
['']
>>> re.findall(r'(ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
['ing', '']
>>> re.findall('^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
['processing']
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment|)$', 'processing')
[('processing', '')]

Finally, I found the reason.

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]

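I had mistakenly inserted an unnecessary '|' after "ment". That trailing '|' adds an empty alternative, so the greedy '.*' consumes the whole word and the suffix group matches the empty string.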

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'processes')
[('process', 'es')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]

Wrap this in a function and test it on a longer text.

>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
... 
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

The result is far from perfect. As the textbook mentions, the necessary 's' is removed from "basis", leaving the meaningless word "basi"; likewise "is" becomes "i" and "lying" becomes "ly". This function clearly needs further improvement.
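
One possible direction, as a rough sketch only: refuse to strip a suffix when the remaining stem would be very short, and keep a small list of words to leave alone. The function name stem_guarded, the min_stem threshold and the PROTECTED set are all my own assumptions, not something from the textbook, and this is not a substitute for a real stemmer.

```python
import re

SUFFIXES = r'(ing|ly|ed|ious|ies|ive|es|s|ment)'

# Hypothetical exception list: words that should never be stemmed.
PROTECTED = {'basis'}

def stem_guarded(word, min_stem=3):
    """Strip a suffix only if the remaining stem has at least min_stem characters."""
    if word.lower() in PROTECTED:
        return word
    match = re.match(r'^(.*?)' + SUFFIXES + r'?$', word)
    stem, suffix = match.group(1), match.group(2)
    # suffix is None when no suffix matched; a too-short stem keeps the word intact.
    if suffix and len(stem) < min_stem:
        return word
    return stem

print([stem_guarded(w) for w in ['lying', 'is', 'basis', 'ponds', 'masses']])
# ['lying', 'is', 'basis', 'pond', 'mass']
```

The length check takes care of "is" and "lying", but "basis" still needs the exception list, which shows how quickly hand-written rules pile up.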

Next, Chapter 3.5.4: searching tokenized text.

>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
>>>
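
To get a feel for how the angle-bracket patterns behave, here is a rough emulation using only the re module on a tiny token list. It is a sketch of the idea (join the tokens as "<tok><tok>..." and rewrite the pattern so that each <...> covers exactly one token), not NLTK's actual code. The function name find_token_phrases and the sample tokens are my own, and tokens containing '<' or '>' are not handled.

```python
import re

def find_token_phrases(tokens, pattern):
    """Return matched phrases (as token lists) for an angle-bracket token pattern."""
    # Join tokens as "<tok1><tok2>..." so one regex can span several tokens.
    joined = ''.join('<' + t + '>' for t in tokens)
    # Wrap each <...> in a non-capturing group so quantifiers like {3,}
    # apply to whole tokens.
    pattern = pattern.replace('<', '(?:<(?:').replace('>', ')>)')
    # Keep an unescaped '.' from running past the end of a token.
    pattern = re.sub(r'(?<!\\)\.', '[^>]', pattern)
    hits = []
    for m in re.finditer(pattern, joined):
        # If the pattern has parentheses, report only that part of the match.
        text = m.group(1) if m.re.groups else m.group(0)
        hits.append(text[1:-1].split('><'))
    return hits

tokens = ['a', 'white', 'man', 'saw', 'a', 'brave', 'man']
print(find_token_phrases(tokens, r'<a>(<.*>)<man>'))
# [['white'], ['brave']]

chat_tokens = ['lol', 'lol', 'lol', 'hi', 'la', 'la']
print(find_token_phrases(chat_tokens, r'<l.*>{3,}'))
# [['lol', 'lol', 'lol']]
```

As in the Moby Dick search above, the parentheses select which part of the match is reported, and without parentheses the whole matched phrase comes back.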

I had no time to try nltk.re_show(p, s) and nltk.app.nemo()...

>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*><and><other><\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
>>>
>>> hobbies_learned.findall(r"<as><\w*><as><\w*>")
as accurately as possible; as well as the; as faithfully as possible;
....
obtainable; as well as the; as important as the; as long as the; as
satisfactory as those