Stemming and lemmatization - Deutschina's Tech Diary

Stemming is a technique for removing affixes from a word, ending up with the stem.
I don't know the exact meaning of the words "affix" and "stem", but there is an example in the textbook: the stem of "cooking" is "cook", and "ing" is the suffix.
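
To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. The suffix list and the strip_suffix helper are my own assumptions for illustration only; real stemmers such as Porter apply many more rules and conditions.

```python
# Toy suffix stripper: an illustration only, not a real stemming algorithm.
SUFFIXES = ("ing", "s")  # hypothetical, deliberately tiny suffix list


def strip_suffix(word, min_stem_len=3):
    """Remove the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word  # no suffix matched, so the word is returned unchanged


for word in ("cooking", "books", "feet"):
    print(word, "->", strip_suffix(word))
```

The min_stem_len guard is there so that short words such as "sing" are not cut down to a single letter.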

The Porter stemming algorithm is one of the most common stemming algorithms.

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'
>>> stemmer.stem('kicker')
'kicker'
>>> stemmer.stem('books')
'book'
>>> stemmer.stem('said')
'said'
>>> stemmer.stem('feet')
'feet'

Some words were not stemmed as I expected. It seems this works only for simple cases, such as removing 'ing' or 's'.

Another one is LancasterStemmer.

>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('feet')
'feet'
>>> stemmer.stem('books')
'book'
>>> stemmer.stem('brought')
'brought'

Only this one behaved differently from the textbook.

>>> stemmer.stem('ingleside')
'inglesid'

SnowballStemmer supports 13 non-English languages.

>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

This example uses the Spanish stemmer.

>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
u'hol'

The return value is a Unicode string, as shown by the "u" prefix. (This is Python 2 behavior; in Python 3 every string is Unicode and no prefix is printed.)
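
To see the three stemmers side by side, a small script like the following might help. It is only a sketch: the stemmers dictionary and the stem_all helper are names I made up, and it assumes NLTK is installed. The stemmers themselves do not need any downloaded corpus data.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer; 'english' is one of SnowballStemmer.languages.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}


def stem_all(word):
    """Return a dict mapping each stemmer name to its stem for one word."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


for word in ("cooking", "cookery", "books"):
    print(word, stem_all(word))
```

Running this on a few words of your own is probably the quickest way to find where the algorithms disagree.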

Lemmatizing

Lemmatization is similar to synonym replacement. A lemma is a root word, as opposed to a root stem, so the result is always a valid word.

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
>>> lemmatizer.lemmatize('cookbooks')
'cookbook'
>>> lemmatizer.lemmatize('brought', pos='v')
'bring'
>>> lemmatizer.lemmatize('brought')
'brought'

The WordNetLemmatizer refers to the WordNet corpus and uses the morphy() function of the WordNetCorpusReader.
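
As far as I understand it, this kind of lookup roughly works by checking a table of irregular forms and then trying suffix rules, accepting a candidate only if it is a known word for the given part of speech. The sketch below illustrates that idea in plain Python. It is not NLTK's code: VOCAB, EXCEPTIONS, RULES and toy_lemmatize are all made-up names with tiny sample data.

```python
# Toy illustration of rule-plus-dictionary lemmatization; not NLTK's code.
# The vocabulary, exception table and suffix rules are made-up samples.
VOCAB = {"n": {"cooking", "cookbook", "bus"}, "v": {"cook", "bring"}}
EXCEPTIONS = {"v": {"brought": "bring"}}
RULES = {
    "n": [("ses", "s"), ("s", "")],
    "v": [("ing", ""), ("s", "")],
}


def toy_lemmatize(word, pos="n"):
    # 1. Irregular forms come straight from the exception table.
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    # 2. A word already known for this part of speech is its own lemma.
    if word in VOCAB[pos]:
        return word
    # 3. Try each suffix rule; accept a candidate only if it is a known word.
    for old, new in RULES[pos]:
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            if candidate in VOCAB[pos]:
                return candidate
    return word  # nothing matched, so return the input unchanged


for word, pos in (("cooking", "n"), ("cooking", "v"), ("brought", "v")):
    print(word, pos, "->", toy_lemmatize(word, pos))
```

This also shows why the pos argument matters: "cooking" is itself a known noun, so it only becomes "cook" when it is treated as a verb.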

This is a comparison between stemming and lemmatizing.

>>> stemmer = PorterStemmer()
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'
>>> stemmer.stem('buses')
'buse'
>>> lemmatizer.lemmatize('buses')
'bus'
>>> stemmer.stem('bus')
'bu'

At this moment, I am still unclear about which cases call for stemming.