What is Tokenize? Part 2

Continuing with tokenization.

word_tokenize does not handle some cases the way I expected. For example:

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("can't")
['ca', "n't"]
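
Out of curiosity, here is the same test sentence I use with the other tokenizers below, just for comparison. I'd expect output along these lines (the contraction split into a stem plus "n't", and the period as its own token):

>>> word_tokenize("Can't is a contraction.")
['Ca', "n't", 'is', 'a', 'contraction', '.']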

My textbook introduces some other tokenizers as well. The first is PunktWordTokenizer:

>>> from nltk.tokenize import PunktWordTokenizer
>>> tokenizer = PunktWordTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'t", 'is', 'a', 'contraction.']

"Can't" was split into "Can" and "'t". Not cool...

Try another one: WordPunctTokenizer.

>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer = WordPunctTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']

Better than the previous one, but "t" is not a word, I think.
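
If I read the NLTK documentation correctly, WordPunctTokenizer basically tokenizes with the regexp \w+|[^\w\s]+, which explains why the apostrophe gets pulled out on its own. A quick sketch with plain re shows the same split:

>>> import re
>>> re.findall(r"\w+|[^\w\s]+", "Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']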

One more: RegexpTokenizer

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r"[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
>>> tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']

I tested two patterns, and "Can't" was processed as one word in both, which feels more natural to me. The difference is that the first pattern dropped the trailing period (.), while the second kept it attached to "contraction.". RegexpTokenizer seems flexible enough to cover such cases, depending on the pattern you give it.
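
As a quick experiment (just a sketch with a pattern I made up, not something from the textbook), adding an alternative for single punctuation characters keeps "Can't" together and still separates the period into its own token:

>>> tokenizer = RegexpTokenizer(r"[\w']+|[^\w\s]")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction', '.']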

I need to get familiar with the details in the future. This must be closely related to "data cleansing"; converting raw data into something meaningful is very important.
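
As a rough sketch of what such cleansing might look like (my own assumption of a typical pipeline, not from the textbook): tokenize with a pattern that keeps contractions, then lowercase everything and drop tokens that contain no letters.

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r"[\w']+")
>>> tokens = tokenizer.tokenize("Can't is a contraction.")
>>> [t.lower() for t in tokens if any(c.isalpha() for c in t)]
["can't", 'is', 'a', 'contraction']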