Regular expressions for tokenizing text (3.7)

My notes on Section 3.7 of the whale book (the NLTK book).

>>> raw = """'When I'M a Duchess, 'she said to herself, (not in a very hopeful
... tone though), 'I wont't have any pepper in my kitchen AT ALL. Soup does very 
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""
>>> re.split(r' ', raw)
["'When", "I'M", 'a', 'Duchess,', "'she", 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful\ntone', 'though),', "'I", "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

This simply splits on spaces, but it is not enough: a newline (\n) is still embedded in tokens such as 'hopeful\ntone'.

The next two patterns give the same result:

>>> re.split(r'[ \t\n]+', raw)
["'When", "I'M", 'a', 'Duchess,', "'she", 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'\s+', raw)
["'When", "I'M", 'a', 'Duchess,', "'she", 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> 
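
Incidentally, plain str.split() with no argument already splits on runs of arbitrary whitespace, so it produces the same tokens with no regex at all (only the first few shown here):

>>> raw.split()[:6]
["'When", "I'M", 'a', 'Duchess,', "'she", 'said']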

As I guessed, '\t' means a tab and '\n' a newline; '\s' is shorthand for the whole class of whitespace characters ([ \t\n\r\f\v]).
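
As a quick check that \s also covers the rarer whitespace characters (carriage return, form feed), here is a small sketch on a made-up string of my own:

>>> re.split(r'\s+', 'one\ttwo\rthree\ffour')
['one', 'two', 'three', 'four']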

In this context \w means the same as [a-zA-Z0-9_] (strictly, Python 3's re is Unicode-aware by default, so \w also matches accented letters; a short demo follows the next example). \W (upper case) matches every character that \w does not.

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'wont', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']
>>> 
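
To see exactly which separators \W+ matches (and therefore what re.split throws away), we can findall them on a small snippet of my own:

>>> re.findall(r'\W+', "it's hot-tempered,")
["'", ' ', '-', ',']

And a short demonstration of the Unicode caveat mentioned above, run under Python 3 (the sample words are my own):

>>> re.findall(r'\w+', 'naïve café')
['naïve', 'café']
>>> re.findall(r'\w+', 'naïve café', re.ASCII)
['na', 've', 'caf']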

With this pattern, 'wont't' (my typo for "won't"...) is split into 'wont' and 't'. The empty strings at the beginning and the end appear because the text starts and ends with non-word characters: re.split emits an empty string whenever a separator sits at a boundary, as this toy example shows.

>>> 'xx'.split('x')
['', '', '']
>>> re.findall(r'\w+', raw)
['When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'wont', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered']

In the example above, some meaningful symbols such as the apostrophe (') and the dash (-) are gone. The reason is that \w covers only letters, digits and the underscore (_).

The next pattern loosens the condition a little: "\w+ or \S\w*". \S matches any non-whitespace character (= [^ \t\n\r\f\v]). With this change, either a run of word characters, or a single non-whitespace character followed by optional word characters, is recognized as one token.

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'she", 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'wont', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

However, some unnatural tokens remain, such as '-Maybe' and '-tempered'. The next pattern also recognizes a word with internal hyphens or apostrophes (word-word, word'word) as a single token.

>>> re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']

Finally, a tokenizer sample using NLTK's regexp_tokenize. Note the non-capturing (?:...) groups: with ordinary capturing groups, recent versions of regexp_tokenize would return the group contents rather than the whole tokens.

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)        # set flag to allow verbose regexps
...     (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*          # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                # ellipsis
...   | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
>>>
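
Since the pattern uses only non-capturing groups, and regexp_tokenize (with its default gaps=False) selects tokens much as re.findall does, the plain re module gives the same tokens; a quick cross-check:

>>> re.findall(pattern, text)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']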