Normalization for text tokenization (3.7)
Chapter 3.7 of the whale book.
>>> raw = """'When I'M a Duchess, 'she said to herself, (not in a very hopeful ... tone though), 'I wont't have any pepper in my kitchen AT ALL. Soup does very ... well without--Maybe it's always pepper that makes people hot-tempered,'...""" >>> re.split(r' ', raw) ["'When", "I'M", 'a', 'Duchess,', "'she", 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful\ntone', 'though),', "'I", "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
This simply splits on the space character, but it is not enough. For example, the newline (\n) remains inside tokens such as 'hopeful\ntone'.
The next two examples give the same result.
>>> re.split(r'[ \t\n]+', raw)
["'When", "I'M", 'a', 'Duchess,', "'she", 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'\s+', raw)
["'When", "I'M", 'a', 'Duchess,', "'she", 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
As you can easily imagine, '\t' means tab and '\n' is a newline. '\s' is already reserved as shorthand for this whole group of whitespace characters.
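A quick check of this equivalence on a small string of my own (not from the book). Note that \s also covers \r, \f, and \v, which the explicit class [ \t\n] omits, though it makes no difference here:

```python
import re

# Splitting on the explicit class and on the \s shorthand yields
# identical tokens for ordinary spaces, tabs, and newlines.
s = "a b\tc\nd"
tokens_explicit = re.split(r'[ \t\n]+', s)
tokens_shorthand = re.split(r'\s+', s)
print(tokens_explicit)   # ['a', 'b', 'c', 'd']
```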
In this context \w has the same meaning as [a-zA-Z0-9_]. \W (uppercase) matches every character except those in \w.
>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'wont', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']
By this, 'wont't' (it's my typo...) was split into 'wont' and 't'. The following example shows why '' appears at the beginning and the end: when the separator matches at the very start or end of the string, split emits an empty field there.
>>> 'xx'.split('x')
['', '', '']
>>> re.findall(r'\w+', raw)
['When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'wont', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered']
In the example above, some meaningful symbols like the apostrophe (') and the hyphen (-) are removed. The reason is that \w matches only letters, digits, and the underscore (_).
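A tiny illustration of my own: because \w excludes the apostrophe, a contraction is split into two tokens.

```python
import re

# \w+ alone cannot keep "it's" together; the apostrophe acts as a boundary.
print(re.findall(r'\w+', "it's"))   # ['it', 's']
```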
This one adjusts the pattern a little. The new pattern is "\w+|\S\w*". "\S" matches any non-whitespace character (= [^ \t\n\r\f\v]). With this change, either a run of word characters, or a single non-whitespace character followed by word characters, is recognized as one token.
>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'she", 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'wont', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']
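Tracing the pattern on a small fragment of my own makes the behavior clearer: \w+ grabs plain word characters first; otherwise \S\w* takes one non-space character plus any following word characters, so a leading apostrophe sticks to the word after it.

```python
import re

# The leading apostrophes attach forward ("'I", "'t"),
# while a bare word like 'wont' is matched by \w+ alone.
print(re.findall(r"\w+|\S\w*", "'I wont't"))   # ["'I", 'wont', "'t"]
```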
However, I can still see some unnatural tokens above, like '-Maybe' and '-tempered'. The next pattern also recognizes words with internal hyphens or apostrophes (word-word, word'word) as single tokens.
>>> print re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "wont't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
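Isolating just the first alternative on a fragment of my own shows what changed: \w+(?:[-']\w+)* keeps internal hyphens and apostrophes inside one token.

```python
import re

# Contractions and hyphenated compounds now survive as single tokens.
print(re.findall(r"\w+(?:[-']\w+)*", "it's always hot-tempered"))
# ["it's", 'always', 'hot-tempered']
```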
A sample with the NLTK regexp tokenizer:
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)      # set flag to allow verbose regexps
...     ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*          # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.              # ellipsis
...   | [][.,;"'?():-_`]    # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
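For comparison, the same pattern can be run with plain re.findall, without NLTK, if the capture groups are made non-capturing with (?:...) -- otherwise findall would return the group contents instead of the whole match. A sketch under that assumption:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'
# Same verbose pattern as above, but with (?:...) groups so that
# re.findall yields whole matches rather than captured subgroups.
pattern = r'''(?x)
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens
'''
tokens = re.findall(pattern, text)
print(tokens)   # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```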