Regular expression (3.4)

Chapter 3.4 in the whale book.

Preparation:

>>> import nltk
>>> import re
>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
>>> 

Find words which end with "ed".

>>> [w for w in wordlist if re.search('ed$', w)]
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', 'abridged', 'abscessed', 
....

'yellowweed', 'yolked', 'younghearted', 'zagged', 'zed', 'zeed', 'zigzagged', 'zonated', 'zoned']
>>> 

A dot (.) is a wildcard that matches any single character.

>>> [w for w in wordlist if re.search('^..j..t..$', w)]
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']
>>> 

This example returns the words that meet the following conditions:

  • 3rd character is "j"
  • 6th character is "t"
  • word length is 8

Without "^" and "$", the conditions loosen as follows:

  • "j" is the 3rd character or later
  • "t" is the 3rd character after "j"
  • at least 2 characters follow "t", so the word length is 8 or more

>>> [w for w in wordlist if re.search('..j..t..', w)]
['abjectedness', 'abjection', 'abjective', 'abjectly', 'abjectness', 'adjection', 'adjectional', 'adjectival', 
....

'unsubjectable', 'unsubjected', 'unsubjectedness', 'unsubjection', 'unsubjective', 'unsubjectlike']
>>> 

Therefore "^" anchors the match to the beginning of the word, and "$" anchors it to the end of the word.
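
The difference between the anchored and unanchored patterns can be checked on a small list. This is only a sketch: the list "sample" is a hypothetical handful of words chosen for illustration, not the NLTK word list.

```python
import re

# Hypothetical mini word list (an assumption, not the NLTK corpus).
sample = ['abjectly', 'abjection', 'majestic', 'subjective', 'jetty']

# Anchored: the whole word must be exactly 8 characters,
# with "j" in 3rd position and "t" in 6th position.
anchored = [w for w in sample if re.search('^..j..t..$', w)]

# Unanchored: the same 8-character shape may appear anywhere in the word.
unanchored = [w for w in sample if re.search('..j..t..', w)]

print(anchored)    # ['abjectly', 'majestic']
print(unanchored)  # ['abjectly', 'abjection', 'majestic', 'subjective']
```

Every word matched by the anchored pattern is also matched by the unanchored one, but not the other way around.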

>>> sum(1 for w in text if re.search('^e-?mail$', w))
0

The question mark (?) makes the preceding character (in this case, "-") optional, so the pattern matches both "e-mail" and "email". The count of 0 means that neither form occurs in text.
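
To see the optional character at work on data that does contain matches, here is a minimal sketch. The list "tokens" is a made-up example, not one of the NLTK texts.

```python
import re

# Hypothetical tokens for illustration only.
tokens = ['email', 'e-mail', 'e--mail', 'emails', 'mail']

# "-?" allows zero or one hyphen between "e" and "mail".
matches = [w for w in tokens if re.search('^e-?mail$', w)]
count = sum(1 for w in tokens if re.search('^e-?mail$', w))

print(matches)  # ['email', 'e-mail']
print(count)    # 2
```

"e--mail" is rejected because "?" permits at most one occurrence, and "emails" and "mail" are rejected by the anchors.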

How about this example?

>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']
>>> 

The pattern matches words that meet these conditions:

  • 4 characters
  • 1st char is g, h or i
  • 2nd char is m, n or o
  • 3rd char is j, l or k
  • 4th char is d, e or f
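
The conditions above can be tried on a short list. This is a sketch under the assumption that "sample" (a hypothetical list) stands in for the real word list.

```python
import re

# Hypothetical words: four that fit the pattern and three that do not.
sample = ['gold', 'golf', 'hold', 'hole', 'good', 'golden', 'idle']

# Each [...] matches exactly one character out of the listed set.
pattern = '^[ghi][mno][jlk][def]$'
matches = [w for w in sample if re.search(pattern, w)]

print(matches)  # ['gold', 'golf', 'hold', 'hole']
```

"good" fails on the 3rd character, "golden" is too long, and "idle" fails on the 2nd character.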

Does "+" remove the length restriction?

>>> [w for w in wordlist if re.search('^[g-o]+$', w)]
['g', 'ghoom', 'gig', 'giggling', 'gigolo', 'gilim', 'gill', 'gilling', 'gilo', 'gim', 'gin', 'ging', 'gingili', 'gink', 'ginkgo', 'ginning', 'gio', 'glink', 'glom', 'glonoin', 'gloom', 'glooming', 'gnomon', 'go', 'gog', 'gogo', 'goi', 'going', 'gol', 'goli', 'gon', 'gong', 'gonion', 'goo', 'googol', 'gook', 'gool', 'goon', 'h', 'hi', 'high', 'hill', 'him', 'hin', 'hing', 'hinoki', 'ho', 'hog', 'hoggin', 'hogling', 'hoi', 'hoin', 'holing', 'holl', 'hollin', 'hollo', 'hollong', 'holm', 'homo', 'homologon', 'hong', 'honk', 'hook', 'hoon', 'i', 'igloo', 'ihi', 'ilk', 'ill', 'imi', 'imino', 'immi', 'in', 'ing', 'ingoing', 'inion', 'ink', 'inkling', 'inlook', 'inn', 'inning', 'io', 'ion', 'j', 'jhool', 'jig', 'jing', 'jingling', 'jingo', 'jinjili', 'jink', 'jinn', 'jinni', 'jo', 'jog', 'johnin', 'join', 'joining', 'joll', 'joom', 'k', 'kiki', 'kil', 'kilhig', 'kilim', 'kill', 'killing', 'kiln', 'kilo', 'kim', 'kimono', 'kin', 'king', 'kingling', 'kink', 'kino', 'klom', 'knoll', 'ko', 'kohl', 'koi', 'koil', 'koilon', 'koinon', 'kokil', 'kokio', 'koko', 'kokoon', 'kolo', 'kolokolo', 'kon', 'kongoni', 'konini', 'l', 'li', 'lignin', 'liin', 'likin', 'liking', 'liknon', 'lill', 'lim', 'liming', 'limn', 'limonin', 'lin', 'ling', 'lingo', 'linin', 'lining', 'link', 'linking', 'linn', 'lino', 'linolin', 'linon', 'lion', 'lo', 'log', 'loggin', 'logging', 'login', 'logion', 'logoi', 'loin', 'loll', 'long', 'longing', 'loo', 'look', 'looking', 'loom', 'looming', 'loon', 'm', 'mho', 'mi', 'mig', 'miglio', 'mignon', 'mijl', 'mil', 'milk', 'milking', 'mill', 'milling', 'million', 'milo', 'mim', 'min', 'ming', 'minikin', 'minim', 'mining', 'minion', 'mink', 'minning', 'mino', 'mo', 'mog', 'mogo', 'moho', 'moil', 'moiling', 'moio', 'mojo', 'moki', 'moko', 'momo', 'mon', 'mong', 'monk', 'mono', 'moo', 'mooing', 'mool', 'moon', 'mooning', 'n', 'ni', 'nig', 'niggling', 'nigh', 'nil', 'nim', 'ninon', 'niog', 'no', 'nog', 'noggin', 'nogging', 'noil', 'noll', 'nolo', 'non', 'nonillion', 
'nonion', 'nook', 'nooking', 'noon', 'nooning', 'o', 'oh', 'ohm', 'oho', 'oii', 'oil', 'oki', 'olio', 'olm', 'om', 'on', 'ongoing', 'onion', 'onlook', 'onlooking', 'oolong']
>>> 
>>> [w for w in wordlist if re.search('^[m-o]+$', w)]
['m', 'mo', 'momo', 'mon', 'mono', 'moo', 'moon', 'n', 'no', 'non', 'noon', 'o', 'om', 'on']
>>> [w for w in wordlist if re.search('^[g-i]+$', w)]
['g', 'gig', 'h', 'hi', 'high', 'i', 'ihi']
>>> [w for w in wordlist if re.search('^[j-l]+$', w)]
['j', 'k', 'l']

Another example of "+":

>>> chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
>>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]
['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
>>> [w for w in chat_words if re.search('^[ha]+$', w)]
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa', 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', 'hahahahaaa', 'hahahahahaha', 'hahahahahahaha', 'hahahahahahahahahahahahahahahaha', 'hahahhahah', 'hahhahahaha']
>>> 

The answer is that "+" means one or more repetitions of the preceding item, so the word length is no longer fixed. In the first example above, the characters must appear in the order m, i, n, e, and each of them may be repeated. In the second one, [ha]+ accepts any sequence made up only of "h" and "a".
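
A compact way to check both uses of "+" is the sketch below. The list "sample" is hypothetical and is not taken from the NPS chat corpus.

```python
import re

# Hypothetical chat-like tokens for illustration only.
sample = ['mine', 'miiine', 'mmiinnee', 'mne', 'mines', 'haha', 'ahh', 'hat']

# Each letter must appear at least once, in the order m, i, n, e.
mine_like = [w for w in sample if re.search('^m+i+n+e+$', w)]

# One or more characters, each of which is "h" or "a", in any order.
ha_like = [w for w in sample if re.search('^[ha]+$', w)]

print(mine_like)  # ['mine', 'miiine', 'mmiinnee']
print(ha_like)    # ['haha', 'ahh']
```

"mne" is rejected because "i+" requires at least one "i", which is the difference between "one or more" and "zero or more".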

Go through some other examples.

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60',
...
'96.4', '98.3', '99.1', '99.3']

This extracts decimal numbers. The backslash (\) is the escape character: "\." matches a literal dot (the decimal point) instead of acting as a wildcard.

>>> [w for w in wsj if re.search(r'^[A-Z]+\$$', w)]
['C$', 'US$']

This matches tokens made of one or more uppercase letters followed by '$'. The dollar sign has to be written as "\$" because a bare '$' is a special character that marks the end of the word.
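
The effect of escaping can be seen by comparing an escaped and an unescaped dot. This is a sketch on a hypothetical list "sample", not on the treebank tokens. The patterns are written as raw strings (r'...'), which keeps Python itself from interpreting the backslash before the regex engine sees it.

```python
import re

# Hypothetical tokens for illustration only.
sample = ['0.5', '96.4', '1234', '3x7', 'US$', 'C$', 'USD', '.5']

# Escaped dot: only a literal "." is accepted between the digits.
decimals = [w for w in sample if re.search(r'^[0-9]+\.[0-9]+$', w)]

# Unescaped dot: any single character is accepted between the digits.
loose = [w for w in sample if re.search(r'^[0-9]+.[0-9]+$', w)]

# "\$" is a literal dollar sign; the final "$" is the end-of-word anchor.
currency = [w for w in sample if re.search(r'^[A-Z]+\$$', w)]

print(decimals)  # ['0.5', '96.4']
print(loose)     # ['0.5', '96.4', '1234', '3x7']
print(currency)  # ['US$', 'C$']
```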

>>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956', '1961', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1975', '1976', '1977', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2005', '2009', '2017', '2019', '2029', '3057', '8300']

{n} fixes the number of repetitions to exactly n. This example finds 4-digit numbers.

>>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year', '300-day', '36-day', '36-store', '42-year', '50-state', '500-stock', '52-week', '69-point', '84-month', '87-store', '90-day']

This matches one or more digits, then '-', then a lowercase word of 3 to 5 letters.

>>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', 'savings-and-loan']

Similar to the previous one, but with a length limit on each part. Three lowercase words are joined with '-': the first has 5 or more letters, the second has 2 or 3, and the third has at most 6 (possibly none, since {,6} allows zero).
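
The four forms of the braces quantifier ({n}, {m,n}, {m,} and {,n}) can be compared side by side. This is a sketch on a hypothetical list "sample". The token 'mother-in-' is artificial and is included only to show that {,6} also accepts zero characters.

```python
import re

# Hypothetical tokens for illustration only.
sample = ['1999', '123', '12345', '10-day', '100-share', '5-minutes',
          'father-in-law', 'black-and-white', 'up-to-date', 'mother-in-']

# {4}: exactly four digits.
four_digit = [w for w in sample if re.search('^[0-9]{4}$', w)]

# {3,5}: between three and five lowercase letters after the hyphen.
num_word = [w for w in sample if re.search('^[0-9]+-[a-z]{3,5}$', w)]

# {5,}: five or more; {2,3}: two or three; {,6}: zero to six.
three_part = [w for w in sample
              if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

print(four_digit)  # ['1999']
print(num_word)    # ['10-day', '100-share']
print(three_part)  # ['father-in-law', 'black-and-white', 'mother-in-']
```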

>>> [w for w in wsj if re.search('(ed|ing)$', w)]
['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', 'Alfred', 'Allied', 'Annualized',
....
'yielded', 'yielding', 'yttrium-containing', 'zoomed']
>>> 

These are the words that end with "ed" or "ing". What happens if the parentheses are removed from (ed|ing)$?

>>> len([w for w in wsj if re.search('ed|ing$', w)])
1969
>>> len([w for w in wsj if re.search('(ed|ing)$', w)])
1841
>>> wob = [w for w in wsj if re.search('ed|ing$', w)]
>>> wib = [w for w in wsj if re.search('(ed|ing)$', w)]
>>> [w for w in wob if w not in wib]
['Biedermann', 'Breeden', 'Cathedral', 'Cedric', 'Confederation', 'Credit', 'Federal', 'Federalist', 'Federation', 'Freddie', 'Frederick', 'Friedrichs', 'Impediments', 'Intermediate', 'Kennedy', 'Media', 'Medical', 'Medicine', 'Mercedes', 'Montedison', 'Nederlanden', 'Needham', 'Proceeds', 'Reddington', 'Redevelopment', 'Roederer', 'Speedway', 'Sweden', 'Teddy', 'Toledo', 'Wednesday', 'Wedtech', 'acknowledge', 'acknowledges', 'agreed-upon', 'allegedly', 'beds', 'buttoned-down', 'closed-end', 'comedies', 'concede', 'concedes', 'credentials', 'credibility', 'credit', 'creditor', 'creditors', 'credits', 'creditworthiness', 'deeds', 'discredit', 'edition', 'editions', 'editor', 'editorial', 'editorially', 'editors', 'education', 'educational', 'educators', 'exceedingly', 'exceeds', 'federal', 'federally', 'feeds', 'fixed-income', 'fixed-price', 'fixed-rate', 'freedom', 'freedoms', 'greedy', 'hundreds', 'immediate', 'immediately', 'impede', 'incredible', 'ingredients', 'intermediate', 'knowledge', 'knowledgeable', 'limited-partnership', 'medallions', 'media', 'medical', 'medicine', 'mediocre', 'needle-like', 'needs', 'needy', 'obedient', 'pediatrician', 'pianist-comedian', 'precedent', 'precedes', 'predecessor', 'predict', 'predictable', 'predictably', 'predicts', 'predispose', 'procedural', 'procedure', 'procedures', 'proceedings', 'proceeds', 'recede', 'red-and-white', 'red-carpet', 'red-flag', 'redeem', 'redemption', 'redeploy', 'redistribute', 'reds', 'reduce', 'reduction', 'reductions', 'repeatedly', 'reportedly', 'schedule', 'secede', 'seduce', 'single-handedly', 'speedway', 'staff-reduction', 'succeeds', 'supposedly', 'weddings']
>>> 

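Without the parentheses, "$" applies only to 'ing', so the pattern reads as "'ed' anywhere in the word, or 'ing' at the end of the word". That is why words that contain 'ed' in the middle are also selected. Every extra word in the list above contains 'ed'.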
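
How the parentheses change the scope of "|" can be checked on a small list. This is a sketch on a hypothetical list "sample", not on the treebank tokens.

```python
import re

# Hypothetical tokens for illustration only.
sample = ['walked', 'walking', 'editor', 'federal', 'ingot', 'kingdom', 'sing']

# Without parentheses: "ed" anywhere, OR "ing" at the end of the word.
without_group = [w for w in sample if re.search('ed|ing$', w)]

# With parentheses: "ed" or "ing", and in both cases at the end of the word.
with_group = [w for w in sample if re.search('(ed|ing)$', w)]

# Words picked up only by the ungrouped pattern.
extra = [w for w in without_group if w not in with_group]

print(without_group)  # ['walked', 'walking', 'editor', 'federal', 'sing']
print(with_group)     # ['walked', 'walking', 'sing']
print(extra)          # ['editor', 'federal']
```

"ingot" and "kingdom" are matched by neither pattern: 'ing' in the middle of a word does not count, because "$" still applies to the 'ing' branch.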