Using regular expressions (3.5.1 - 3.5.2)

Chapter 3.5 of the whale book.

Find every vowel (a, e, i, o, u) in a word and count them.

>>> import re
>>> import nltk
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16

Next, find every sequence of two or more vowels in the word list and count how often each combination occurs, using FreqDist.

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253), ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95), ('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ('oe', 15), ('iu', 14), ('ae', 11), ('eau', 10), ('uo', 8), ('ao', 6), ('oui', 6), ('eou', 5), ('uou', 5), ('uee', 4), ('aa', 3), ('ieu', 3), ('uie', 3), ('eei', 2), ('aia', 1), ('aii', 1), ('aiia', 1), ('eea', 1), ('iai', 1), ('iao', 1), ('ioa', 1), ('oei', 1), ('ooi', 1), ('ueui', 1), ('uu', 1)]
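
The {m,n} quantifier works the same way on digits. Here it pulls the numeric fields out of a date string:
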
>>> [int(n) for n in re.findall('[0-9]{2,4}', '2009-12-31')]
[2009, 12, 31]
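
The FreqDist count of vowel sequences above can be tried without the corpus. This is only a minimal sketch: it swaps in collections.Counter from the standard library for nltk.FreqDist, and the word list is a made-up sample, not the Treebank vocabulary.

```python
import re
from collections import Counter

# Hypothetical sample words; the post uses the Treebank vocabulary instead.
words = ['beautiful', 'quiet', 'queue', 'ratio', 'nation', 'tree']

# Tally every run of two or more vowels found in any word.
counts = Counter(vs for w in words for vs in re.findall(r'[aeiou]{2,}', w))

# Most frequent combinations first.
for vowels, n in counts.most_common():
    print(vowels, n)
```

With this sample, 'io' is counted twice (ratio, nation) and every other sequence once.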

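Next example: compress words by dropping the vowels inside them. The regexp keeps a vowel sequence at the start of a word, a vowel sequence at the end of a word, and every character that is not a vowel.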

>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ' '.join(pieces)
... 
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress (w) for w in english_udhr[:75])
U n v r s l D c l r t n o f H m n R g h t s P r m b l e W h r s r c g
n t n o f t h e i n h r n t d g n t y a n d o f t h e e q l a n d i n
l n b l e r g h t s o f a l l m m b r s o f t h e h m n f m l y i s t
h e f n d t n o f f r d m , j s t c e a n d p c e i n t h e w r l d ,
W h r s d s r g r d a n d c n t m p t f r h m n r g h t s h v e r s l
t d i n b r b r s a c t s w h c h h v e ou t r g d t h e c n s c n c e
o f m n k n d , a n d t h e a d v n t o f a w r l d i n w h c h h m n
b n g s s h l l e n j y f r d m o f s p c h a n d

I made a mistake in compress(): the pieces were joined with a space (' ') instead of an empty string (''). Here is the corrected version.

>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
... 
>>> print nltk.tokenwrap(compress (w) for w in english_udhr[:75])
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
>>> 
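
For anyone without the udhr corpus installed, the corrected compress() can be checked on a short phrase. This is a small sketch in Python 3 syntax; the sample phrase is just the first five tokens of the output above.

```python
import re

# Keep a leading vowel run, a trailing vowel run, and every non-vowel character.
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    # findall returns the kept pieces in order; join them with no separator.
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

sample = 'Universal Declaration of Human Rights'.split()
print(' '.join(compress(w) for w in sample))
# prints: Unvrsl Dclrtn of Hmn Rghts
```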

It was still not clear to me why '^' appears inside [] in the third alternative of the regexp, so I tried removing it.

>>> regexp2 = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[AEIOUaeiou]'
>>> def compress2(word):
...     pieces = re.findall(regexp2, word)
...     return ''.join(pieces)
... 
>>> print nltk.tokenwrap(compress2 (w) for w in english_udhr[:75])
Uiea eaaio o ua i eae eea eoiio o e iee ii a o e eua a iaieae i o a ee
o e ua ai i e ouaio o eeo  uie a eae i e o  eea iea a oe o ua i ae eue
i aaou a i ae ouae e oiee o ai  a e ae o a o i i ua ei a eo eeo o ee a
>>> 

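Now only the vowels are extracted. So '^' has a second meaning: as the first character inside [], it negates the character class, so [^AEIOUaeiou] matches any single character that is not a vowel. Outside [], '^' anchors the match to the start of the string.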
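
The two roles of '^' can be compared side by side on a single word. This is a small illustrative sketch; the variable names are my own.

```python
import re

word = 'outraged'

# Outside brackets, ^ anchors the match to the start of the string.
anchored = re.findall(r'^[aeiou]+', word)
print(anchored)   # ['ou']

# As the first character inside brackets, ^ negates the class.
negated = re.findall(r'[^aeiou]', word)
print(negated)    # ['t', 'r', 'g', 'd']

# Without ^, the class matches the vowels themselves.
plain = re.findall(r'[aeiou]', word)
print(plain)      # ['o', 'u', 'a', 'e']
```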

ConditionalFreqDist for the Rotokas language, counting how often each consonant is followed by each vowel:

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49
>>> 
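
The same kind of table can be sketched with the standard library. This assumes that ConditionalFreqDist treats each two-character match as a (consonant, vowel) pair, which is what the tabulated output suggests. The three words are a tiny sample of Rotokas entries, not the full lexicon, so the counts differ from the table above.

```python
import re
from collections import Counter, defaultdict

# Tiny sample of Rotokas words; the real lexicon is much larger.
words = ['kasuari', 'kaapo', 'kepoto']

# Outer key: consonant (the condition). Inner Counter: vowels that follow it.
table = defaultdict(Counter)
for w in words:
    for consonant, vowel in re.findall(r'[ptksvr][aeiou]', w):
        table[consonant][vowel] += 1

for consonant in sorted(table):
    print(consonant, dict(table[consonant]))
```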

Building an index from each consonant-vowel pair to the words that contain it:

>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...     for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari', 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa', 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo', 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']
>>> cv_index['so']
['kaekaesoto', 'kekesopa']
>>> 
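
Such an index can be approximated with a plain dictionary of lists. This sketch assumes nltk.Index behaves like one, and it uses three sample words rather than the whole lexicon.

```python
import re
from collections import defaultdict

# Three sample words; the post indexes the whole Rotokas lexicon.
words = ['kasuari', 'kaapo', 'kekesopa']

# Map each consonant-vowel pair to the words in which it occurs.
cv_index = defaultdict(list)
for w in words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        cv_index[cv].append(w)

print(cv_index['su'])  # ['kasuari']
print(cv_index['ke'])  # ['kekesopa', 'kekesopa']
```

A word is listed once per occurrence of the pair, which would explain why words such as 'kapokapo' appear twice under 'po' above.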

Judging from these words, every consonant is followed by a vowel. In that respect, the pronunciation might be similar to my native language, Japanese.