Using regular expressions (3.5.1 - 3.5.2)
Chapter 3.5 of the whale book.
Find all the vowels (a, e, i, o, u) in a word and count them.
>>> import re
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16
This one finds sequences of two or more vowels in the word list and counts each combination with FreqDist.
>>> import nltk
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
 ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95),
 ('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18),
 ('oe', 15), ('iu', 14), ('ae', 11), ('eau', 10), ('uo', 8), ('ao', 6),
 ('oui', 6), ('eou', 5), ('uou', 5), ('uee', 4), ('aa', 3), ('ieu', 3),
 ('uie', 3), ('eei', 2), ('aia', 1), ('aii', 1), ('aiia', 1), ('eea', 1),
 ('iai', 1), ('iao', 1), ('ioa', 1), ('oei', 1), ('ooi', 1), ('ueui', 1),
 ('uu', 1)]
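The same counting idea can be tried without downloading the treebank corpus. This is only a minimal sketch: it uses collections.Counter from the standard library in place of FreqDist, and the word list is a made-up stand-in for the real vocabulary.

```python
import re
from collections import Counter

# Hypothetical mini word list standing in for the treebank vocabulary.
words = ['nation', 'station', 'read', 'bead', 'queen', 'audio']

# Count every run of two or more vowels found in any word.
counts = Counter(vs for w in words
                 for vs in re.findall(r'[aeiou]{2,}', w))

print(counts.most_common(2))   # [('io', 3), ('ea', 2)]
```

Note that 'queen' contributes 'uee' as one run rather than 'ue' and 'ee', because {2,} is greedy and takes the longest run of vowels it can.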
A side exercise: convert a date string into a list of integers.
>>> [int(n) for n in re.findall('[0-9]{2,4}', '2009-12-31')]
[2009, 12, 31]
Next example: compressing words by dropping word-internal vowels.
>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ' '.join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])
U n v r s l D c l r t n o f H m n R g h t s P r m b l e W h r s
r c g n t n o f t h e i n h r n t d g n t y a n d o f t h e e q l
a n d i n l n b l e r g h t s o f a l l m m b r s o f t h e h m n
f m l y i s t h e f n d t n o f f r d m , j s t c e a n d p c e i n
t h e w r l d , W h r s d s r g r d a n d c n t m p t f r h m n
r g h t s h v e r s l t d i n b r b r s a c t s w h c h h v e
ou t r g d t h e c n s c n c e o f m n k n d , a n d t h e
a d v n t o f a w r l d i n w h c h h m n b n g s s h l l e n j y
f r d m o f s p c h a n d
I made a mistake in compress(): I joined the pieces with ' ' (a space) instead of '' (an empty string). Here is the corrected version.
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
...
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
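To see what findall actually returns before the pieces are joined, it may help to look at single words. A small sketch using only the re module; the helper name pieces() is mine, and the sample words are the originals behind some of the compressed tokens above.

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def pieces(word):
    """Return the list of substrings that compress() would join."""
    return re.findall(regexp, word)

for w in ['Universal', 'the', 'equal', 'outraged']:
    # Show the raw pieces next to the joined result.
    print('%s -> %r -> %s' % (w, pieces(w), ''.join(pieces(w))))
```

For 'outraged' the first piece is 'ou': a run of vowels at the start of the word is kept as one piece, while the vowels in the middle of the word match none of the three alternatives and are skipped.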
I was still not clear on why '^' appears inside [] in the third alternative of regexp, so I tried the following.
>>> regexp2 = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[AEIOUaeiou]'
>>> def compress2(word):
...     pieces = re.findall(regexp2, word)
...     return ''.join(pieces)
...
>>> print nltk.tokenwrap(compress2(w) for w in english_udhr[:75])
Uiea eaaio o ua i eae eea eoiio o e iee ii a o e eua a iaieae i o a ee
o e ua ai i e ouaio o eeo uie a eae i e o eea iea a oe o ua i ae eue i
aaou a i ae ouae e oiee o ai a e ae o a o i i ua ei a eo eeo o ee a
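Now all the vowels are extracted instead. So '^' has a second meaning: as the first character inside [], it negates the character class, so [^AEIOUaeiou] matches any single character that is not a vowel. Outside [], it anchors the match to the start of the string.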
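The two roles of '^' can be compared side by side. A minimal sketch with only the re module; the sample strings are made up for illustration.

```python
import re

word = 'apple'

# Outside brackets, ^ anchors the match to the start of the string.
print(re.findall(r'^[aeiou]', word))    # ['a']

# As the first character inside brackets, ^ negates the class:
# match any single character that is NOT a vowel.
print(re.findall(r'[^aeiou]', word))    # ['p', 'p', 'l']

# Anywhere else inside brackets, ^ is just a literal caret.
print(re.findall(r'[aeiou^]', 'a^b'))   # ['a', '^']
```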
ConditionalFreqDist for the Rotokas language:
>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49
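Each two-character match such as 'ka' is treated as a (condition, sample) pair, so the consonant becomes the row and the vowel the column. A rough sketch of that bookkeeping with the standard library only; the dict of Counters stands in for ConditionalFreqDist, and the four sample words are an assumption in place of the full rotokas.dic list.

```python
import re
from collections import Counter, defaultdict

# A handful of sample words instead of the whole Rotokas dictionary.
words = ['kasuari', 'kapo', 'kepoi', 'karopo']

cvs = [cv for w in words for cv in re.findall(r'[ptksvr][aeiou]', w)]

# One Counter of vowels per consonant.
table = defaultdict(Counter)
for consonant, vowel in cvs:   # a 2-character string unpacks into 2 characters
    table[consonant][vowel] += 1

for consonant in sorted(table):
    print('%s %s' % (consonant, dict(table[consonant])))
```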
Building index...
>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                  for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo',
 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa',
 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari',
 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa',
 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo',
 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']
>>> cv_index['so']
['kaekaesoto', 'kekesopa']
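The index behaves roughly like a dictionary that maps each consonant-vowel pair to the list of words containing it. A minimal sketch of that idea using collections.defaultdict instead of nltk.Index, with four words picked from the output above as sample data.

```python
import re
from collections import defaultdict

# Sample words taken from the index output above.
words = ['kasuari', 'kapo', 'kapokapo', 'kepoi']

cv_index = defaultdict(list)
for w in words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        # The word is appended once per match, not once per word.
        cv_index[cv].append(w)

print(cv_index['su'])   # ['kasuari']
print(cv_index['po'])   # ['kapo', 'kapokapo', 'kapokapo', 'kepoi']
```

Since a word is added once for every match, a word containing 'po' twice appears twice under that key, which may be why 'kapokapo' is listed twice in the real index.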
Judging from these words, a vowel follows every consonant. In that respect, the pronunciation may be similar to that of my native language, Japanese.