Exercise: Chapter 3 (18-21)
18.
>>> import nltk, re
>>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
>>> words = nltk.word_tokenize(text)
>>> wh_words = sorted(set([w for w in words if re.search(r'^wh', w.lower())]))
>>> for word in wh_words:
...     print word
...
WHALE WHALE-FISHERY. WHALE-SHIP WHALE. WHALEBONE WHALEMAN WHALES WHALESHIPS WHALING WHALING. WHARTON WHAT WHEN WHERE WHICH WHIFF WHITE WHOEL Whale Whale's Whale's. Whale-Bones Whale-balls Whale-bone Whale-ship Whale-ships Whale-teeth Whale. Whalebone Whaleman Whalemen Whaler Whales Whales. Whaling Whaling. What What's Whatever Wheelbarrow. Whelped When Whence Whenever Where Where-away Whereas Wherefore Wherein Whereupon Whether Whew Which While Whilst Whirlpooles Whisper White Whitehall Whiteness Whitsuntide Who Who's Who-e Whole Whom Whose Whosoever Why whale whale's whale-boat whale-boat. whale-boats whale-bone whale-books. whale-craft whale-cruisers whale-cry whale-e whale-fastener whale-fish whale-fishers whale-fishery whale-fleet. whale-ground whale-hater whale-hunt whale-hunter whale-hunters whale-hunters. whale-hunting whale-jets whale-killer whale-lance whale-lance. whale-line whale-line. whale-lines whale-lines. whale-naturalists whale-pike whale-pole whale-ports whale-ship whale-ship. whale-ships whale-smitten whale-spades whale-spout whale-steak whale-surgeon whale-trover whale-wise whale. whale.* whaleboat whaleboats whalebone whalebone. whaleboning whaled whaleman whaleman's whaleman. whalemen whalemen's whalemen. whaler whaler. whalers whalers. whales whales. whaleship whaleships whalesmen whaling whaling-craft whaling-fleet whaling-pike whaling-scenes whaling-ships whaling-spade whaling-spades whaling-vessels whaling-voyage whaling. whang wharf wharf. wharves wharves. what what's what. whatever whatsoever whatsoever. wheat wheat. wheel wheel-spokes wheel. wheelbarrow wheeled wheeling wheels wheezing whelm whelmed whelmed. whelmings when whence whencesoe'er whenever where where'er where.
whereas whereat whereby wherefore wherein whereof whereon wheresoe'er whereto whereupon wherever wherewith whether whets whetstone whetstones whew which whichever whiff whiffs while while. whim whimsicalities whimsiness whip whipped whipping whips whirl whirl. whirled whirling whirlpool whirls whirlwinds whisker whiskers whiskey whisper whispered whispering whisperingly whispers whispers. whist-tables whistle whistled whistling whistlingly whit white white-ash white-bearded white-bone white-elephant white-fire white-headed white-horse white-lead white-shrouded white-turbaned whitened whiteness whiteness. whitenesses whites whitest whitewashed whither whitish whitish. whittled whittling whittling. whizzings who who-ee whoever whole whole. wholesome wholly whom whooping whose whosoever why why.
>>>
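Some words appear more than once because the set is case-sensitive (WHALE, Whale, whale) and because the tokenizer left a trailing period attached to some tokens (whale and whale.).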
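One way to collapse these duplicates is to normalize each token before adding it to the set. The following is only a sketch: the `normalize` helper and the short `sample` list are made up for illustration (the sample reuses a few tokens from the output above), and stripping all trailing punctuation is an assumption that may be too aggressive for other texts.

```python
import re

def normalize(token):
    """Lowercase a token and strip trailing punctuation such as '.' or '.*'.

    Hypothetical helper: apostrophes are kept so that "whale's" survives.
    """
    return re.sub(r"[^\w']+$", "", token.lower())

# A few tokens taken from the output above, used as a stand-in for `words`.
sample = ["WHALE", "Whale", "whale", "whale.", "whale.*",
          "Whale's.", "whale-ship.", "What", "what."]

# Normalize first, then filter and deduplicate.
normalized = (normalize(t) for t in sample)
wh_words = sorted(set(t for t in normalized if t.startswith("wh")))
print(wh_words)
```

With the real data, `sample` would be replaced by the token list returned by `nltk.word_tokenize(text)`.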
19.
>>> nlist = open('word_number.txt').readlines()
>>> nlist
['fuzzy 53\n', 'funny 44\n', 'future 65\n', 'fun 12\n', 'gun 48\n', 'music 33\n', 'punk 21\n', 'quick 9\n', 'run 71\n', 'sun 42\n', 'tunnel 18\n']
>>> slist = re.split(r' ', nlist)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 167, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: expected string or buffer
re.split() expects a string, not a list, so it has to be applied to each element in a for loop.
>>> nlist2 = []
>>> for element in nlist:
...     nlist2.append(re.split(r' ', element))
...
>>> nlist2
[['fuzzy', '53\n'], ['funny', '44\n'], ['future', '65\n'], ['fun', '12\n'], ['gun', '48\n'], ['music', '33\n'], ['punk', '21\n'], ['quick', '9\n'], ['run', '71\n'], ['sun', '42\n'], ['tunnel', '18\n']]
>>> for element in nlist2:
...     element[1] = int(element[1])
...
>>> nlist2
[['fuzzy', 53], ['funny', 44], ['future', 65], ['fun', 12], ['gun', 48], ['music', 33], ['punk', 21], ['quick', 9], ['run', 71], ['sun', 42], ['tunnel', 18]]
>>>
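The two loops can probably be folded into a single pass. This is a sketch rather than the exercise's official answer: `parse_line` is a hypothetical helper, and the hard-coded `lines` list stands in for the result of `readlines()` so that the example runs without the file. It uses `str.split()` with no argument, which splits on any whitespace and discards the trailing newline.

```python
def parse_line(line):
    """Turn a line such as 'fuzzy 53\\n' into ['fuzzy', 53]."""
    word, number = line.split()  # no argument: split on any run of whitespace
    return [word, int(number)]

# Stand-in for open('word_number.txt').readlines()
lines = ['fuzzy 53\n', 'funny 44\n', 'future 65\n', 'fun 12\n']

# Skip blank lines so a trailing empty line does not raise an error.
pairs = [parse_line(line) for line in lines if line.strip()]
print(pairs)
```

int() already tolerates surrounding whitespace, which is why the original loop worked on '53\n', but splitting on whitespace keeps the intermediate list clean as well.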
20.
Use this URL: http://weather.yahoo.com/china/shanghai/shanghai-2151849/
>>> from urllib import urlopen
>>> url = 'http://weather.yahoo.com/china/shanghai/shanghai-2151849/'
>>> html = urlopen(url).read()
Then remove the HTML tags:
>>> raw = nltk.clean_html(html)
>>> raw.index('Today')
3308
>>> raw[3308:3408]
'Today Mostly Cloudy High 83° High 28° Low 70° Low 21° Tomorrow Scattered Thunde'
>>>
I looked up the index of the word 'Today' and then took the 100 characters starting at that index.
Today is mostly cloudy, and the high will be 83°F/28°C. We are still in May, aren't we? It is too hot!
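The "find the keyword, then slice" step could be wrapped in a small function. This is a minimal sketch: `snippet_after` is a hypothetical name, and `sample` is an invented string loosely modelled on the cleaned page text above, not the real page. It uses `str.find()` instead of `index()` so that a missing keyword returns None rather than raising ValueError.

```python
def snippet_after(text, keyword, length=100):
    """Return `length` characters of `text` starting at the first
    occurrence of `keyword`, or None if the keyword is absent."""
    start = text.find(keyword)
    if start == -1:
        return None
    return text[start:start + length]

# Invented sample text standing in for the output of nltk.clean_html(html).
sample = "Shanghai Weather  Today Mostly Cloudy High 83 Low 70 Tomorrow Scattered Thunderstorms"

print(snippet_after(sample, "Today", 30))
print(snippet_after(sample, "Yesterday"))
```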
21.
I used this web page as a sample:
Yuan gains strength as PBOC sets record rate
I ran a few small tests before writing any functions:
>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/" >>> html = urlopen(url).read() >>> raw = nltk.clean_html(html) >>> raw "Yuan gains strength as PBOC sets record rate -- Shanghai Daily | \xe4\xb8\x8a\xe6\xb5\xb7\xe6\x97\xa5\xe6\x8a\xa5 -- English Window to China New \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n\r\n\r\n \r\n \r\n \r\n\r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t Yuan gains strength as PBOC sets record rate http://www.shanghaidaily.com/article/?id=531420&type=Business \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t\t\r\n\t\t \r\n\t\t \r\n\r\n \r\n\t\t Mobile Version | \r\n\t\t\tSaturday, 25 May, 2013 | Last updated 18 minutes ago\r\n\t\t \r\n\t\t\r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t\t\r\n\t\t \r\n\t\t\r\n\t \r\n\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\r\n\t \r\n\t\t \r\n\t\t\t Metro \r\n\t\t \r\n\t\t \r\n\t\t\t Business \r\n\t\t \r\n\t\t \r\n\t\t\t National \r\n\t\t \r\n\t\t \r\n\t\t\t World \r\n\t\t \r\n\t\t \r\n\t\t\t Sports \r\n\t\t \r\n\t\t \r\n\t\t\t Feature \r\n\t\t \r\n\t\t \r\n\t\t\t Opinion \r\n\t\t \r\n\t\t \r\n\t\t\t V IBE \r\n\t\t \r\n\t\t \r\n\t\t\t i DEAL \r\n\t\t \r\n\t\t\t \r\n\t\t\t PDF \r\n\t\t \r\n\t\t \r\n\t\t\t Gallery \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\r\n\t \r\n\t\t \r\n\t\t RSS | MMS Newspaper | Newsletter \r\n\t\t \r\n\t \r\n \r\n\r\n\t \r\n\t \r\n\t\r\n\t\t\r\n\t\t \r\n\t\t\t Business | Economy \r\n\t\t\t Yuan gains strength as PBOC sets record rate \r\n\t\t\t \r\n\t\t\t\t\r\n\t\t\t\tBy Feng Jianmin | \r\n\t\t\t\t2013-5-25 | \r\n\t\t\t\t\r\n\t\t\t\t NEWSPAPER EDITION\r\n\t\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t\r\n\t \r\n\t\t \r\n\t\tThe story appears on \r\n\t\t Page A7 \r\n\t\t \r\n\t\tMay 25, 2013\r\n\t\t \r\n\t\tFree for subscribers\r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t Shopping Cart \r\n\t\t \r\n\t \r\n\t \r\n\t\r\n 
Reading Tools \r\n \r\n Email Story \r\n Printable View \r\n Blog Story \r\n Copy Headline/URL \r\n \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\r\n\r\n \r\n \r\n \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\t Keywords \r\n\t\t\t\t \r\n\t\t\t\t Financial crisis \r\n\t\t\t\t \r\n\t\t\t\t 3G network \r\n\t\t\t\t \r\n\t\t\t\t Shanghai stock market \r\n\t\t\t\t \r\n\t\t\t\t Housing price \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\t\t\t Related Stories \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t Huiyuan to buy unit from its chairman \r\n\t\t\t\t 2013-5-25 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Yuan extends rising streak \r\n\t\t\t\t 2013-5-4 0:07:42 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Yuan band widening in the works \r\n\t\t\t\t 2013-4-20 0:42:09 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Peng's popularity pushes demand for home br... \r\n\t\t\t\t 2013-3-30 1:18:33 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Peng Liyuan steals hearts on first trip \r\n\t\t\t\t 2013-3-23 1:45:11 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t Read More \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t\r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t\r\n\r\n\t\t\t\t\r\n\t\t\t\r\n\t\r\n \r\n\t\t\t\r\n\t\t\t\r\n\r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\r\n \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t \r\n\t\t\t\t\tTHE Chinese yuan yesterday strengthened above the 6.13 level against the US dollar on the spot market for the first time in 19 years after the central bank set a record reference rate for the currency and Premier Li Keqiang reiterated the country was making progress in opening up its capital account. The yuan closed at 6.1316 per dollar in Shanghai yesterday, 0.04 percent stronger than Thursday, according to the China Foreign Exchange Trade System. The yuan touched an intraday high of 6.1279, the strongest since the government unified the official and market rates at the end of 1993. 
The People's Bank of China raised the central parity rate by 0.13 percent to 6.1867 per US dollar yesterday before the market opened. It was the third time that the PBOC had raised the daily fixing to a record in a week, guiding the market rate up 0.17 percent from May 17. The nation is steadily pushing forward market-oriented reforms on interest rates and capital-account convertibility, Premier Li said in a signed article in Neue Zuricher Zeitung, a German-language Swiss newspaper, on Thursday. \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t \r\n\t\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Email Story \r\n\t\t\t\t Printable View \r\n\t\t\t\t Blog Story \r\n\t\t\t\t Copy Headline/URL \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\r\n\t\t\t \r\n \r\n\t\t\t\r\n\t\t\t\r\n\t\t\t \r\n\r\n\t\t \r\n\t\t \r\n\t\t\r\n\t\r\n\t \r\n\t\r\n\t \r\n\t\t\r\n\r\n \r\n\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t News text \r\n\t\t\t News title \r\n\t\t\t Photo captions \r\n\t\t\t Live in Shanghai \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t Advanced Search \r\n \r\n \r\n \r\n \r\n \r\n Our Products \r\n \r\n \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t Home \r\n\t\t\t\tDelivery \r\n\t\t\t Online \r\n\t\t\t\tAccount \r\n\t\t\t Amazon \r\n\t\t\t\tKindle \r\n\t\t\t iPhone \r\n\t\t\t\tApp \r\n\t\t\t iPad \r\n\t\t\t\tApp \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t Blackberry Phone App \r\n\t\t\t PlayBook \r\n\t\t\t\tApp \r\n\t\t\t Android \r\n\t\t\t\tApp \r\n\t\t\t Windows Phone App \r\n\t\t\t MMS \r\n\t\t\t\t \xe6\x89\x8b\xe6\x9c\xba\xe6\x8a\xa5 \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t \r\n \r\n\r\n\r\n \r\n\r\n\t\r\n\t \r\n\t\t \r\n\t \r\n\r\n\t \r\n\t \r\n \r\n \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t 
\r\n\t \r\n\t \r\n \r\n\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t Metro \r\n\t\t\t\t \r\n\t\t\t\t\t Aging , Crime and public security , Education , Health , Traffic , Urban construction , Weather ... \r\n\t\t\t\t \r\n\t\t\t\t Business \r\n\t\t\t\t \r\n\t\t\t\t\t Banking , Energy , Foreign investment , Insurance , Macro-economy and policy , Real estate , Securities ... \r\n\t\t\t\t \r\n\t\t\t\t National \r\n\t\t\t\t \r\n\t\t\t\t World \r\n\t\t\t\t \r\n\t\t\t\t Odd \r\n\t\t\t\t \r\n\t\t\t\t Districts \r\n\t\t\t\t \r\n\t\t\t\t\t Changning , Hongkou , Huangpu , Jing'an , Luwan , Minhang , Pudong , Putuo , Xuhui , Yangpu , Zhabei ... \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Sport \r\n\t\t\t\t \r\n\t\t\t\t\t Basketball , Boxing , Cricket , Golf , Gymnastics , Ice hockey , Olympics , Rugby union , Soccer , Tennis ... \r\n\t\t\t\t \r\n\t\t\t\t Feature \r\n\t\t\t\t \r\n\t\t\t\t\t Art , City Style , Culture and history , Expat Tales , Fashion , Home Deco , Literature , Music , Stage , Travel ... \r\n\t\t\t\t \r\n\t\t\t\t Opinion \r\n\t\t\t\t \r\n\t\t\t\t\t Chinese perspectives , Foreign perspectives , Shanghai Daily columnists \r\n\t\t\t\t \r\n\t\t\t\t Sunday \r\n\t\t\t\t \r\n\t\t\t\t\t Animal Planet , Book , City Scene , Cover , Film , Food , Home and Deco , Now and Then , People , Style ... 
\r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Supplement \r\n\t\t\t\t \r\n\t\t\t\t Downloads \r\n\t\t\t\t \r\n\t\t\t\t\t PDF , eMagazine \r\n\t\t\t\t \r\n\t\t\t\t Gallery \r\n\t\t\t\t \r\n\t\t\t\t\t Photos , Cartoons , HD Photo Album \r\n\t\t\t\t \r\n\t\t\t\t Blogs \r\n\t\t\t\t \r\n\t\t\t\t\t Buzzword and Shanghai Talk , Word on Street , Team Blog \r\n\t\t\t\t \r\n\t\t\t\t Services \r\n\t\t\t\t \r\n\t\t\t\t\t Subscribe, Advertising Info , Contact Us , RSS Center \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t FEATURED SITES \r\n\t\t\t\t \r\n\t\t\t\t Campus \r\n\t\t\t\t \r\n\t\t\t\t\t Learning , Careers , Students' Club , Prize English , Sense & Simplicity \r\n\t\t\t\t \r\n\t\t\t\t Mini sites \r\n\t\t\t\t \r\n\t\t\t\t\t Undiscovered Zhoushan , Minhang today www.maicaipian.com , Science Podcasting , Elegant Rhythms from the East \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t \r\n \r\n\t \r\n\t\t\r\n\t \r\n\t \r\n\t\t @ CONTACT US | BACK TO TOP \r\n\t \r\n\t \r\n \r\n\r\n\r\n\t \r\n\t \r\n\t\t Metro \r\n\t\t World \r\n\t\t National \r\n\t\t Business \r\n\t\t Sports \r\n\t\t Feature \r\n\t\t Opinion \r\n\t \r\n\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\t \r\n\t\t\t About Shanghai Daily | \r\n\t\t\t About US 5.0 New | \r\n\t\t\t Advertising | \r\n\t\t\t Term of Use | \r\n\t\t\t RSS | \r\n\t\t\t Privacy Policy | \r\n\t\t\t Contact US | \r\n\t\t\t Shanghai World Expo \r\n\t\t \r\n\t\t \xe6\xb2\xaaICP\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaICP\xe5\xa4\x8705050403 | \xe7\xbd\x91\xe7\xbb\x9c\xe8\xa7\x86\xe5\x90\xac\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a0909346 | \xe5\xb9\xbf\xe6\x92\xad\xe7\x94\xb5\xe8\xa7\x86\xe8\x8a\x82\xe7\x9b\xae\xe5\x88\xb6\xe4\xbd\x9c\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaa\xe5\xad\x97\xe7\xac\xac354\xe5\x8f\xb7 | \xe5\xa2\x9e\xe5\x80\xbc\xe7\x94\xb5\xe4\xbf\xa1\xe4\xb8\x9a\xe5\x8a\xa1\xe7\xbb\x8f\xe8\x90\xa5\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaB2-20120012 \r\n\t\t Copyright \xc2\xa9 
2001- Shanghai Daily Publishing House. All rights reserved." >>>
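First attempt: split on single whitespace characters with re.split(). Every run of consecutive whitespace leaves a series of empty strings in the result: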
>>> words = re.split(r'\s', raw) >>> words ['Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', '--', 'Shanghai', 'Daily', '|', '\xe4\xb8\x8a\xe6\xb5\xb7\xe6\x97\xa5\xe6\x8a\xa5', '--', 'English', 'Window', 'to', 'China', 'New', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'http://www.shanghaidaily.com/article/?id=531420&type=Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Mobile', 'Version', '|', '', '', '', '', '', 'Saturday,', '25', 'May,', '2013', '|', 'Last', 'updated', '18', 'minutes', 'ago', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'National', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'World', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sports', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Feature', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'V', 'IBE', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'i', 
'DEAL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'PDF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Gallery', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'RSS', '|', 'MMS', 'Newspaper', '|', 'Newsletter', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '|', 'Economy', '', '', '', '', '', '', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'By', 'Feng', 'Jianmin', '|', '', '', '', '', '', '', '2013-5-25', '|', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NEWSPAPER', 'EDITION', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'The', 'story', 'appears', 'on', '', '', '', '', '', 'Page', 'A7', '', '', '', '', '', '', '', '', '', 'May', '25,', '2013', '', '', '', '', '', '', '', '', 'Free', 'for', 'subscribers', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Shopping', 'Cart', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Reading', 'Tools', '', '', '', '', '', '', '', 'Email', 'Story', '', '', '', '', 'Printable', 'View', '', '', '', '', 'Blog', 'Story', '', '', '', '', 'Copy', 'Headline/URL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Keywords', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Financial', 'crisis', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '3G', 'network', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Shanghai', 'stock', 'market', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Housing', 'price', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Related', 'Stories', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Huiyuan', 'to', 'buy', 'unit', 'from', 'its', 'chairman', '', '', '', '', '', '', '', '2013-5-25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'extends', 'rising', 'streak', '', '', '', '', '', '', '', '2013-5-4', '0:07:42', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'band', 'widening', 'in', 'the', 'works', '', '', '', '', '', '', '', '2013-4-20', '0:42:09', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', "Peng's", 'popularity', 'pushes', 'demand', 'for', 'home', 'br...', '', '', '', '', '', '', '', '2013-3-30', '1:18:33', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Peng', 'Liyuan', 'steals', 'hearts', 'on', 'first', 'trip', '', '', '', '', '', '', '', '2013-3-23', '1:45:11', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Read', 'More', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'THE', 'Chinese', 'yuan', 'yesterday', 'strengthened', 'above', 'the', '6.13', 'level', 'against', 'the', 'US', 'dollar', 'on', 'the', 'spot', 'market', 'for', 'the', 'first', 'time', 'in', '19', 'years', 'after', 'the', 'central', 'bank', 'set', 'a', 'record', 'reference', 'rate', 'for', 'the', 'currency', 'and', 'Premier', 'Li', 'Keqiang', 'reiterated', 'the', 'country', 'was', 'making', 'progress', 'in', 'opening', 'up', 'its', 'capital', 'account.', 'The', 'yuan', 'closed', 'at', '6.1316', 'per', 'dollar', 'in', 'Shanghai', 'yesterday,', '0.04', 'percent', 'stronger', 'than', 'Thursday,', 'according', 'to', 'the', 'China', 'Foreign', 'Exchange', 'Trade', 'System.', 'The', 'yuan', 'touched', 'an', 'intraday', 'high', 'of', '6.1279,', 'the', 'strongest', 'since', 'the', 'government', 'unified', 'the', 'official', 'and', 'market', 'rates', 'at', 'the', 'end', 'of', '1993.', 'The', "People's", 'Bank', 'of', 'China', 'raised', 'the', 'central', 'parity', 'rate', 'by', '0.13', 'percent', 'to', '6.1867', 'per', 'US', 'dollar', 'yesterday', 'before', 'the', 'market', 'opened.', 'It', 'was', 'the', 'third', 'time', 'that', 'the', 'PBOC', 'had', 'raised', 'the', 'daily', 'fixing', 'to', 'a', 'record', 'in', 'a', 'week,', 'guiding', 'the', 'market', 'rate', 'up', '0.17', 'percent', 'from', 'May', '17.', 'The', 'nation', 'is', 'steadily', 'pushing', 'forward', 'market-oriented', 'reforms', 'on', 'interest', 'rates', 'and', 'capital-account', 'convertibility,', 'Premier', 'Li', 'said', 'in', 'a', 'signed', 'article', 'in', 'Neue', 'Zuricher', 'Zeitung,', 'a', 'German-language', 'Swiss', 'newspaper,', 'on', 'Thursday.', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Email', 'Story', '', '', '', '', '', '', '', '', 'Printable', 'View', '', '', '', '', '', '', '', '', 'Blog', 'Story', '', '', '', '', '', '', '', '', 'Copy', 'Headline/URL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'News', 'text', '', '', '', '', '', '', 'News', 'title', '', '', '', '', '', '', 'Photo', 'captions', '', '', '', '', '', '', 'Live', 'in', 'Shanghai', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Advanced', 'Search', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Our', 'Products', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Home', '', '', '', '', '', '', 'Delivery', '', '', '', '', '', '', 'Online', '', '', '', '', '', '', 'Account', '', '', '', '', '', '', 'Amazon', '', '', '', '', '', '', 'Kindle', '', '', '', '', '', '', 'iPhone', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'iPad', '', '', '', '', '', '', 'App', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Blackberry', 'Phone', 'App', '', '', '', '', '', '', 'PlayBook', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'Android', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'Windows', 'Phone', 'App', '', '', '', '', '', '', 'MMS', '', '', '', '', '', '', '', '\xe6\x89\x8b\xe6\x9c\xba\xe6\x8a\xa5', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Aging', ',', 'Crime', 'and', 'public', 'security', ',', 'Education', ',', 'Health', ',', 'Traffic', ',', 'Urban', 'construction', ',', 'Weather', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Banking', ',', 'Energy', ',', 'Foreign', 'investment', ',', 'Insurance', ',', 'Macro-economy', 'and', 'policy', ',', 'Real', 'estate', ',', 'Securities', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'National', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'World', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Odd', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Districts', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Changning', ',', 'Hongkou', ',', 'Huangpu', 
',', "Jing'an", ',', 'Luwan', ',', 'Minhang', ',', 'Pudong', ',', 'Putuo', ',', 'Xuhui', ',', 'Yangpu', ',', 'Zhabei', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sport', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Basketball', ',', 'Boxing', ',', 'Cricket', ',', 'Golf', ',', 'Gymnastics', ',', 'Ice', 'hockey', ',', 'Olympics', ',', 'Rugby', 'union', ',', 'Soccer', ',', 'Tennis', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Feature', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Art', ',', 'City', 'Style', ',', 'Culture', 'and', 'history', ',', 'Expat', 'Tales', ',', 'Fashion', ',', 'Home', 'Deco', ',', 'Literature', ',', 'Music', ',', 'Stage', ',', 'Travel', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Chinese', 'perspectives', ',', 'Foreign', 'perspectives', ',', 'Shanghai', 'Daily', 'columnists', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sunday', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Animal', 'Planet', ',', 'Book', ',', 'City', 'Scene', ',', 'Cover', ',', 'Film', ',', 'Food', ',', 'Home', 'and', 'Deco', ',', 'Now', 'and', 'Then', ',', 'People', ',', 'Style', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Supplement', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Downloads', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'PDF', ',', 'eMagazine', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Gallery', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Photos', ',', 'Cartoons', ',', 'HD', 'Photo', 'Album', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Blogs', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Buzzword', 'and', 'Shanghai', 
'Talk', ',', 'Word', 'on', 'Street', ',', 'Team', 'Blog', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Services', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Subscribe,', 'Advertising', 'Info', ',', 'Contact', 'Us', ',', 'RSS', 'Center', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FEATURED', 'SITES', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Campus', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Learning', ',', 'Careers', ',', "Students'", 'Club', ',', 'Prize', 'English', ',', 'Sense', '&', 'Simplicity', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Mini', 'sites', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Undiscovered', 'Zhoushan', ',', 'Minhang', 'today', 'www.maicaipian.com', ',', 'Science', 'Podcasting', ',', 'Elegant', 'Rhythms', 'from', 'the', 'East', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '@', 'CONTACT', 'US', '|', '', 'BACK', 'TO', 'TOP', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', 'World', '', '', '', '', '', 'National', '', '', '', '', '', 'Business', '', '', '', '', '', 'Sports', '', '', '', '', '', 'Feature', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'About', 'Shanghai', 'Daily', '|', '', '', '', '', '', '', 'About', 'US', '5.0', 'New', '|', '', '', '', '', '', '', 'Advertising', '|', '', '', '', '', '', '', 'Term', 'of', 'Use', '|', '', '', '', '', '', '', 'RSS', '|', '', '', '', '', '', '', 'Privacy', 'Policy', '|', '', '', '', '', '', '', 'Contact', 'US', '|', '', '', '', '', '', 
'', 'Shanghai', 'World', 'Expo', '', '', '', '', '', '', '', '', '', '', '\xe6\xb2\xaaICP\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaICP\xe5\xa4\x8705050403', '|', '\xe7\xbd\x91\xe7\xbb\x9c\xe8\xa7\x86\xe5\x90\xac\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a0909346', '|', '\xe5\xb9\xbf\xe6\x92\xad\xe7\x94\xb5\xe8\xa7\x86\xe8\x8a\x82\xe7\x9b\xae\xe5\x88\xb6\xe4\xbd\x9c\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaa\xe5\xad\x97\xe7\xac\xac354\xe5\x8f\xb7', '|', '\xe5\xa2\x9e\xe5\x80\xbc\xe7\x94\xb5\xe4\xbf\xa1\xe4\xb8\x9a\xe5\x8a\xa1\xe7\xbb\x8f\xe8\x90\xa5\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaB2-20120012', '', '', '', '', '', 'Copyright', '\xc2\xa9', '2001-', 'Shanghai', 'Daily', 'Publishing', 'House.', 'All', 'rights', 'reserved.'] >>>
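The empty strings appear because r'\s' matches one whitespace character at a time, so each extra character in a run produces an empty field. A sketch of three ways around this, using a short made-up `raw` string in place of the real page text:

```python
import re

# Short stand-in for the cleaned page text, with runs of mixed whitespace.
raw = "Yuan gains strength \r\n\r\n\t  as PBOC sets record rate \r\n"

single = re.split(r'\s', raw)             # one split per whitespace character
filtered = [w for w in single if w]       # option 1: drop the empty strings afterwards
by_run = re.split(r'\s+', raw.strip())    # option 2: split on runs of whitespace
words = raw.split()                       # option 3: str.split() with no argument

print(single.count(''))
print(words)
```

All three options give the same word list here; `raw.split()` is the shortest and needs no regular expression.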
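Alternatively, extract the words directly with re.findall(). This avoids the empty strings, but it also breaks up decimal numbers (6.13 becomes '6', '13'), hyphenated words and possessives, and it drops the Chinese text: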
>>> words2 = re.findall(r'\w+', raw) >>> words2 ['Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'Shanghai', 'Daily', 'English', 'Window', 'to', 'China', 'New', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'http', 'www', 'shanghaidaily', 'com', 'article', 'id', '531420', 'type', 'Business', 'Mobile', 'Version', 'Saturday', '25', 'May', '2013', 'Last', 'updated', '18', 'minutes', 'ago', 'Metro', 'Business', 'National', 'World', 'Sports', 'Feature', 'Opinion', 'V', 'IBE', 'i', 'DEAL', 'PDF', 'Gallery', 'RSS', 'MMS', 'Newspaper', 'Newsletter', 'Business', 'Economy', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'By', 'Feng', 'Jianmin', '2013', '5', '25', 'NEWSPAPER', 'EDITION', 'The', 'story', 'appears', 'on', 'Page', 'A7', 'May', '25', '2013', 'Free', 'for', 'subscribers', 'Shopping', 'Cart', 'Reading', 'Tools', 'Email', 'Story', 'Printable', 'View', 'Blog', 'Story', 'Copy', 'Headline', 'URL', 'Keywords', 'Financial', 'crisis', '3G', 'network', 'Shanghai', 'stock', 'market', 'Housing', 'price', 'Related', 'Stories', 'Huiyuan', 'to', 'buy', 'unit', 'from', 'its', 'chairman', '2013', '5', '25', 'Yuan', 'extends', 'rising', 'streak', '2013', '5', '4', '0', '07', '42', 'Yuan', 'band', 'widening', 'in', 'the', 'works', '2013', '4', '20', '0', '42', '09', 'Peng', 's', 'popularity', 'pushes', 'demand', 'for', 'home', 'br', '2013', '3', '30', '1', '18', '33', 'Peng', 'Liyuan', 'steals', 'hearts', 'on', 'first', 'trip', '2013', '3', '23', '1', '45', '11', 'Read', 'More', 'THE', 'Chinese', 'yuan', 'yesterday', 'strengthened', 'above', 'the', '6', '13', 'level', 'against', 'the', 'US', 'dollar', 'on', 'the', 'spot', 'market', 'for', 'the', 'first', 'time', 'in', '19', 'years', 'after', 'the', 'central', 'bank', 'set', 'a', 'record', 'reference', 'rate', 'for', 'the', 'currency', 'and', 'Premier', 'Li', 'Keqiang', 'reiterated', 'the', 'country', 'was', 'making', 'progress', 'in', 'opening', 'up', 'its', 
'capital', 'account', 'The', 'yuan', 'closed', 'at', '6', '1316', 'per', 'dollar', 'in', 'Shanghai', 'yesterday', '0', '04', 'percent', 'stronger', 'than', 'Thursday', 'according', 'to', 'the', 'China', 'Foreign', 'Exchange', 'Trade', 'System', 'The', 'yuan', 'touched', 'an', 'intraday', 'high', 'of', '6', '1279', 'the', 'strongest', 'since', 'the', 'government', 'unified', 'the', 'official', 'and', 'market', 'rates', 'at', 'the', 'end', 'of', '1993', 'The', 'People', 's', 'Bank', 'of', 'China', 'raised', 'the', 'central', 'parity', 'rate', 'by', '0', '13', 'percent', 'to', '6', '1867', 'per', 'US', 'dollar', 'yesterday', 'before', 'the', 'market', 'opened', 'It', 'was', 'the', 'third', 'time', 'that', 'the', 'PBOC', 'had', 'raised', 'the', 'daily', 'fixing', 'to', 'a', 'record', 'in', 'a', 'week', 'guiding', 'the', 'market', 'rate', 'up', '0', '17', 'percent', 'from', 'May', '17', 'The', 'nation', 'is', 'steadily', 'pushing', 'forward', 'market', 'oriented', 'reforms', 'on', 'interest', 'rates', 'and', 'capital', 'account', 'convertibility', 'Premier', 'Li', 'said', 'in', 'a', 'signed', 'article', 'in', 'Neue', 'Zuricher', 'Zeitung', 'a', 'German', 'language', 'Swiss', 'newspaper', 'on', 'Thursday', 'Email', 'Story', 'Printable', 'View', 'Blog', 'Story', 'Copy', 'Headline', 'URL', 'News', 'text', 'News', 'title', 'Photo', 'captions', 'Live', 'in', 'Shanghai', 'Advanced', 'Search', 'Our', 'Products', 'Home', 'Delivery', 'Online', 'Account', 'Amazon', 'Kindle', 'iPhone', 'App', 'iPad', 'App', 'Blackberry', 'Phone', 'App', 'PlayBook', 'App', 'Android', 'App', 'Windows', 'Phone', 'App', 'MMS', 'Metro', 'Aging', 'Crime', 'and', 'public', 'security', 'Education', 'Health', 'Traffic', 'Urban', 'construction', 'Weather', 'Business', 'Banking', 'Energy', 'Foreign', 'investment', 'Insurance', 'Macro', 'economy', 'and', 'policy', 'Real', 'estate', 'Securities', 'National', 'World', 'Odd', 'Districts', 'Changning', 'Hongkou', 'Huangpu', 'Jing', 'an', 'Luwan', 'Minhang', 
'Pudong', 'Putuo', 'Xuhui', 'Yangpu', 'Zhabei', 'Sport', 'Basketball', 'Boxing', 'Cricket', 'Golf', 'Gymnastics', 'Ice', 'hockey', 'Olympics', 'Rugby', 'union', 'Soccer', 'Tennis', 'Feature', 'Art', 'City', 'Style', 'Culture', 'and', 'history', 'Expat', 'Tales', 'Fashion', 'Home', 'Deco', 'Literature', 'Music', 'Stage', 'Travel', 'Opinion', 'Chinese', 'perspectives', 'Foreign', 'perspectives', 'Shanghai', 'Daily', 'columnists', 'Sunday', 'Animal', 'Planet', 'Book', 'City', 'Scene', 'Cover', 'Film', 'Food', 'Home', 'and', 'Deco', 'Now', 'and', 'Then', 'People', 'Style', 'Supplement', 'Downloads', 'PDF', 'eMagazine', 'Gallery', 'Photos', 'Cartoons', 'HD', 'Photo', 'Album', 'Blogs', 'Buzzword', 'and', 'Shanghai', 'Talk', 'Word', 'on', 'Street', 'Team', 'Blog', 'Services', 'Subscribe', 'Advertising', 'Info', 'Contact', 'Us', 'RSS', 'Center', 'FEATURED', 'SITES', 'Campus', 'Learning', 'Careers', 'Students', 'Club', 'Prize', 'English', 'Sense', 'Simplicity', 'Mini', 'sites', 'Undiscovered', 'Zhoushan', 'Minhang', 'today', 'www', 'maicaipian', 'com', 'Science', 'Podcasting', 'Elegant', 'Rhythms', 'from', 'the', 'East', 'CONTACT', 'US', 'BACK', 'TO', 'TOP', 'Metro', 'World', 'National', 'Business', 'Sports', 'Feature', 'Opinion', 'About', 'Shanghai', 'Daily', 'About', 'US', '5', '0', 'New', 'Advertising', 'Term', 'of', 'Use', 'RSS', 'Privacy', 'Policy', 'Contact', 'US', 'Shanghai', 'World', 'Expo', 'ICP', 'ICP', '05050403', '0909346', '354', 'B2', '20120012', 'Copyright', '2001', 'Shanghai', 'Daily', 'Publishing', 'House', 'All', 'rights', 'reserved'] >>>
This is a better result than split(). Next, get a sorted list.
>>> sorted(set([w.lower() for w in words2 if w.isalpha()])) ['a', 'about', 'above', 'according', 'account', 'advanced', 'advertising', 'after', 'against', 'aging', 'ago', 'album', 'all', 'amazon', 'an', 'and', 'android', 'animal', 'app', 'appears', 'art', 'article', 'as', 'at', 'back', 'band', 'bank', 'banking', 'basketball', 'before', 'blackberry', 'blog', 'blogs', 'book', 'boxing', 'br', 'business', 'buy', 'buzzword', 'by', 'campus', 'capital', 'captions', 'careers', 'cart', 'cartoons', 'center', 'central', 'chairman', 'changning', 'china', 'chinese', 'city', 'closed', 'club', 'columnists', 'com', 'construction', 'contact', 'convertibility', 'copy', 'copyright', 'country', 'cover', 'cricket', 'crime', 'crisis', 'culture', 'currency', 'daily', 'deal', 'deco', 'delivery', 'demand', 'districts', 'dollar', 'downloads', 'east', 'economy', 'edition', 'education', 'elegant', 'emagazine', 'email', 'end', 'energy', 'english', 'estate', 'exchange', 'expat', 'expo', 'extends', 'fashion', 'feature', 'featured', 'feng', 'film', 'financial', 'first', 'fixing', 'food', 'for', 'foreign', 'forward', 'free', 'from', 'gains', 'gallery', 'german', 'golf', 'government', 'guiding', 'gymnastics', 'had', 'hd', 'headline', 'health', 'hearts', 'high', 'history', 'hockey', 'home', 'hongkou', 'house', 'housing', 'http', 'huangpu', 'huiyuan', 'i', 'ibe', 'ice', 'icp', 'id', 'in', 'info', 'insurance', 'interest', 'intraday', 'investment', 'ipad', 'iphone', 'is', 'it', 'its', 'jianmin', 'jing', 'keqiang', 'keywords', 'kindle', 'language', 'last', 'learning', 'level', 'li', 'literature', 'live', 'liyuan', 'luwan', 'macro', 'maicaipian', 'making', 'market', 'may', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'mobile', 'more', 'music', 'nation', 'national', 'network', 'neue', 'new', 'news', 'newsletter', 'newspaper', 'now', 'odd', 'of', 'official', 'olympics', 'on', 'online', 'opened', 'opening', 'opinion', 'oriented', 'our', 'page', 'parity', 'pboc', 'pdf', 'peng', 'people', 'per', 'percent', 
'perspectives', 'phone', 'photo', 'photos', 'planet', 'playbook', 'podcasting', 'policy', 'popularity', 'premier', 'price', 'printable', 'privacy', 'prize', 'products', 'progress', 'public', 'publishing', 'pudong', 'pushes', 'pushing', 'putuo', 'raised', 'rate', 'rates', 'read', 'reading', 'real', 'record', 'reference', 'reforms', 'reiterated', 'related', 'reserved', 'rhythms', 'rights', 'rising', 'rss', 'rugby', 's', 'said', 'saturday', 'scene', 'science', 'search', 'securities', 'security', 'sense', 'services', 'set', 'sets', 'shanghai', 'shanghaidaily', 'shopping', 'signed', 'simplicity', 'since', 'sites', 'soccer', 'sport', 'sports', 'spot', 'stage', 'steadily', 'steals', 'stock', 'stories', 'story', 'streak', 'street', 'strength', 'strengthened', 'stronger', 'strongest', 'students', 'style', 'subscribe', 'subscribers', 'sunday', 'supplement', 'swiss', 'system', 'tales', 'talk', 'team', 'tennis', 'term', 'text', 'than', 'that', 'the', 'then', 'third', 'thursday', 'time', 'title', 'to', 'today', 'tools', 'top', 'touched', 'trade', 'traffic', 'travel', 'trip', 'type', 'undiscovered', 'unified', 'union', 'unit', 'up', 'updated', 'urban', 'url', 'us', 'use', 'v', 'version', 'view', 'was', 'weather', 'week', 'widening', 'window', 'windows', 'word', 'works', 'world', 'www', 'xuhui', 'yangpu', 'years', 'yesterday', 'yuan', 'zeitung', 'zhabei', 'zhoushan', 'zuricher'] >>>
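The two steps above (tokenizing with findall() and building a sorted, lowercased word list) can be sketched on a small string. The sample text here is a made-up stand-in for the cleaned page content, not the real download.

```python
import re

# Hypothetical sample text standing in for the cleaned page content.
sample = "Yuan gains strength, as PBOC sets record rate. Yuan extends rising streak!"

# split() only breaks on whitespace, so punctuation stays attached to words.
split_tokens = sample.split()

# findall(r'\w+') keeps only runs of letters, digits and underscores.
regex_tokens = re.findall(r'\w+', sample)

# Lowercase, drop non-alphabetic tokens, remove duplicates, then sort.
vocab = sorted(set(w.lower() for w in regex_tokens if w.isalpha()))

print(split_tokens[2])   # strength,
print(regex_tokens[2])   # strength
print(vocab)
```

With split() the third token keeps its trailing comma, while findall() returns the bare word, which is why the later dictionary lookup works better on the findall() output.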
Then remove the words that are included in nltk.corpus.words.words().
>>> [w for w in words_list if not w in nltk.corpus.words.words()] ['amazon', 'app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'chinese', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'english', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'rugby', 'saturday', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'sunday', 'thursday', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher'] >>>
Then write a function based on the results above.
>>> def unknown(url):
...     raw = nltk.clean_html(urlopen(url).read())
...     # tokenize the page text fetched inside this function
...     words = re.findall(r'\w+', raw)
...     words_list = sorted(set([w.lower() for w in words if w.isalpha()]))
...     unknown_w = [w for w in words_list if w not in nltk.corpus.words.words()]
...     print unknown_w
...
Let's test it.
>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/" >>> unknown(url) ['amazon', 'app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'chinese', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'english', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'rugby', 'saturday', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'sunday', 'thursday', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher'] >>>
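As a variation on unknown(), here is a minimal sketch that separates the word filtering from the download, so it can be tried without a network connection. The name unknown_words and the toy vocabulary are my own assumptions for illustration; in real use the second argument would be nltk.corpus.words.words(). The vocabulary is put into a set once, because testing membership against a long list for every token is slow. Also, as far as I know, later NLTK releases no longer implement nltk.clean_html(), so the HTML stripping step may need another tool there.

```python
import re

def unknown_words(text, vocabulary):
    """Return sorted, lowercased alphabetic tokens of text that are not in vocabulary."""
    # Build the lookup set once instead of scanning a list for every token.
    known = set(vocabulary)
    tokens = re.findall(r'\w+', text)
    candidates = sorted(set(w.lower() for w in tokens if w.isalpha()))
    return [w for w in candidates if w not in known]

# Hypothetical tiny vocabulary standing in for nltk.corpus.words.words().
toy_vocab = ['the', 'yuan', 'bank', 'set', 'a', 'record', 'rate']
result = unknown_words("The PBOC set a record rate for the yuan in 2013", toy_vocab)
print(result)
```

Returning the list instead of printing it makes the function easier to reuse in the analysis that follows.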
Let's analyze the results...
Plurals: blogs, captions, careers, cartoons, columnists...
Words with -ed or -ing added: opened, oriented, publishing...
These could be matched against the vocabulary words (nltk.corpus.words.words()) with good stemming.
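To illustrate the stemming idea, here is a rough sketch that strips a few common suffixes and checks whether any resulting form is in the vocabulary. The helper names and the toy vocabulary are assumptions made up for this example; a real stemmer such as nltk.PorterStemmer would be more robust, although its stems are not always dictionary words.

```python
def crude_stem_candidates(word):
    """Return the word plus simple suffix-stripped variants (hypothetical helper)."""
    candidates = [word]
    for suffix in ('ing', 'ed', 'es', 's'):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[:-len(suffix)]
            candidates.append(stem)
            # Also try restoring a dropped final 'e', e.g. 'guiding' -> 'guide'.
            candidates.append(stem + 'e')
    return candidates

def in_vocab_after_stemming(word, vocab):
    """True if the word or one of its crude stems is in vocab."""
    return any(c in vocab for c in crude_stem_candidates(word))

# Hypothetical tiny vocabulary standing in for nltk.corpus.words.words().
toy_vocab = set(['caption', 'career', 'open', 'orient', 'publish', 'guide'])
test_words = ['captions', 'opened', 'oriented', 'publishing', 'guiding', 'huangpu']
results = dict((w, in_vocab_after_stemming(w, toy_vocab)) for w in test_words)
print(results)
```

In this toy run the inflected forms are found through their stems, while a name like huangpu still counts as unknown.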
Proper nouns: english, chinese, saturday, thursday...
I guess these words are listed here as a side effect of converting all words to lower case. If the first character were changed back to upper case, these words would probably be found in the vocabulary.
Let me try this:
>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/" >>> raw = nltk.clean_html(urlopen(url).read()) >>> words_list = sorted(set([w.lower() for w in words2 if w.isalpha()])) >>> vocab_word = [w.lower() for w in nltk.corpus.words.words()] >>> [w for w in words_list if not w in vocab_word] ['app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher'] >>>
My assumption seems correct: proper nouns such as english, chinese and saturday no longer appear. The vocabulary list should also be converted to lower case before the comparison.
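The effect of case on the lookup can be reproduced with a small sketch. The toy vocabulary below is an assumption standing in for nltk.corpus.words.words(), which, as the experiment above suggests, stores proper nouns with an initial capital.

```python
# Hypothetical excerpt standing in for nltk.corpus.words.words().
toy_vocab = ['English', 'Chinese', 'Saturday', 'market', 'rate']

# Tokens as they look after the page words have been lowercased.
tokens = ['english', 'chinese', 'saturday', 'market', 'pboc']

# Case-sensitive lookup: lowercased proper nouns look unknown.
vocab_exact = set(toy_vocab)
unknown_case_sensitive = [w for w in tokens if w not in vocab_exact]

# Case-insensitive lookup: lowercase the vocabulary once and keep it in a set.
vocab_lower = set(w.lower() for w in toy_vocab)
unknown_case_insensitive = [w for w in tokens if w not in vocab_lower]

print(unknown_case_sensitive)    # ['english', 'chinese', 'saturday', 'pboc']
print(unknown_case_insensitive)  # ['pboc']
```

Keeping the lowercased vocabulary in a set also avoids scanning a long list for every token.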
Chinese-specific words: huangpu, huiyuan, jianmin, hongkou...
These are names of people or places in China. It is not surprising that they are missing from the vocabulary list.
New or internet-specific words: app, blog, buzzword, emagazine, email, online, expo...
One possibility is that the vocabulary list is not up to date, or not appropriate for this kind of internet-based news article.