Exercise: Chapter 3 (18-21)

18.

>>> import nltk, re
>>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
>>> words = nltk.word_tokenize(text)
>>> wh_words = sorted(set([w for w in words if re.search(r'^wh', w.lower())]))
>>> for word in wh_words:
...     print word
... 
WHALE
WHALE-FISHERY.
WHALE-SHIP
WHALE.
WHALEBONE
WHALEMAN
WHALES
WHALESHIPS
WHALING
WHALING.
WHARTON
WHAT
WHEN
WHERE
WHICH
WHIFF
WHITE
WHOEL
Whale
Whale's
Whale's.
Whale-Bones
Whale-balls
Whale-bone
Whale-ship
Whale-ships
Whale-teeth
Whale.
Whalebone
Whaleman
Whalemen
Whaler
Whales
Whales.
Whaling
Whaling.
What
What's
Whatever
Wheelbarrow.
Whelped
When
Whence
Whenever
Where
Where-away
Whereas
Wherefore
Wherein
Whereupon
Whether
Whew
Which
While
Whilst
Whirlpooles
Whisper
White
Whitehall
Whiteness
Whitsuntide
Who
Who's
Who-e
Whole
Whom
Whose
Whosoever
Why
whale
whale's
whale-boat
whale-boat.
whale-boats
whale-bone
whale-books.
whale-craft
whale-cruisers
whale-cry
whale-e
whale-fastener
whale-fish
whale-fishers
whale-fishery
whale-fleet.
whale-ground
whale-hater
whale-hunt
whale-hunter
whale-hunters
whale-hunters.
whale-hunting
whale-jets
whale-killer
whale-lance
whale-lance.
whale-line
whale-line.
whale-lines
whale-lines.
whale-naturalists
whale-pike
whale-pole
whale-ports
whale-ship
whale-ship.
whale-ships
whale-smitten
whale-spades
whale-spout
whale-steak
whale-surgeon
whale-trover
whale-wise
whale.
whale.*
whaleboat
whaleboats
whalebone
whalebone.
whaleboning
whaled
whaleman
whaleman's
whaleman.
whalemen
whalemen's
whalemen.
whaler
whaler.
whalers
whalers.
whales
whales.
whaleship
whaleships
whalesmen
whaling
whaling-craft
whaling-fleet
whaling-pike
whaling-scenes
whaling-ships
whaling-spade
whaling-spades
whaling-vessels
whaling-voyage
whaling.
whang
wharf
wharf.
wharves
wharves.
what
what's
what.
whatever
whatsoever
whatsoever.
wheat
wheat.
wheel
wheel-spokes
wheel.
wheelbarrow
wheeled
wheeling
wheels
wheezing
whelm
whelmed
whelmed.
whelmings
when
whence
whencesoe'er
whenever
where
where'er
where.
whereas
whereat
whereby
wherefore
wherein
whereof
whereon
wheresoe'er
whereto
whereupon
wherever
wherewith
whether
whets
whetstone
whetstones
whew
which
whichever
whiff
whiffs
while
while.
whim
whimsicalities
whimsiness
whip
whipped
whipping
whips
whirl
whirl.
whirled
whirling
whirlpool
whirls
whirlwinds
whisker
whiskers
whiskey
whisper
whispered
whispering
whisperingly
whispers
whispers.
whist-tables
whistle
whistled
whistling
whistlingly
whit
white
white-ash
white-bearded
white-bone
white-elephant
white-fire
white-headed
white-horse
white-lead
white-shrouded
white-turbaned
whitened
whiteness
whiteness.
whitenesses
whites
whitest
whitewashed
whither
whitish
whitish.
whittled
whittling
whittling.
whizzings
who
who-ee
whoever
whole
whole.
wholesome
wholly
whom
whooping
whose
whosoever
why
why.
>>> 

Some words appear more than once because the set is case-sensitive (WHALE, Whale, whale) and because some tokens still carry a trailing period (whale.).
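One way to collapse those duplicates is to normalize each token before adding it to the set. The sketch below is only an illustration, not part of the original session: the function name wh_types and the sample tokens are made up, and it is written for Python 3. It lowercases every token and strips non-letter characters from both ends.

```python
import re

def wh_types(tokens):
    """Return the sorted, de-duplicated wh-word types found in tokens."""
    normalized = set()
    for token in tokens:
        # Lowercase, then strip leading/trailing non-letters such as '.' or '*'
        word = re.sub(r"^[^a-z]+|[^a-z]+$", "", token.lower())
        if word.startswith("wh"):
            normalized.add(word)
    return sorted(normalized)

# Hypothetical sample taken from the kinds of tokens listed above
sample = ["WHALE", "Whale.", "whale", "whale.*", "What's", "what", "ship"]
print(wh_types(sample))
```

Internal apostrophes and hyphens are kept, so what's and whale-ship remain separate types from what and whaleship.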

19.

>>> nlist = open('word_number.txt').readlines()
>>> nlist
['fuzzy 53\n', 'funny 44\n', 'future 65\n', 'fun 12\n', 'gun 48\n', 'music 33\n', 'punk 21\n', 'quick 9\n', 'run 71\n', 'sun 42\n', 'tunnel 18\n']
>>> slist = re.split(r' ', nlist)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 167, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: expected string or buffer

re.split() expects a string, not a list, so split each line inside a for loop instead.

>>> nlist2 = []
>>> for element in nlist:
...     nlist2.append(re.split(r' ', element))
... 
>>> nlist2
[['fuzzy', '53\n'], ['funny', '44\n'], ['future', '65\n'], ['fun', '12\n'], ['gun', '48\n'], ['music', '33\n'], ['punk', '21\n'], ['quick', '9\n'], ['run', '71\n'], ['sun', '42\n'], ['tunnel', '18\n']]
>>> for element in nlist2:
...     element[1] = int(element[1])
... 
>>> nlist2
[['fuzzy', 53], ['funny', 44], ['future', 65], ['fun', 12], ['gun', 48], ['music', 33], ['punk', 21], ['quick', 9], ['run', 71], ['sun', 42], ['tunnel', 18]]
>>> 
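The two loops above can also be folded into one pass. This is a minimal sketch under my own naming (parse_counts is not from the original session), written for Python 3. It uses str.split() with no argument, which splits on any whitespace and therefore discards the trailing newline, and it skips lines that do not have exactly two fields.

```python
def parse_counts(lines):
    """Turn 'word number' lines into [[word, number], ...] with integer counts."""
    pairs = []
    for line in lines:
        fields = line.split()  # splits on any whitespace, so the newline is dropped
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        word, count = fields
        pairs.append([word, int(count)])
    return pairs

# A few lines in the same shape as word_number.txt, plus one blank line
lines = ['fuzzy 53\n', 'funny 44\n', '\n', 'fun 12\n']
print(parse_counts(lines))
```

int() already tolerates surrounding whitespace, which is why int('53\n') worked in the session above; splitting on whitespace just keeps the intermediate list clean.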

20.

Use this URL: http://weather.yahoo.com/china/shanghai/shanghai-2151849/

>>> from urllib import urlopen
>>> url = 'http://weather.yahoo.com/china/shanghai/shanghai-2151849/'
>>> html = urlopen(url).read()

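Then remove the HTML tags. (nltk.clean_html() exists in NLTK 2; NLTK 3 dropped it and recommends BeautifulSoup's get_text() instead.)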

>>> raw = nltk.clean_html(html)
>>> raw.index('Today')
3308
>>> raw[3308:3408]
'Today Mostly Cloudy High 83° High 28° Low 70° Low 21°  Tomorrow Scattered Thunde'
>>> 

I looked up the index of the word 'Today' and then took the 100 characters that follow it.

Today is mostly cloudy and the high will be 83°F/28°C. We are still in May, aren't we? It is too hot!!!
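Instead of reading the numbers off the slice by eye, a regular expression can pull them out. The sketch below is an assumption-laden illustration for Python 3: the function name todays_high is made up, and the pattern assumes the page text keeps the layout seen above, with the Fahrenheit high followed by the Celsius high.

```python
import re

def todays_high(text):
    """Return (condition, high_f, high_c) from the forecast text, or None."""
    # Assumed layout: 'Today <condition> High <F>° High <C>° ...'
    match = re.search(r"Today\s+(.*?)\s+High\s+(\d+)\D+High\s+(\d+)", text)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))

# Same text as the slice above; \u00b0 is the degree sign
snippet = ("Today Mostly Cloudy High 83\u00b0 High 28\u00b0 "
           "Low 70\u00b0 Low 21\u00b0  Tomorrow Scattered Thunde")
print(todays_high(snippet))
```

Returning None when the pattern does not match makes it obvious when the page layout has changed, which a hard-coded index such as 3308 would not.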

21.

I used this web page as a sample:
Yuan gains strength as PBOC sets record rate

Run a few small tests before writing the function:

>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/"
>>> html = urlopen(url).read()
>>> raw = nltk.clean_html(html)
>>> raw
"Yuan gains strength as PBOC sets record rate -- Shanghai Daily | \xe4\xb8\x8a\xe6\xb5\xb7\xe6\x97\xa5\xe6\x8a\xa5 -- English Window to China New \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n\r\n\r\n \r\n \r\n \r\n\r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t Yuan gains strength as PBOC sets record rate http://www.shanghaidaily.com/article/?id=531420&type=Business \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t\t\r\n\t\t \r\n\t\t \r\n\r\n   \r\n\t\t Mobile Version | \r\n\t\t\tSaturday, 25 May, 2013 | Last updated 18 minutes ago\r\n\t\t \r\n\t\t\r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t\t\r\n\t\t \r\n\t\t\r\n\t \r\n\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\r\n\t \r\n\t\t \r\n\t\t\t Metro \r\n\t\t \r\n\t\t \r\n\t\t\t Business \r\n\t\t \r\n\t\t \r\n\t\t\t National \r\n\t\t \r\n\t\t \r\n\t\t\t World \r\n\t\t \r\n\t\t \r\n\t\t\t Sports \r\n\t\t \r\n\t\t \r\n\t\t\t Feature \r\n\t\t \r\n\t\t \r\n\t\t\t Opinion \r\n\t\t \r\n\t\t \r\n\t\t\t V IBE \r\n\t\t \r\n\t\t \r\n\t\t\t i DEAL \r\n\t\t \r\n\t\t\t \r\n\t\t\t PDF \r\n\t\t \r\n\t\t \r\n\t\t\t Gallery \r\n\t\t \r\n\t\t \r\n\t\t\t  \r\n\t\t \r\n\t\t  \r\n\t\t\r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\r\n\t \r\n\t\t \r\n\t\t RSS | MMS Newspaper | Newsletter \r\n\t\t \r\n\t \r\n \r\n\r\n\t \r\n\t \r\n\t\r\n\t\t\r\n\t\t \r\n\t\t\t Business | Economy \r\n\t\t\t Yuan gains strength as PBOC sets record rate \r\n\t\t\t \r\n\t\t\t\t\r\n\t\t\t\tBy Feng Jianmin | \r\n\t\t\t\t2013-5-25 | \r\n\t\t\t\t\r\n\t\t\t\t NEWSPAPER EDITION\r\n\t\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t\r\n\t \r\n\t\t \r\n\t\tThe story appears on \r\n\t\t Page A7 \r\n\t\t \r\n\t\tMay 25, 2013\r\n\t\t \r\n\t\tFree for subscribers\r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t Shopping Cart \r\n\t\t \r\n\t \r\n\t \r\n\t\r\n Reading Tools \r\n \r\n  Email Story \r\n  Printable View \r\n  Blog Story \r\n  Copy Headline/URL \r\n \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\r\n\r\n \r\n \r\n \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\t 
Keywords \r\n\t\t\t\t \r\n\t\t\t\t Financial crisis \r\n\t\t\t\t \r\n\t\t\t\t 3G network \r\n\t\t\t\t \r\n\t\t\t\t Shanghai stock market \r\n\t\t\t\t \r\n\t\t\t\t Housing price \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\t\t\t Related Stories \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t Huiyuan to buy unit from its chairman \r\n\t\t\t\t 2013-5-25 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Yuan extends rising streak \r\n\t\t\t\t 2013-5-4 0:07:42 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Yuan band widening in the works \r\n\t\t\t\t 2013-4-20 0:42:09 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Peng's popularity pushes demand for home br... \r\n\t\t\t\t 2013-3-30 1:18:33 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Peng Liyuan steals hearts on first trip \r\n\t\t\t\t 2013-3-23 1:45:11 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t Read More \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t\r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t\r\n\r\n\t\t\t\t\r\n\t\t\t\r\n\t\r\n      \r\n\t\t\t\r\n\t\t\t\r\n\r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\r\n \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t \r\n\t\t\t\t\tTHE Chinese yuan yesterday strengthened above the 6.13 level against the US dollar on the spot market for the first time in 19 years after the central bank set a record reference rate for the currency and Premier Li Keqiang reiterated the country was making progress in opening up its capital account. The yuan closed at 6.1316 per dollar in Shanghai yesterday, 0.04 percent stronger than Thursday, according to the China Foreign Exchange Trade System. The yuan touched an intraday high of 6.1279, the strongest since the government unified the official and market rates at the end of 1993. The People's Bank of China raised the central parity rate by 0.13 percent to 6.1867 per US dollar yesterday before the market opened. 
It was the third time that the PBOC had raised the daily fixing to a record in a week, guiding the market rate up 0.17 percent from May 17. The nation is steadily pushing forward market-oriented reforms on interest rates and capital-account convertibility, Premier Li said in a signed article in Neue Zuricher Zeitung, a German-language Swiss newspaper, on Thursday. \r\n\t\t\t\t \r\n\t\t\t\t  \r\n\t\t\t\t\r\n\t\t\t\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t \r\n\t\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Email Story \r\n\t\t\t\t  Printable View \r\n\t\t\t\t  Blog Story \r\n\t\t\t\t  Copy Headline/URL \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\r\n\t\t\t      \r\n \r\n\t\t\t\r\n\t\t\t\r\n\t\t\t \r\n\r\n\t\t \r\n\t\t \r\n\t\t\r\n\t\r\n\t \r\n\t\r\n\t \r\n\t\t\r\n\r\n \r\n\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t News text \r\n\t\t\t News title \r\n\t\t\t Photo captions \r\n\t\t\t Live in Shanghai \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t Advanced Search \r\n \r\n \r\n \r\n \r\n \r\n  Our Products \r\n \r\n  \r\n\t\t \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t \r\n\t\t \r\n\t\t\t Home \r\n\t\t\t\tDelivery \r\n\t\t\t Online \r\n\t\t\t\tAccount \r\n\t\t\t Amazon \r\n\t\t\t\tKindle \r\n\t\t\t iPhone \r\n\t\t\t\tApp \r\n\t\t\t iPad \r\n\t\t\t\tApp \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t \r\n\t\t \r\n\t\t\t Blackberry Phone App \r\n\t\t\t PlayBook \r\n\t\t\t\tApp \r\n\t\t\t Android \r\n\t\t\t\tApp \r\n\t\t\t Windows Phone App \r\n\t\t\t MMS \r\n\t\t\t\t \xe6\x89\x8b\xe6\x9c\xba\xe6\x8a\xa5 \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t \r\n  \r\n\r\n\r\n \r\n\r\n\t\r\n\t \r\n\t\t \r\n\t \r\n\r\n\t \r\n\t \r\n \r\n \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t   \r\n\t \r\n \r\n\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t Metro \r\n\t\t\t\t \r\n\t\t\t\t\t 
Aging , Crime and public security , Education , Health , Traffic , Urban construction , Weather ... \r\n\t\t\t\t \r\n\t\t\t\t Business \r\n\t\t\t\t \r\n\t\t\t\t\t Banking , Energy , Foreign investment , Insurance , Macro-economy and policy , Real estate , Securities ... \r\n\t\t\t\t \r\n\t\t\t\t National \r\n\t\t\t\t \r\n\t\t\t\t World \r\n\t\t\t\t \r\n\t\t\t\t Odd \r\n\t\t\t\t \r\n\t\t\t\t Districts \r\n\t\t\t\t \r\n\t\t\t\t\t Changning , Hongkou , Huangpu , Jing'an , Luwan , Minhang , Pudong , Putuo , Xuhui , Yangpu , Zhabei ... \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Sport \r\n\t\t\t\t \r\n\t\t\t\t\t Basketball , Boxing , Cricket , Golf , Gymnastics , Ice hockey , Olympics , Rugby union , Soccer , Tennis ... \r\n\t\t\t\t \r\n\t\t\t\t Feature \r\n\t\t\t\t \r\n\t\t\t\t\t Art , City Style , Culture and history , Expat Tales , Fashion , Home Deco , Literature , Music , Stage , Travel ... \r\n\t\t\t\t \r\n\t\t\t\t Opinion \r\n\t\t\t\t \r\n\t\t\t\t\t Chinese perspectives , Foreign perspectives , Shanghai Daily columnists \r\n\t\t\t\t \r\n\t\t\t\t Sunday \r\n\t\t\t\t \r\n\t\t\t\t\t Animal Planet , Book , City Scene , Cover , Film , Food , Home and Deco , Now and Then , People , Style ... 
\r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Supplement \r\n\t\t\t\t \r\n\t\t\t\t Downloads \r\n\t\t\t\t \r\n\t\t\t\t\t PDF , eMagazine \r\n\t\t\t\t \r\n\t\t\t\t Gallery \r\n\t\t\t\t \r\n\t\t\t\t\t Photos , Cartoons , HD Photo Album \r\n\t\t\t\t \r\n\t\t\t\t Blogs \r\n\t\t\t\t \r\n\t\t\t\t\t Buzzword and Shanghai Talk , Word on Street , Team Blog \r\n\t\t\t\t \r\n\t\t\t\t Services \r\n\t\t\t\t \r\n\t\t\t\t\t Subscribe, Advertising Info , Contact Us , RSS Center \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t FEATURED SITES \r\n\t\t\t\t \r\n\t\t\t\t Campus \r\n\t\t\t\t \r\n\t\t\t\t\t Learning , Careers , Students' Club , Prize English , Sense & Simplicity \r\n\t\t\t\t \r\n\t\t\t\t Mini sites \r\n\t\t\t\t \r\n\t\t\t\t\t Undiscovered Zhoushan , Minhang today www.maicaipian.com , Science Podcasting , Elegant Rhythms from the East \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t \r\n \r\n\t \r\n\t\t\r\n\t \r\n\t \r\n\t\t @ CONTACT US |  BACK TO TOP \r\n\t \r\n\t \r\n \r\n\r\n\r\n\t \r\n\t \r\n\t\t Metro \r\n\t\t World \r\n\t\t National \r\n\t\t Business \r\n\t\t Sports \r\n\t\t Feature \r\n\t\t Opinion \r\n\t \r\n\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\t \r\n\t\t\t About Shanghai Daily | \r\n\t\t\t About US 5.0 New | \r\n\t\t\t Advertising | \r\n\t\t\t Term of Use | \r\n\t\t\t RSS | \r\n\t\t\t Privacy Policy | \r\n\t\t\t Contact US | \r\n\t\t\t Shanghai World Expo \r\n\t\t \r\n\t\t \xe6\xb2\xaaICP\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaICP\xe5\xa4\x8705050403 | \xe7\xbd\x91\xe7\xbb\x9c\xe8\xa7\x86\xe5\x90\xac\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a0909346 | \xe5\xb9\xbf\xe6\x92\xad\xe7\x94\xb5\xe8\xa7\x86\xe8\x8a\x82\xe7\x9b\xae\xe5\x88\xb6\xe4\xbd\x9c\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaa\xe5\xad\x97\xe7\xac\xac354\xe5\x8f\xb7 | \xe5\xa2\x9e\xe5\x80\xbc\xe7\x94\xb5\xe4\xbf\xa1\xe4\xb8\x9a\xe5\x8a\xa1\xe7\xbb\x8f\xe8\x90\xa5\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaB2-20120012 \r\n\t\t Copyright \xc2\xa9 
2001- Shanghai Daily Publishing House. All rights reserved."
>>> 

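Split on whitespace? Splitting on a single \s leaves an empty string between every pair of adjacent whitespace characters: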

>>> words = re.split(r'\s', raw)
>>> words
['Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', '--', 'Shanghai', 'Daily', '|', '\xe4\xb8\x8a\xe6\xb5\xb7\xe6\x97\xa5\xe6\x8a\xa5', '--', 'English', 'Window', 'to', 'China', 'New', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'http://www.shanghaidaily.com/article/?id=531420&type=Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Mobile', 'Version', '|', '', '', '', '', '', 'Saturday,', '25', 'May,', '2013', '|', 'Last', 'updated', '18', 'minutes', 'ago', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'National', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'World', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sports', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Feature', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'V', 'IBE', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'i', 'DEAL', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', 'PDF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Gallery', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'RSS', '|', 'MMS', 'Newspaper', '|', 'Newsletter', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '|', 'Economy', '', '', '', '', '', '', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'By', 'Feng', 'Jianmin', '|', '', '', '', '', '', '', '2013-5-25', '|', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NEWSPAPER', 'EDITION', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'The', 'story', 'appears', 'on', '', '', '', '', '', 'Page', 'A7', '', '', '', '', '', '', '', '', '', 'May', '25,', '2013', '', '', '', '', '', '', '', '', 'Free', 'for', 'subscribers', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Shopping', 'Cart', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Reading', 'Tools', '', '', '', '', '', '', '', 'Email', 'Story', '', '', '', '', 'Printable', 'View', '', '', '', '', 'Blog', 'Story', '', '', '', '', 'Copy', 'Headline/URL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Keywords', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Financial', 'crisis', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '3G', 'network', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Shanghai', 'stock', 'market', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Housing', 'price', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Related', 'Stories', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Huiyuan', 'to', 'buy', 'unit', 'from', 'its', 'chairman', '', '', '', '', '', '', '', '2013-5-25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'extends', 'rising', 'streak', '', '', '', '', '', '', '', '2013-5-4', '0:07:42', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'band', 'widening', 'in', 'the', 'works', '', '', '', '', '', '', '', '2013-4-20', '0:42:09', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', "Peng's", 'popularity', 'pushes', 'demand', 'for', 'home', 'br...', '', '', '', '', '', '', '', '2013-3-30', '1:18:33', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Peng', 'Liyuan', 'steals', 'hearts', 'on', 'first', 'trip', '', '', '', '', '', '', '', '2013-3-23', '1:45:11', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Read', 'More', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'THE', 'Chinese', 'yuan', 'yesterday', 'strengthened', 'above', 'the', '6.13', 'level', 'against', 'the', 'US', 'dollar', 'on', 'the', 'spot', 'market', 'for', 'the', 'first', 'time', 'in', '19', 'years', 'after', 'the', 'central', 'bank', 'set', 'a', 'record', 'reference', 'rate', 'for', 'the', 'currency', 'and', 'Premier', 'Li', 'Keqiang', 'reiterated', 'the', 'country', 'was', 'making', 'progress', 'in', 'opening', 'up', 'its', 'capital', 'account.', 'The', 'yuan', 'closed', 'at', '6.1316', 'per', 'dollar', 'in', 'Shanghai', 'yesterday,', '0.04', 'percent', 'stronger', 'than', 'Thursday,', 'according', 'to', 'the', 'China', 'Foreign', 'Exchange', 'Trade', 'System.', 'The', 'yuan', 'touched', 'an', 'intraday', 'high', 'of', '6.1279,', 'the', 'strongest', 'since', 'the', 'government', 'unified', 'the', 'official', 'and', 'market', 'rates', 'at', 'the', 'end', 'of', '1993.', 'The', "People's", 'Bank', 'of', 'China', 'raised', 'the', 'central', 'parity', 'rate', 'by', '0.13', 'percent', 'to', '6.1867', 'per', 'US', 'dollar', 'yesterday', 'before', 'the', 'market', 'opened.', 'It', 'was', 'the', 'third', 'time', 'that', 'the', 'PBOC', 'had', 'raised', 'the', 'daily', 'fixing', 'to', 'a', 'record', 'in', 'a', 'week,', 'guiding', 'the', 'market', 'rate', 'up', '0.17', 'percent', 'from', 'May', '17.', 'The', 'nation', 'is', 'steadily', 'pushing', 'forward', 'market-oriented', 'reforms', 'on', 'interest', 'rates', 'and', 'capital-account', 'convertibility,', 'Premier', 'Li', 'said', 'in', 'a', 'signed', 'article', 'in', 'Neue', 'Zuricher', 'Zeitung,', 'a', 'German-language', 'Swiss', 'newspaper,', 'on', 'Thursday.', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Email', 'Story', '', '', '', '', '', '', '', '', 'Printable', 'View', '', '', '', '', '', '', '', '', 'Blog', 'Story', '', '', '', '', '', '', '', '', 'Copy', 'Headline/URL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'News', 'text', '', '', '', '', '', '', 'News', 'title', '', '', '', '', '', '', 'Photo', 'captions', '', '', '', '', '', '', 'Live', 'in', 'Shanghai', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Advanced', 'Search', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Our', 'Products', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Home', '', '', '', '', '', '', 'Delivery', '', '', '', '', '', '', 'Online', '', '', '', '', '', '', 'Account', '', '', '', '', '', '', 'Amazon', '', '', '', '', '', '', 'Kindle', '', '', '', '', '', '', 'iPhone', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'iPad', '', '', '', '', '', '', 'App', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Blackberry', 'Phone', 'App', '', '', '', '', '', '', 'PlayBook', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'Android', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'Windows', 'Phone', 'App', '', '', '', '', '', '', 'MMS', '', '', '', '', '', '', '', '\xe6\x89\x8b\xe6\x9c\xba\xe6\x8a\xa5', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Aging', ',', 'Crime', 'and', 'public', 'security', ',', 'Education', ',', 'Health', ',', 'Traffic', ',', 'Urban', 'construction', ',', 'Weather', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Banking', ',', 'Energy', ',', 'Foreign', 'investment', ',', 'Insurance', ',', 'Macro-economy', 'and', 'policy', ',', 'Real', 'estate', ',', 'Securities', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'National', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'World', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Odd', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Districts', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Changning', ',', 'Hongkou', ',', 'Huangpu', ',', "Jing'an", ',', 'Luwan', ',', 
'Minhang', ',', 'Pudong', ',', 'Putuo', ',', 'Xuhui', ',', 'Yangpu', ',', 'Zhabei', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sport', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Basketball', ',', 'Boxing', ',', 'Cricket', ',', 'Golf', ',', 'Gymnastics', ',', 'Ice', 'hockey', ',', 'Olympics', ',', 'Rugby', 'union', ',', 'Soccer', ',', 'Tennis', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Feature', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Art', ',', 'City', 'Style', ',', 'Culture', 'and', 'history', ',', 'Expat', 'Tales', ',', 'Fashion', ',', 'Home', 'Deco', ',', 'Literature', ',', 'Music', ',', 'Stage', ',', 'Travel', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Chinese', 'perspectives', ',', 'Foreign', 'perspectives', ',', 'Shanghai', 'Daily', 'columnists', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sunday', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Animal', 'Planet', ',', 'Book', ',', 'City', 'Scene', ',', 'Cover', ',', 'Film', ',', 'Food', ',', 'Home', 'and', 'Deco', ',', 'Now', 'and', 'Then', ',', 'People', ',', 'Style', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Supplement', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Downloads', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'PDF', ',', 'eMagazine', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Gallery', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Photos', ',', 'Cartoons', ',', 'HD', 'Photo', 'Album', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Blogs', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Buzzword', 'and', 'Shanghai', 'Talk', ',', 'Word', 'on', 
'Street', ',', 'Team', 'Blog', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Services', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Subscribe,', 'Advertising', 'Info', ',', 'Contact', 'Us', ',', 'RSS', 'Center', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FEATURED', 'SITES', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Campus', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Learning', ',', 'Careers', ',', "Students'", 'Club', ',', 'Prize', 'English', ',', 'Sense', '&', 'Simplicity', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Mini', 'sites', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Undiscovered', 'Zhoushan', ',', 'Minhang', 'today', 'www.maicaipian.com', ',', 'Science', 'Podcasting', ',', 'Elegant', 'Rhythms', 'from', 'the', 'East', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '@', 'CONTACT', 'US', '|', '', 'BACK', 'TO', 'TOP', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', 'World', '', '', '', '', '', 'National', '', '', '', '', '', 'Business', '', '', '', '', '', 'Sports', '', '', '', '', '', 'Feature', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'About', 'Shanghai', 'Daily', '|', '', '', '', '', '', '', 'About', 'US', '5.0', 'New', '|', '', '', '', '', '', '', 'Advertising', '|', '', '', '', '', '', '', 'Term', 'of', 'Use', '|', '', '', '', '', '', '', 'RSS', '|', '', '', '', '', '', '', 'Privacy', 'Policy', '|', '', '', '', '', '', '', 'Contact', 'US', '|', '', '', '', '', '', '', 'Shanghai', 'World', 
'Expo', '', '', '', '', '', '', '', '', '', '', '\xe6\xb2\xaaICP\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaICP\xe5\xa4\x8705050403', '|', '\xe7\xbd\x91\xe7\xbb\x9c\xe8\xa7\x86\xe5\x90\xac\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a0909346', '|', '\xe5\xb9\xbf\xe6\x92\xad\xe7\x94\xb5\xe8\xa7\x86\xe8\x8a\x82\xe7\x9b\xae\xe5\x88\xb6\xe4\xbd\x9c\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaa\xe5\xad\x97\xe7\xac\xac354\xe5\x8f\xb7', '|', '\xe5\xa2\x9e\xe5\x80\xbc\xe7\x94\xb5\xe4\xbf\xa1\xe4\xb8\x9a\xe5\x8a\xa1\xe7\xbb\x8f\xe8\x90\xa5\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaB2-20120012', '', '', '', '', '', 'Copyright', '\xc2\xa9', '2001-', 'Shanghai', 'Daily', 'Publishing', 'House.', 'All', 'rights', 'reserved.']
>>> 
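The empty strings come from runs of consecutive whitespace. A small sketch of one way around that, for Python 3 and with a made-up function name (split_words): split on \s+ so each run counts as one separator, then drop any empty string left at the ends.

```python
import re

def split_words(text):
    """Split text on runs of whitespace and drop empty strings."""
    return [token for token in re.split(r"\s+", text) if token]

# Short excerpt in the same shape as the cleaned page text
raw_sample = "Yuan gains strength \r\n\r\n\t \r\n\t Mobile Version | \r\n\t\t\tSaturday, 25 May, 2013 "
print(split_words(raw_sample))
```

For plain whitespace splitting, raw_sample.split() gives the same list without a regular expression.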

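Or use findall() directly? Matching \w+ avoids the empty strings, but it also breaks up URLs, decimals such as 6.13, and words with apostrophes or hyphens, and it drops the non-ASCII text: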

>>> words2 = re.findall(r'\w+', raw)
>>> words2
['Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'Shanghai', 'Daily', 'English', 'Window', 'to', 'China', 'New', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'http', 'www', 'shanghaidaily', 'com', 'article', 'id', '531420', 'type', 'Business', 'Mobile', 'Version', 'Saturday', '25', 'May', '2013', 'Last', 'updated', '18', 'minutes', 'ago', 'Metro', 'Business', 'National', 'World', 'Sports', 'Feature', 'Opinion', 'V', 'IBE', 'i', 'DEAL', 'PDF', 'Gallery', 'RSS', 'MMS', 'Newspaper', 'Newsletter', 'Business', 'Economy', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'By', 'Feng', 'Jianmin', '2013', '5', '25', 'NEWSPAPER', 'EDITION', 'The', 'story', 'appears', 'on', 'Page', 'A7', 'May', '25', '2013', 'Free', 'for', 'subscribers', 'Shopping', 'Cart', 'Reading', 'Tools', 'Email', 'Story', 'Printable', 'View', 'Blog', 'Story', 'Copy', 'Headline', 'URL', 'Keywords', 'Financial', 'crisis', '3G', 'network', 'Shanghai', 'stock', 'market', 'Housing', 'price', 'Related', 'Stories', 'Huiyuan', 'to', 'buy', 'unit', 'from', 'its', 'chairman', '2013', '5', '25', 'Yuan', 'extends', 'rising', 'streak', '2013', '5', '4', '0', '07', '42', 'Yuan', 'band', 'widening', 'in', 'the', 'works', '2013', '4', '20', '0', '42', '09', 'Peng', 's', 'popularity', 'pushes', 'demand', 'for', 'home', 'br', '2013', '3', '30', '1', '18', '33', 'Peng', 'Liyuan', 'steals', 'hearts', 'on', 'first', 'trip', '2013', '3', '23', '1', '45', '11', 'Read', 'More', 'THE', 'Chinese', 'yuan', 'yesterday', 'strengthened', 'above', 'the', '6', '13', 'level', 'against', 'the', 'US', 'dollar', 'on', 'the', 'spot', 'market', 'for', 'the', 'first', 'time', 'in', '19', 'years', 'after', 'the', 'central', 'bank', 'set', 'a', 'record', 'reference', 'rate', 'for', 'the', 'currency', 'and', 'Premier', 'Li', 'Keqiang', 'reiterated', 'the', 'country', 'was', 'making', 'progress', 'in', 'opening', 'up', 'its', 'capital', 'account', 'The', 'yuan', 'closed', 'at', 
'6', '1316', 'per', 'dollar', 'in', 'Shanghai', 'yesterday', '0', '04', 'percent', 'stronger', 'than', 'Thursday', 'according', 'to', 'the', 'China', 'Foreign', 'Exchange', 'Trade', 'System', 'The', 'yuan', 'touched', 'an', 'intraday', 'high', 'of', '6', '1279', 'the', 'strongest', 'since', 'the', 'government', 'unified', 'the', 'official', 'and', 'market', 'rates', 'at', 'the', 'end', 'of', '1993', 'The', 'People', 's', 'Bank', 'of', 'China', 'raised', 'the', 'central', 'parity', 'rate', 'by', '0', '13', 'percent', 'to', '6', '1867', 'per', 'US', 'dollar', 'yesterday', 'before', 'the', 'market', 'opened', 'It', 'was', 'the', 'third', 'time', 'that', 'the', 'PBOC', 'had', 'raised', 'the', 'daily', 'fixing', 'to', 'a', 'record', 'in', 'a', 'week', 'guiding', 'the', 'market', 'rate', 'up', '0', '17', 'percent', 'from', 'May', '17', 'The', 'nation', 'is', 'steadily', 'pushing', 'forward', 'market', 'oriented', 'reforms', 'on', 'interest', 'rates', 'and', 'capital', 'account', 'convertibility', 'Premier', 'Li', 'said', 'in', 'a', 'signed', 'article', 'in', 'Neue', 'Zuricher', 'Zeitung', 'a', 'German', 'language', 'Swiss', 'newspaper', 'on', 'Thursday', 'Email', 'Story', 'Printable', 'View', 'Blog', 'Story', 'Copy', 'Headline', 'URL', 'News', 'text', 'News', 'title', 'Photo', 'captions', 'Live', 'in', 'Shanghai', 'Advanced', 'Search', 'Our', 'Products', 'Home', 'Delivery', 'Online', 'Account', 'Amazon', 'Kindle', 'iPhone', 'App', 'iPad', 'App', 'Blackberry', 'Phone', 'App', 'PlayBook', 'App', 'Android', 'App', 'Windows', 'Phone', 'App', 'MMS', 'Metro', 'Aging', 'Crime', 'and', 'public', 'security', 'Education', 'Health', 'Traffic', 'Urban', 'construction', 'Weather', 'Business', 'Banking', 'Energy', 'Foreign', 'investment', 'Insurance', 'Macro', 'economy', 'and', 'policy', 'Real', 'estate', 'Securities', 'National', 'World', 'Odd', 'Districts', 'Changning', 'Hongkou', 'Huangpu', 'Jing', 'an', 'Luwan', 'Minhang', 'Pudong', 'Putuo', 'Xuhui', 'Yangpu', 'Zhabei', 'Sport', 
'Basketball', 'Boxing', 'Cricket', 'Golf', 'Gymnastics', 'Ice', 'hockey', 'Olympics', 'Rugby', 'union', 'Soccer', 'Tennis', 'Feature', 'Art', 'City', 'Style', 'Culture', 'and', 'history', 'Expat', 'Tales', 'Fashion', 'Home', 'Deco', 'Literature', 'Music', 'Stage', 'Travel', 'Opinion', 'Chinese', 'perspectives', 'Foreign', 'perspectives', 'Shanghai', 'Daily', 'columnists', 'Sunday', 'Animal', 'Planet', 'Book', 'City', 'Scene', 'Cover', 'Film', 'Food', 'Home', 'and', 'Deco', 'Now', 'and', 'Then', 'People', 'Style', 'Supplement', 'Downloads', 'PDF', 'eMagazine', 'Gallery', 'Photos', 'Cartoons', 'HD', 'Photo', 'Album', 'Blogs', 'Buzzword', 'and', 'Shanghai', 'Talk', 'Word', 'on', 'Street', 'Team', 'Blog', 'Services', 'Subscribe', 'Advertising', 'Info', 'Contact', 'Us', 'RSS', 'Center', 'FEATURED', 'SITES', 'Campus', 'Learning', 'Careers', 'Students', 'Club', 'Prize', 'English', 'Sense', 'Simplicity', 'Mini', 'sites', 'Undiscovered', 'Zhoushan', 'Minhang', 'today', 'www', 'maicaipian', 'com', 'Science', 'Podcasting', 'Elegant', 'Rhythms', 'from', 'the', 'East', 'CONTACT', 'US', 'BACK', 'TO', 'TOP', 'Metro', 'World', 'National', 'Business', 'Sports', 'Feature', 'Opinion', 'About', 'Shanghai', 'Daily', 'About', 'US', '5', '0', 'New', 'Advertising', 'Term', 'of', 'Use', 'RSS', 'Privacy', 'Policy', 'Contact', 'US', 'Shanghai', 'World', 'Expo', 'ICP', 'ICP', '05050403', '0909346', '354', 'B2', '20120012', 'Copyright', '2001', 'Shanghai', 'Daily', 'Publishing', 'House', 'All', 'rights', 'reserved']
>>> 

This gives a better result than split(). Next, get a sorted list of the unique lowercase alphabetic words.

>>> sorted(set([w.lower() for w in words2 if w.isalpha()]))
['a', 'about', 'above', 'according', 'account', 'advanced', 'advertising', 'after', 'against', 'aging', 'ago', 'album', 'all', 'amazon', 'an', 'and', 'android', 'animal', 'app', 'appears', 'art', 'article', 'as', 'at', 'back', 'band', 'bank', 'banking', 'basketball', 'before', 'blackberry', 'blog', 'blogs', 'book', 'boxing', 'br', 'business', 'buy', 'buzzword', 'by', 'campus', 'capital', 'captions', 'careers', 'cart', 'cartoons', 'center', 'central', 'chairman', 'changning', 'china', 'chinese', 'city', 'closed', 'club', 'columnists', 'com', 'construction', 'contact', 'convertibility', 'copy', 'copyright', 'country', 'cover', 'cricket', 'crime', 'crisis', 'culture', 'currency', 'daily', 'deal', 'deco', 'delivery', 'demand', 'districts', 'dollar', 'downloads', 'east', 'economy', 'edition', 'education', 'elegant', 'emagazine', 'email', 'end', 'energy', 'english', 'estate', 'exchange', 'expat', 'expo', 'extends', 'fashion', 'feature', 'featured', 'feng', 'film', 'financial', 'first', 'fixing', 'food', 'for', 'foreign', 'forward', 'free', 'from', 'gains', 'gallery', 'german', 'golf', 'government', 'guiding', 'gymnastics', 'had', 'hd', 'headline', 'health', 'hearts', 'high', 'history', 'hockey', 'home', 'hongkou', 'house', 'housing', 'http', 'huangpu', 'huiyuan', 'i', 'ibe', 'ice', 'icp', 'id', 'in', 'info', 'insurance', 'interest', 'intraday', 'investment', 'ipad', 'iphone', 'is', 'it', 'its', 'jianmin', 'jing', 'keqiang', 'keywords', 'kindle', 'language', 'last', 'learning', 'level', 'li', 'literature', 'live', 'liyuan', 'luwan', 'macro', 'maicaipian', 'making', 'market', 'may', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'mobile', 'more', 'music', 'nation', 'national', 'network', 'neue', 'new', 'news', 'newsletter', 'newspaper', 'now', 'odd', 'of', 'official', 'olympics', 'on', 'online', 'opened', 'opening', 'opinion', 'oriented', 'our', 'page', 'parity', 'pboc', 'pdf', 'peng', 'people', 'per', 'percent', 'perspectives', 'phone', 'photo', 'photos', 'planet', 
'playbook', 'podcasting', 'policy', 'popularity', 'premier', 'price', 'printable', 'privacy', 'prize', 'products', 'progress', 'public', 'publishing', 'pudong', 'pushes', 'pushing', 'putuo', 'raised', 'rate', 'rates', 'read', 'reading', 'real', 'record', 'reference', 'reforms', 'reiterated', 'related', 'reserved', 'rhythms', 'rights', 'rising', 'rss', 'rugby', 's', 'said', 'saturday', 'scene', 'science', 'search', 'securities', 'security', 'sense', 'services', 'set', 'sets', 'shanghai', 'shanghaidaily', 'shopping', 'signed', 'simplicity', 'since', 'sites', 'soccer', 'sport', 'sports', 'spot', 'stage', 'steadily', 'steals', 'stock', 'stories', 'story', 'streak', 'street', 'strength', 'strengthened', 'stronger', 'strongest', 'students', 'style', 'subscribe', 'subscribers', 'sunday', 'supplement', 'swiss', 'system', 'tales', 'talk', 'team', 'tennis', 'term', 'text', 'than', 'that', 'the', 'then', 'third', 'thursday', 'time', 'title', 'to', 'today', 'tools', 'top', 'touched', 'trade', 'traffic', 'travel', 'trip', 'type', 'undiscovered', 'unified', 'union', 'unit', 'up', 'updated', 'urban', 'url', 'us', 'use', 'v', 'version', 'view', 'was', 'weather', 'week', 'widening', 'window', 'windows', 'word', 'works', 'world', 'www', 'xuhui', 'yangpu', 'years', 'yesterday', 'yuan', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 
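
To make the comparison with split() concrete, here is a minimal sketch on a made-up sentence (the sample text and variable names are my own, not taken from the article). It is written for Python 3, so print is a function, unlike the Python 2 sessions in these notes.

```python
import re

# Hypothetical sample sentence, loosely modelled on the article text.
sample = "The yuan closed at 6.1316 per dollar, the People's Bank said."

# Whitespace splitting keeps punctuation attached to the tokens ("dollar,").
split_tokens = sample.split()

# \w+ matches runs of letters, digits and underscores, so punctuation is
# dropped, but "6.1316" and "People's" are each broken into two tokens.
regex_tokens = re.findall(r'\w+', sample)

# Keep purely alphabetic tokens, lowercase them, deduplicate, and sort.
sorted_words = sorted(set(w.lower() for w in regex_tokens if w.isalpha()))

print(split_tokens)
print(regex_tokens)
print(sorted_words)
```

The regex version is cleaner for vocabulary lookups, at the cost of stray fragments such as 's' from possessives, which also show up in the real output above.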

Then remove the words that are included in nltk.corpus.words.words() (words_list is the sorted list from the previous step).

>>> [w for w in words_list if not w in nltk.corpus.words.words()]
['amazon', 'app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'chinese', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'english', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'rugby', 'saturday', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'sunday', 'thursday', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 

Then write a function based on the results above.

>>> def unknown(url):
...     raw = nltk.clean_html(urlopen(url).read())
...     words = re.findall(r'\w+', raw)
...     words_list = sorted(set([w.lower() for w in words if w.isalpha()]))
...     unknown_w = [w for w in words_list if not w in nltk.corpus.words.words()]
...     print unknown_w
... 

Let's test it.

>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/"
>>> unknown(url)
['amazon', 'app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'chinese', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'english', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'rugby', 'saturday', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'sunday', 'thursday', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 
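
The function above looks every word up in a list and calls nltk.corpus.words.words() inside the comprehension, which is slow. A possible variant, sketched below, builds a set once and takes the text and the vocabulary as arguments so that it can be tried without a network connection or the NLTK corpus. The name unknown_words and the toy vocabulary are assumptions for illustration only, and the code is written for Python 3.

```python
import re

def unknown_words(text, vocabulary):
    """Return the sorted lowercase alphabetic tokens of text not in vocabulary."""
    vocab = set(vocabulary)  # set membership tests are fast; built only once
    tokens = re.findall(r'\w+', text)
    candidates = sorted(set(w.lower() for w in tokens if w.isalpha()))
    return [w for w in candidates if w not in vocab]

# Toy stand-in for nltk.corpus.words.words(); the real list is far larger.
toy_vocab = ['the', 'yuan', 'gains', 'strength']
result = unknown_words("The yuan gains strength, says the PBOC blog.", toy_vocab)
print(result)
```

With the real data, the text would come from the cleaned HTML and the vocabulary from nltk.corpus.words.words(), as in the session above.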

Let's analyze the results...

Plurals: blogs, captions, careers, cartoons, columnists...
Inflected forms with -ed or -ing: opened, oriented, publishing...

These would be found in the vocabulary (nltk.corpus.words.words()) if they were reduced to their base forms by good stemming before the lookup.
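
As a rough illustration of that idea, the sketch below strips a few common suffixes and checks whether any resulting base form is in the vocabulary. This is a crude hand-written stand-in, not an NLTK stemmer; the function names and the toy vocabulary are hypothetical. A real stemmer such as nltk.PorterStemmer could be tried instead, although its stems are not always dictionary words.

```python
def candidate_bases(word):
    """Yield the word itself and a few crude guesses at its base form."""
    yield word
    for suffix in ('ing', 'ed', 'es', 's'):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[:-len(suffix)]
            yield stem        # e.g. opened -> open
            yield stem + 'e'  # e.g. rates -> rat + e -> rate

def known_after_stripping(word, vocab):
    """True if the word or one of its guessed base forms is in vocab."""
    return any(base in vocab for base in candidate_bases(word))

# Toy vocabulary standing in for nltk.corpus.words.words().
toy_vocab = {'open', 'orient', 'publish', 'caption', 'rate'}
unknown = ['opened', 'oriented', 'publishing', 'captions', 'rates', 'pudong']
still_unknown = [w for w in unknown if not known_after_stripping(w, toy_vocab)]
print(still_unknown)
```

Words such as blogs would still be reported, because the base form blog is itself missing from the vocabulary.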

Proper nouns: english, chinese, saturday, thursday...

I suspect these words are listed here as a side effect of converting all words to lowercase. If the first letter were capitalized again, these words would probably be found in the vocabulary.

Let me try this:

>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/"
>>> raw = nltk.clean_html(urlopen(url).read())
>>> words = re.findall(r'\w+', raw)
>>> words_list = sorted(set([w.lower() for w in words if w.isalpha()]))
>>> vocab_word = [w.lower() for w in nltk.corpus.words.words()]
>>> [w for w in words_list if not w in vocab_word]
['app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 

My assumption seems correct: words such as chinese, english, saturday, sunday and thursday are no longer listed. The vocabulary list should also be converted to lowercase before the lookup.

Chinese-specific words: huangpu, huiyuan, jianmin, hongkou...

These are personal names or place names in China. It is not surprising that they are missing from the vocabulary list.

New or internet-specific words: app, blog, buzzword, emagazine, email, online, expo...

One possibility is that the vocabulary list is not up to date, or that it is not appropriate for this kind of internet-based news article.