Exercise: Chapter 3 (7-9)

7.

>>> nltk.re_show(r'\b(a|an|the)\b', 'brian a then an the man')
brian {a} then {an} {the} man

The key point, I think, is the word-boundary anchor \b: it keeps the articles from matching inside words like 'brian' and 'man'.
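
Dropping the anchors shows the difference; without \b the same pattern should print something like:

>>> nltk.re_show(r'(a|an|the)', 'brian a then an the man')
bri{a}n {a} {the}n {a}n {the} m{a}n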

8.

>>> import urllib
>>> def cleantags(url):
...     raw_contents = urllib.urlopen(url).read()
...     return nltk.clean_html(raw_contents)
... 
>>> cleantags('http://www.nltk.org')
'Natural Language Toolkit — NLTK 2.0 documentation \n \n  \n \n \n \n \n \n \n  \n  \n \n \n  \n  \n
   NLTK 2.0 documentation \n   \n   next |\n   modules |\n   index \n   \n  \n  \n\n  \n  \n   \n   \n  \n   \n   \n   \n \n
 Natural Language Toolkit \xc2\xb6 \n
 NLTK is a leading platform for building Python programs to work with
 human language data.\nIt provides easy-to-use interfaces to over 50
 corpora and lexical resources such as WordNet,\nalong with a suite of
 text processing libraries for classification, tokenization, stemming,
 tagging, parsing, and semantic reasoning. \n
 Thanks to a hands-on guide introducing programming fundamentals
 alongside topics in computational linguistics,\nNLTK is suitable for
 linguists, engineers, students, educators, researchers, and industry
 users alike.\nNLTK is available for Windows, Mac OS X, and Linux.
 Best of all, NLTK is a free, open source, community-driven project. \n
 NLTK has been called “a wonderful tool for teaching, and working in,
 computational linguistics using Python,”\nand “an amazing library to
 play with natural language.” \n
 Natural Language Processing with Python provides a practical\nintroduction
 to programming for language processing.\nWritten by the creators of NLTK,
 it guides the reader through the fundamentals\nof writing Python programs,
 working with corpora, categorizing text, analyzing linguistic
 structure,\nand more. \n \n
 Some simple things you can do with NLTK \xc2\xb6 \n
 Tokenize and tag some text: \n
 >>> import nltk \n
 >>> sentence = """At eight o'clock on Thursday morning \n
 ... Arthur didn't feel very good.""" \n
 >>> tokens = nltk . word_tokenize ( sentence ) \n
 >>> tokens \n
 ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', \n
 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] \n
 >>> tagged = nltk . pos_tag ( tokens ) \n
 >>> tagged [ 0 : 6 ] \n
 [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), \n
 ('Thursday', 'NNP'), ('morning', 'NN')] \n \n \n
 Identify named entities: \n
 >>> entities = nltk . chunk . ne_chunk ( tagged ) \n
 >>> entities \n
 Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), \n
   ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), \n
  Tree('PERSON', [('Arthur', 'NNP')]), \n
   ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), \n
   ('very', 'RB'), ('good', 'JJ'), ('.', '.')]) \n \n \n
 Display a parse tree: \n \n
 NB. If you publish work that uses NLTK, please cite the NLTK book as
 follows:\nBird, Steven, Edward Loper and Ewan Klein (2009).\nNatural
 Language Processing with Python. O’Reilly Media Inc. \n \n \n
 Links \xc2\xb6 \n \n
 NLTK mailing list - release announcements only, very low volume \n
 NLTK-Users mailing list - user discussions \n
 NLTK-Dev mailing list - developers only \n
 NLTK-Translation mailing list - discussions about translating the NLTK book \n
 NLTK’s previous website \n
 NLTK development at GitHub \n
 Publications about NLTK \n \n \n \n \n
 Contents \xc2\xb6 \n \n \n
 NLTK News \n Installing NLTK \n Installing NLTK Data \n nltk Package \n
 Team NLTK \n \n \n \n
 Index \n Module Index \n Search Page \n \n \n\n\n   \n   \n  \n   \n   \n
   Table Of Contents \n   \n
 NLTK News \n Installing NLTK \n Installing NLTK Data \n nltk Package \n
 Team NLTK \n \n\n
   Search \n   \n    \n    \n    \n    \n   \n   \n
   Enter search terms or a module, class or function name.\n   \n   \n   \n  \n  \n\n  \n  \n   \n
   next |\n   modules |\n   index \n    \n    Show Source \n   \n\n   \n   \n  \n
  © Copyright 2012, NLTK Project.\n  Created using Sphinx 1.1.3.'
>>> 

Note: I inserted line breaks into the output above in order to display the entire result.
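
Aside: nltk.clean_html() was removed in NLTK 3, which directs users to an HTML parser such as BeautifulSoup instead. A rough equivalent for Python 3 (assuming the bs4 package is installed) might look like this:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> def cleantags(url):
...     raw_contents = urlopen(url).read()
...     return BeautifulSoup(raw_contents, 'html.parser').get_text()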

9-a.

>>> pnct_pattern = r'''(?x)     #set flag to allow verbose regexps
...     \.      #full stop
...     |,      #comma
...     |:      #colon
...     |;      #semicolon
...     |-      #dash
...     |/      #slash
...     |\?     #question
...     |\.\.\  #ellipsis
...     |[()]   #brackets
... '''

This contains some mistakes. The full stop (.) should not come first: re tries alternatives from left to right, so a leading full stop matches each dot of an ellipsis separately before the ellipsis branch is ever reached. I also missed the third dot of the ellipsis (the pattern reads \.\.\ instead of \.\.\.).
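
A minimal check of the ordering problem, using the two dot branches on their own:

>>> import re
>>> re.findall(r'\.|\.\.\.', 'so on...')
['.', '.', '.']
>>> re.findall(r'\.\.\.|\.', 'so on...')
['...']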

Define load(f).

>>> def load(f):
...     ofile = open(f)
...     raw = ofile.read()
...     ofile.close()
...     return raw

Try nltk.regexp_tokenize().

>>> load('corpus.txt')
"Hello! My name is Ken Xu. Where are you from? I was born in (eastern part of) Japan. Every morning I go to Gym and ride on aero-bike, running machine or so on... It's my shame my writing english is not so good as yours. "
>>> text = load('corpus.txt')
>>> nltk.regexp_tokenize(text, pnct_pattern)
['.', '?', '(', ')', '.', '-', ',', '.', '.', '.', '.']

Revise the mistakes, then try again.

>>> pnct_pattern = r'''(?x)     #set flag to allow verbose regexps
...     ,       #comma
...     |[()]   #brackets
...     |\.\.\. #ellipsis
...     |\.     #full stop
...     |\?     #question
...     |/      #slash
...     |-      #dash
...     |;      #semicolon
...     |:      #colon
... '''
>>> nltk.regexp_tokenize(text, pnct_pattern)
['.', '?', '(', ')', '.', '-', ',', '...', '.']

9-b.

corpus.txt needs to be revised so that it also contains dates, amounts of money, and an organization name.

>>> text
"Hello! My name is Ken Xu. Where are you from? I was born in (eastern part of) Japan. Every morning I go to Gym and ride on aero-bike, running machine or so on... It's my shame my writing english is not so good as yours. \n\nOne of interesting thing is date format is different by regions. For example, we usually use this format in my country.\n\n2013/05/24\n\nBut I found the format was like 24.05.2013 in the ABC SYSTEMS. I heard this format is very popular in European countries. We can see similar situation in amount format. \n\nUSD12,345.67\nEUR12.345,67\nUSD123.45\nJPY123\n\n"

Amount:

>>> amnt_pattern = r'''(?x)
...     USD[\d,]+(\.\d\d)?              #USD12,345.67
...     |EUR[\d.]+(,\d\d)?              #EUR12.345,67
...     |JPY\d+                         #JPY123
... '''
>>> text = load('corpus.txt')
>>> nltk.regexp_tokenize(text, amnt_pattern)
['USD12,345.67', 'EUR12.345,67', 'USD123.45', 'JPY123']
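
The matches are still strings. As an aside (not part of the exercise), a small helper can normalize them to numbers, assuming EUR uses '.' for thousands and ',' for decimals while USD uses the opposite convention:

>>> def to_number(amount):
...     code, digits = amount[:3], amount[3:]
...     if code == 'EUR':
...         digits = digits.replace('.', '').replace(',', '.')
...     else:
...         digits = digits.replace(',', '')
...     return code, float(digits)
... 
>>> to_number('EUR12.345,67')
('EUR', 12345.67)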

Date:

>>> date_pattern = r'''(?x)
...     \d{2}\.\d{2}\.\d{4}             #DD.MM.YYYY
...     |\d{4}/\d{2}/\d{2}              #YYYY/MM/DD
... '''
>>> nltk.regexp_tokenize(text, date_pattern)
['2013/05/24', '24.05.2013']
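
Both strings denote the same day, which the standard library can confirm (an aside, not part of the exercise):

>>> from datetime import datetime
>>> datetime.strptime('24.05.2013', '%d.%m.%Y') == datetime.strptime('2013/05/24', '%Y/%m/%d')
True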

Name:

>>> name_pattern = r'''(?x)
...     [A-Z][a-z]+\s[A-Z][a-z]+
... '''
>>> nltk.regexp_tokenize(text, name_pattern)
['Ken Xu']

Organization:

As a working assumption, I defined an organization name as one written entirely in upper case.

>>> orgz_pattern = r'''(?x)
...     [A-Z]+(\s[A-Z])?
... '''
>>> nltk.regexp_tokenize(text, orgz_pattern)
['H', 'M', 'K', 'X', 'W', 'I', 'J', 'E', 'I', 'G', 'I', 'O', 'F', 'B', 'I', 'ABC S', 'YSTEMS', 'I', 'E', 'W', 'USD', 'EUR', 'USD', 'JPY']
>>> orgz_pattern = r'''(?x)
...     [A-Z]+\b(\s[A-Z]+)?
... '''
>>> nltk.regexp_tokenize(text, orgz_pattern)
['I', 'I', 'I', 'ABC SYSTEMS', 'I']
>>> orgz_pattern = r'''(?x)
...     [A-Z]{2,}\b(\s[A-Z]+)?
... '''
>>> nltk.regexp_tokenize(text, orgz_pattern)
['ABC SYSTEMS']
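
The final pattern allows at most two words. A generalization for longer all-caps names (untested beyond this corpus) repeats the second part; the trailing \b still keeps currency codes like USD12,345.67 from matching:

>>> orgz_pattern = r'''(?x)
...     [A-Z]{2,}\b(?:\s[A-Z]{2,}\b)*
... '''
>>> nltk.regexp_tokenize(text, orgz_pattern)
['ABC SYSTEMS']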