Chunking (7.2)

Now I'll jump ahead to section 7.2.

Noun Phrase Chunking (7.2.1)

>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
... ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
>>> result.draw()

What I need to understand here is the meaning of grammar, which appears to be a kind of regular expression over tags. The question mark ('?') after <DT> means that <DT> (determiner) is optional. I read the asterisk ('*') after <JJ> as requiring at least one JJ (adjective, I believe) before NN (noun).

The result is that it found two NPs, "the little yellow dog" and "the cat". This was slightly different from what I expected, so let me try some other patterns.

>>> sentence = [("little", "JJ"), ("yellow", "JJ"),
... ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
>>>
>>> sentence = [("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S (NP dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
>>> 

OK. We can say that both <DT> and <JJ> are optional.

Chunking with Regular Expressions (7.2.3)

>>> grammar = r"""
...     NP: {<DT|PP\$>?<JJ>*<NN>}
...         {<NNP>+}
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
...     ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

Now I realize that I had misunderstood the meaning of '?' and '*' in regular expressions: '?' means zero or one occurrence, and '*' means zero or more. This example has two patterns. The first is similar to the previous example, except that the chunk may begin with PP$ (possessive pronoun) as well as DT. The second, <NNP>+, matches one or more proper nouns.
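
To convince myself about '?' and '*', here is a small sketch that uses only Python's standard re module and does not need NLTK. The helpers to_regex and matches are names I made up for illustration, and the idea of writing a tag sequence as a string like "<DT><JJ><NN>" is my own simplification, so please do not read it as a description of how RegexpParser works internally.

```python
import re

def to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT>?<JJ>*<NN>' into a compiled regex."""
    # Wrap every <...> in a non-capturing group so that ?, * and +
    # apply to the whole tag rather than to the closing '>'.
    body = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    return re.compile(body)

def matches(tag_pattern, tags):
    """Return True if the whole list of tags fits the tag pattern."""
    # Encode the tags as one string, e.g. ['DT', 'NN'] -> '<DT><NN>'.
    encoded = "".join("<%s>" % tag for tag in tags)
    return to_regex(tag_pattern).fullmatch(encoded) is not None

pattern = r"<DT|PP\$>?<JJ>*<NN>"
print(matches(pattern, ["NN"]))                     # no DT/PP$ and no JJ
print(matches(pattern, ["DT", "NN"]))               # one DT, zero JJ
print(matches(pattern, ["PP$", "JJ", "JJ", "NN"]))  # PP$ instead of DT, two JJ
print(matches(pattern, ["DT", "DT", "NN"]))         # two DT: '?' allows at most one
```

The first three calls print True and the last one prints False, which agrees with reading '?' as "zero or one" and '*' as "zero or more".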

Here is one more example.

>>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
>>> grammar = "NP: {<NN><NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)
>>> grammar2 = "NP: {<NN>+}"
>>> cp = nltk.RegexpParser(grammar2)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN fund/NN))

There are two grammars here. The first one (grammar) requires exactly two consecutive NNs, so the third NN (fund) is left outside the NP. The second one (grammar2) puts no upper limit on the repetition, so all three NNs end up in the same NP.
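
The same difference can be reproduced with plain regular expressions. The sketch below is a simplified, standard-library-only imitation of chunking, not NLTK's actual code: np_chunks is a hypothetical helper that encodes the tags as a string, finds the leftmost non-overlapping matches, and maps them back to words.

```python
import re

def np_chunks(tagged, tag_pattern):
    """Return the words of each leftmost, non-overlapping match of tag_pattern."""
    # Wrap every <...> in a group so that quantifiers apply to whole tags.
    regex = re.compile(tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)"))
    encoded = "".join("<%s>" % tag for _, tag in tagged)

    # Record where each token starts in the encoded string, so that
    # match offsets can be mapped back to token indices.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # two extra characters for '<' and '>'
    offset_to_index[offset] = len(tagged)

    chunks = []
    for match in regex.finditer(encoded):
        if match.start() == match.end():
            continue  # ignore empty matches
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
print(np_chunks(nouns, "<NN><NN>"))  # exactly two nouns per chunk
print(np_chunks(nouns, "<NN>+"))     # one or more nouns per chunk
```

With <NN><NN> only "money market" is grouped and "fund" is left over, while <NN>+ takes all three words as one chunk, just like the two NLTK results above.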