Chunking (7.2)
Now I jump to Chapter 7.2.
Noun Phrase Chunking (7.2.1)
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
... ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
>>> result.draw()
What I need to understand here is the meaning of the grammar. It looks like a kind of regular expression written over tags. A question mark ('?') comes after <DT>.
The parser found two NPs, "the little yellow dog" and "the cat". This was slightly different from what I expected, so let me try some other patterns.
>>> sentence = [("little", "JJ"), ("yellow", "JJ"),
... ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S (NP little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
>>>
>>> sentence = [("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S (NP dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
OK. We can say both
Chunking with Regular Expressions (7.2.3)
>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}
... {<NNP>+}
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
... ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S (NP Rapunzel/NNP) let/VBD down/RP (NP her/PP$ long/JJ golden/JJ hair/NN))
Now I realize that I had misunderstood the meaning of '?' and '*' in regular expressions: '?' means zero or one occurrence, and '*' means zero or more. This example has two patterns. The first is similar to the previous example, but PP$ (a possessive pronoun such as "her") can appear at the beginning in place of DT. The second pattern, <NNP>+, matches one or more proper nouns.
Here is one more example.
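To check my reading of '?' and '*', here is a minimal sketch that imitates the first rule with Python's standard `re` module instead of NLTK. It is only an analogue: the helper names `tag_string` and `chunk_words` are my own, the tags are encoded as a plain string of `<TAG>` tokens, and the second rule (`<NNP>+`) is left out, so "Rapunzel" is not chunked here.

```python
import re

def tag_string(sentence):
    """Encode a tagged sentence as one string of <TAG> tokens (hypothetical helper)."""
    return "".join("<%s>" % tag for _, tag in sentence)

# Plain-regex analogue of the tag pattern <DT|PP\$>?<JJ>*<NN>:
#   (?:<DT>|<PP\$>)?  zero or one determiner or possessive pronoun
#   (?:<JJ>)*         zero or more adjectives
#   <NN>              exactly one noun
NP_RULE = re.compile(r"(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>")

def chunk_words(sentence, pattern=NP_RULE):
    """Return the words of each non-overlapping match, scanning left to right."""
    tags = tag_string(sentence)
    chunks = []
    for m in pattern.finditer(tags):
        start = tags[:m.start()].count("<")   # token index where the match begins
        length = m.group().count("<")         # number of tokens inside the match
        chunks.append([word for word, _ in sentence[start:start + length]])
    return chunks

rapunzel = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
dog = [("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

print(chunk_words(rapunzel))  # [['her', 'long', 'golden', 'hair']]
print(chunk_words(dog))       # [['dog'], ['the', 'cat']]
```

Because every part before `<NN>` is optional, a bare noun such as "dog" is already a complete match, which is why it forms an NP on its own.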
>>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
>>> grammar = "NP: {<NN><NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)
>>> grammar2 = "NP: {<NN>+}"
>>> cp = nltk.RegexpParser(grammar2)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN fund/NN))
There are two grammars here as well. The first one (grammar) requires exactly two consecutive NN tags. As a result, the third NN (fund) is left outside the NP. The second one (grammar2) puts no upper limit on the repetition, so all three NNs end up inside the NP.
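The same difference can be reproduced with ordinary regular expressions. This is a rough sketch using the standard `re` module on a hand-written tag string, not NLTK itself; `matched_runs` is a name I made up for illustration. Like the chunker output above, `re.finditer` takes the leftmost match first and does not reuse tokens that were already matched.

```python
import re

# Tags for "money market fund", written as a string of <TAG> tokens
tags = "<NN><NN><NN>"

TWO_NOUNS = re.compile(r"<NN><NN>")      # exactly two nouns in a row
ONE_OR_MORE = re.compile(r"(?:<NN>)+")   # one or more nouns, as many as possible

def matched_runs(pattern, tag_str):
    """Return each non-overlapping match, scanning left to right."""
    return [m.group() for m in pattern.finditer(tag_str)]

print(matched_runs(TWO_NOUNS, tags))    # ['<NN><NN>']  (the third <NN> is left over)
print(matched_runs(ONE_OR_MORE, tags))  # ['<NN><NN><NN>']
```

With the two-noun pattern the first two tags are consumed together, so the remaining single `<NN>` cannot form another match on its own.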