O'Reilly: Chapter 1 Exercise 17-22

Continuing the Chapter 1 exercises...

17.

>>> text9.index('sunset')
629
>>> text9[620:630]
['PARK', 'THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset']
>>> text9[620:635]
['PARK', 'THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset', 'side', 'of', 'London', ',', 'as']
>>> text9[620:640]
['PARK', 'THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset', 'side', 'of', 'London', ',', 'as', 'red', 'and', 'ragged', 'as', 'a']
>>> text9[620:645]
['PARK', 'THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset', 'side', 'of', 'London', ',', 'as', 'red', 'and', 'ragged', 'as', 'a', 'cloud', 'of', 'sunset', '.', 'It']
>>> text9[615:644]
['THE', 'TWO', 'POETS', 'OF', 'SAFFRON', 'PARK', 'THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset', 'side', 'of', 'London', ',', 'as', 'red', 'and', 'ragged', 'as', 'a', 'cloud', 'of', 'sunset', '.']
>>> text9[610:644]
['.', 'C', '.', 'CHAPTER', 'I', 'THE', 'TWO', 'POETS', 'OF', 'SAFFRON', 'PARK', 'THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset', 'side', 'of', 'London', ',', 'as', 'red', 'and', 'ragged', 'as', 'a', 'cloud', 'of', 'sunset', '.']

First I got the index of 'sunset' in text9 (629), then tried to capture the whole sentence by slicing. A period (.) turned up at text9[643] (the last element of the slice text9[615:644]), and I found the beginning of the sentence by adjusting the start of the slice.

The sentence found is:

THE suburb of Saffron Park lay on the sunset side of London, as red and ragged as a cloud of sunset.
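Instead of guessing slice bounds by hand, the sentence boundaries can be found by walking outwards from the hit until a period is reached. A minimal sketch, using a toy token list in place of text9 (the real text comes from nltk.book):

```python
# Toy token list standing in for text9.
tokens = ['PARK', '.', 'The', 'suburb', 'of', 'Saffron', 'Park', 'lay',
          'on', 'the', 'sunset', 'side', 'of', 'London', '.', 'It']

hit = tokens.index('sunset')            # first occurrence, like text9.index

# Walk backwards to the token after the previous period (or the start),
# and forwards to the next period.
start = hit
while start > 0 and tokens[start - 1] != '.':
    start -= 1
end = hit
while end < len(tokens) and tokens[end] != '.':
    end += 1

sentence = tokens[start:end + 1]        # include the closing period
print(' '.join(sentence))
```

Run on the real text9, the same walk recovers the slice text9[631:644] found above by trial and error.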

18.

>>> sentset = sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8 + sent9
>>> len(sentset)
122
>>> len(set(sentset))
87
>>> sorted(set(sentset))
['!', ',', '-', '.', '1', '25', '29', '61', ':', 'ARTHUR', 'Call', 'Citizens', 'Dashwood', 'Fellow', 'God', 'House', 'I', 'In', 'Ishmael', 'JOIN', 'KING', 'London', 'MALE', 'Nov.', 'PMing', 'Park', 'Pierre', 'Representatives', 'SCENE', 'SEXY', 'Saffron', 'Senate', 'Sussex', 'THE', 'The', 'Vinken', 'Whoa', '[', ']', 'a', 'and', 'as', 'attrac', 'been', 'beginning', 'board', 'clop', 'cloud', 'created', 'director', 'discreet', 'earth', 'encounters', 'family', 'for', 'had', 'have', 'heaven', 'in', 'join', 'lady', 'lay', 'lol', 'long', 'me', 'nonexecutive', 'of', 'old', 'older', 'on', 'people', 'problem', 'ragged', 'red', 'seeks', 'settled', 'side', 'single', 'suburb', 'sunset', 'the', 'there', 'to', 'will', 'wind', 'with', 'years']

Concatenating sent1 through sent9 gives 87 distinct tokens. (Note: the original question in the textbook covers sent1 through sent8.)
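The same concatenate-then-deduplicate pattern can be shown with toy "sentences" (plain token lists standing in for NLTK's sent1 .. sent9):

```python
# Toy sentences; list concatenation with + keeps duplicates.
sent_a = ['Call', 'me', 'Ishmael', '.']
sent_b = ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', '.']
sent_c = ['Call', 'me', '.']

combined = sent_a + sent_b + sent_c   # 16 tokens, duplicates included
vocab = sorted(set(combined))         # set() removes duplicates, sorted() orders

print(len(combined), len(set(combined)))
```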

19.

>>> sorted(set([w.lower() for w in text1]))
['!', '!"', '!"--', "!'", '!\'"', '!)', '!)"', '!*', '!--', '!--"', "!--'", '"', '"\'', '"--', '"...', '";', '$', '&', "'",...

>>> sorted(set([w.lower() for w in set(text1)]))
['!', '!"', '!"--', "!'", '!\'"', '!)', '!)"', '!*', '!--', '!--"', "!--'", '"', '"\'', '"--', '"...', '";', '$', '&', "'", "',", "',--",...

>>> len(sorted(set([w.lower() for w in set(text1)])))
17231
>>> len(sorted(set([w.lower() for w in text1])))
17231

As very long lists were returned, I counted their lengths with len(). They are the same. The second expression removes exact duplicates with set() before converting to lowercase; however, set() is also applied after the lowercasing, which merges case variants either way, so the two expressions produce the same result.
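A toy check of that equivalence: the outer set() merges case variants in both expressions, so the inner set() only discards exact duplicates early.

```python
# Toy word list with exact duplicates and case variants.
words = ['The', 'the', 'THE', 'cat', 'Cat']

lower_then_set = sorted(set(w.lower() for w in words))        # as in text1 example
set_then_lower = sorted(set(w.lower() for w in set(words)))   # inner set() first

print(lower_then_set)
print(set_then_lower)
```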

To make sure, check with other texts.

>>> len(sorted(set([w.lower() for w in text2])))
6403
>>> len(sorted(set([w.lower() for w in set(text2)])))
6403

>>> len(sorted(set([w.lower() for w in set(text3)])))
2628
>>> len(sorted(set([w.lower() for w in text3])))
2628

20.

>>> [w for w in text1 if w.isupper()]
['ETYMOLOGY', 'I', 'H', 'HACKLUYT', 'WHALE', 'HVAL', 'HVALT', 'WEBSTER', 'S', 'DICTIONARY', 'WHALE', 'WALLEN', 'A', 'S', 'WALW', 'IAN', 'RICHARDSON', 'S', 'DICTIONARY', 'KETOS', 'GREEK', 'CETUS', 'LATIN', 'WHOEL', 'ANGLO', 'SAXON', 'HVALT', 'DANISH', 'WAL', 'DUTCH', 'HWAL', 'SWEDISH', 'WHALE', 'ICELANDIC', 'WHALE', 'ENGLISH', 'BALEINE', 'FRENCH', 'BALLENA',...

>>> [w for w in text1 if not w.islower()]
['[', 'Moby', 'Dick', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'Late', 'Consumptive', 'Usher', 'Grammar', 'School', ')', 'The', 'Usher',...

w.isupper() extracts tokens consisting entirely of uppercase letters. On the other hand, not w.islower() extracts tokens that are not entirely lowercase; this also includes punctuation and numbers, because islower() returns False for tokens with no cased characters.
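A small demonstration of why the two filters differ, using a toy token list in place of text1:

```python
# str.isupper() and str.islower() both return False for tokens with no
# cased characters, such as '.' and '1851'.
tokens = ['WHALE', 'Moby', 'whale', '.', '1851']

uppercase_only = [w for w in tokens if w.isupper()]    # fully uppercase tokens
not_lowercase = [w for w in tokens if not w.islower()] # anything not fully lowercase

print(uppercase_only)
print(not_lowercase)
```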

21.

>>> text2[-2:]
['THE', 'END']
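Negative indices count from the end of the list, so [-2:] always takes the last two tokens regardless of the text's length. A toy list in place of text2:

```python
# Negative slice: tokens[-2:] is the last two elements.
tokens = ['and', 'they', 'lived', 'happily', 'THE', 'END']
last_two = tokens[-2:]
print(last_two)
```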

22. Show only the top 50, as fdist1 contains 1,181 distinct four-letter words (10,204 occurrences in total).

>>> fdist1 = FreqDist([w for w in text5 if len(w) == 4])
>>> fdist1
<FreqDist with 1181 samples and 10204 outcomes>
>>> vocab1 = fdist1.keys()
>>> vocab1[:50]
['JOIN', 'PART', 'that', 'what', 'here', '....', 'have', 'like', 'with', 'chat', 'your', 'good', 'just', 'lmao', 'know', 'room', 'from', 'this', 'well', 'back', 'hiya', 'they', 'dont', 'yeah', 'want', 'love', 'guys', 'some', 'been', 'talk', 'nice', 'time', 'when', 'haha', 'make', 'girl', 'need', 'U122', 'MODE', 'much', 'then', 'will', 'over', 'were', 'work', 'take', 'U115', 'U121', 'song', 'U105']

Let's count frequency of some words.

>>> fdist1['JOIN']
1021
>>> fdist1['song']
36
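The same workflow can be sketched with the standard library's collections.Counter as a stand-in for nltk.FreqDist, using a toy token list in place of text5:

```python
from collections import Counter

# Toy chat tokens; count only the four-letter ones.
tokens = ['JOIN', 'PART', 'JOIN', 'chat', 'hi', 'JOIN', 'song', 'chat']
fdist = Counter(w for w in tokens if len(w) == 4)

print(fdist.most_common(2))   # most frequent four-letter words first
print(fdist['JOIN'])          # frequency lookup, like fdist1['JOIN']
```

Counter.most_common() gives the frequency-sorted listing that older NLTK versions returned from FreqDist.keys().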