Slicing text (3.2.3-3.2.6)

Continuing from yesterday, starting at section 3.2.3 of the whale book.

Indexing and slicing work on strings as well as on lists. This was already covered in chapter 1.

>>> monty = 'Monty Python'
>>> print monty
Monty Python
>>> monty[0]
'M'
>>> monty[3]
't'
>>> monty[5]
' '
>>> monty[20]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> 

Negative indices are also accepted; they count backwards from the end of the string.

>>> monty[-1]
'n'
>>> monty[5]
' '
>>> monty[-7]
' '
>>> monty[-6:]
'Python'
>>> 
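The relationship between negative and positive indices can be spelled out as a small sketch. It is written in Python 3 syntax (unlike the Python 2 session above), and the names n and pairs are made up for illustration.

```python
monty = 'Monty Python'
n = len(monty)

# A negative index -k refers to the same character as index n - k.
pairs = [(-k, n - k, monty[-k]) for k in range(1, n + 1)]
for neg, pos, ch in pairs:
    print(neg, pos, repr(ch))
```

For this 12-character string, monty[-7] and monty[5] are therefore the same character, the space between the two words.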

The next example processes the characters of a string one at a time. It also compares the output with and without a trailing comma (,) at the end of the print statement.

>>> sent = 'colorless green ideas sleep furiously'
>>> for char in sent:
...     print char,
... 
c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
>>> for char in sent:
...     print char
... 
c
o
l
o
r
l
e
s
s
 
g
r
e
e
n
 
i
d
e
a
s
 
s
l
e
e
p
 
f
u
r
i
o
u
s
l
y
>>> 

With the trailing comma (,), print does not start a new line. Each character is written on the same line, right after the previous one and separated by a space.

Here is another example of processing characters one at a time: counting the frequency of each letter.
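The trailing comma is Python 2 syntax. As a rough sketch of the same idea in Python 3, where print is a function, the end argument replaces the default newline. The variable spaced is a made-up name that builds the same text as a single string.

```python
sent = 'colorless green ideas sleep furiously'

# Python 3: end=' ' writes a space instead of a newline after each item,
# which roughly mirrors the Python 2 trailing comma.
for char in sent:
    print(char, end=' ')
print()  # finish the line

# The same spaced-out text built as one string.
spaced = ' '.join(sent)
print(spaced)
```

Each space in the original sentence shows up as three spaces in the output: the space itself plus one separator on either side.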

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw('melville-moby_dick.txt')
>>> fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
>>> fdist.keys()
['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w', 'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']
>>> fdist.plot()

[Figure 1: letter frequency plot produced by fdist.plot()]
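The same character-by-character counting can be tried without NLTK or the Gutenberg corpus. This is a minimal sketch that uses collections.Counter from the standard library instead of nltk.FreqDist; the short sample string is an assumption for illustration and stands in for the full text.

```python
from collections import Counter

# A short stand-in for the raw corpus text.
sample = 'Call me Ishmael.'

# Keep only letters and fold them to lower case before counting.
counts = Counter(ch.lower() for ch in sample if ch.isalpha())

# most_common() lists (letter, count) pairs from most to least frequent.
print(counts.most_common())
```

The filter ch.isalpha() drops the spaces and the full stop, so only the 13 letters are counted.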

Slices take a range of indices. The tricky part is that a slice runs from the start index up to, but not including, the end index.

>>> monty[6:10]
'Pyth'
>>> monty[-12:-7]
'Monty'
>>> monty[:5]
'Monty'
>>> monty[6:]
'Python'

For example, monty[6:10] extracts the characters from monty[6] through monty[9]; monty[10] is not included. The result is therefore 'Pyth', not 'Pytho', as shown above. If the start of the range is omitted, the slice begins at the first character (index 0). If the end of the range is omitted, the slice runs to the end of the string.
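The half-open range can be checked with a short sketch. It is written in Python 3 syntax, and the names piece, by_index, k and rejoined are made up for illustration.

```python
monty = 'Monty Python'

start, end = 6, 10
piece = monty[start:end]

# The slice covers indices start .. end - 1, so its length is end - start.
by_index = ''.join(monty[i] for i in range(start, end))
print(piece, by_index, len(piece))

# Because the end index is excluded, splitting at any position k and
# joining the two halves reproduces the original string.
k = 5
rejoined = monty[:k] + monty[k:]
print(rejoined == monty)
```

Excluding the end index is what makes monty[:k] and monty[k:] fit together without overlapping or dropping a character.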

The in operator is useful for checking whether a string contains specific characters (or words).

>>> phrase = 'And now for something completely different'
>>> if 'thing' in phrase:
...     print 'found "thing"'
... 
found "thing"
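One detail worth noting: on a string, in tests for a substring, not for a whole word. To test for a whole word, the string can first be split into a list of words. A minimal sketch, with made-up variable names:

```python
phrase = 'And now for something completely different'

# Substring test: 'thing' occurs inside the word 'something'.
as_substring = 'thing' in phrase

# Word test: split on whitespace and look for an exact list element.
words = phrase.split()
as_word = 'thing' in words
whole_word = 'something' in words

print(as_substring, as_word, whole_word)
```

Here the substring test succeeds, while the word test for 'thing' fails because the list only contains 'something'.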

The find() method returns the index at which the given substring first occurs.

>>> monty.find('Python')
6
>>> monty[6:12]
'Python'
>>> 
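The index returned by find() can be combined with the length of the search string to slice the match back out. When the substring is absent, find() returns -1, whereas the related index() method raises ValueError. A small sketch, with 'Spam' chosen as an arbitrary string that does not occur:

```python
monty = 'Monty Python'
word = 'Python'

# Slice from the match position to the match position plus its length.
pos = monty.find(word)
extracted = monty[pos:pos + len(word)]

# find() signals "not found" with -1 ...
missing = monty.find('Spam')

# ... while index() raises ValueError for the same situation.
index_raised = False
try:
    monty.index('Spam')
except ValueError:
    index_raised = True

print(pos, extracted, missing, index_raised)
```

Since -1 is also a valid index (the last character), the result of find() should be checked before it is used for indexing.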

To view the built-in documentation for strings:

>>> help(str)

Differences between lists and strings (3.2.6):

>>> query = 'Who knows?'
>>> beatles = ['john', 'paul', 'george', 'ringo']
>>> query[2]
'o'
>>> beatles[2]
'george'
>>> query[:2]
'Wh'
>>> beatles[:2]
['john', 'paul']
>>> query + "I don't"
"Who knows?I don't"
>>> beatles + 'brian'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "str") to list
>>> beatles + ['brian']
['john', 'paul', 'george', 'ringo', 'brian']
>>>
>>> beatles[0] = "John Lennon"
>>> del beatles[-1]
>>> beatles
['John Lennon', 'paul', 'george']
>>> query[0] = 'F'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>
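Lists can be changed in place, but strings cannot. One common way around this, sketched here as an assumption and not taken from the book, is to build a new string from slices of the old one. The name changed is made up for illustration.

```python
query = 'Who knows?'

# Strings are immutable, so build a new string:
# the replacement character followed by everything after index 0.
changed = 'F' + query[1:]

print(changed)
print(query)  # the original string is left untouched
```

The same slicing rules apply as before: query[1:] runs from index 1 to the end of the string.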