Slicing text (3.2.3-3.2.6) - Deutschina's Tech Diary

Continuing from yesterday, starting at section 3.2.3 of the whale book.

Indexing and slicing work not only on lists but also on strings. This was already covered in chapter 1.

>>> monty = 'Monty Python'
>>> print monty
Monty Python
>>> monty[0]
'M'
>>> monty[3]
't'
>>> monty[5]
' '
>>> monty[20]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>>

Negative indices are also accepted; they count backwards from the end of the string.

>>> monty[-1]
'n'
>>> monty[5]
' '
>>> monty[-7]
' '
>>> monty[-6:]
'Python'
>>>
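
To make the relationship between negative and positive indices explicit, here is a minimal sketch. It assumes the same monty string as above; the names n, last, same_last and space are only for illustration. A negative index i points at the same character as len(monty) + i.

```python
monty = 'Monty Python'  # same string as in the session above

n = len(monty)                # 12 characters

last = monty[-1]              # last character
same_last = monty[n - 1]      # same position, written as a positive index

space = monty[-7]             # 12 - 7 == 5, so this is monty[5]

print(last == same_last)              # both indices reach the same character
print(monty[-6:] == monty[n - 6:])    # both slices start at index 6
```

This is why monty[-7] and monty[5] both returned the space in the session above.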

This example processes the characters of a string one at a time. It also compares the output with and without a trailing comma (,) at the end of the print statement.

>>> sent = 'colorless green ideas sleep furiously'
>>> for char in sent:
...     print char,
... 
c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
>>> for char in sent:
...     print char
... 
c
o
l
o
r
l
e
s
s
 
g
r
e
e
n
 
i
d
e
a
s
 
s
l
e
e
p
 
f
u
r
i
o
u
s
l
y
>>>

With the trailing comma (,), print does not start a new line. Each character is written on the same line, right after the previous one and separated by a space.
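
As a side note, roughly the same two layouts can be built without a loop by joining the characters with a separator. This is only a sketch of an alternative, not what the book does; one_line and one_per_line are names made up for this example.

```python
sent = 'colorless green ideas sleep furiously'

# Iterating over a string yields its characters, so join() can place
# a separator between every pair of neighbouring characters.
one_line = ' '.join(sent)        # like "print char," : all on one line
one_per_line = '\n'.join(sent)   # like "print char"  : one character per line

print(one_line)
```

The spaces inside the sentence are characters too, which is why the gaps between words show up as three spaces in the one-line output.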

Here is another example of processing characters one at a time: counting how often each letter of the alphabet occurs.

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw('melville-moby_dick.txt')
>>> fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
>>> fdist.keys()
['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w', 'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']
>>> fdist.plot()
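
To try the same idea without downloading the corpus, the counting step can be sketched with collections.Counter from the standard library. This is an assumption-laden stand-in: sample is a short made-up input instead of the full Moby Dick text, and Counter is used in place of nltk.FreqDist.

```python
from collections import Counter

sample = 'Call me Ishmael.'  # short stand-in text, not the full novel

# Same filtering as above: keep only alphabetic characters, lowercased.
counts = Counter(ch.lower() for ch in sample if ch.isalpha())

# most_common(1) returns a list holding the single most frequent letter
# together with its count.
print(counts.most_common(1))
print(counts['a'])
```

Spaces and the full stop are dropped by isalpha(), and 'C' and 'I' are counted as 'c' and 'i' because of lower().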

Slicing with ranges. The tricky part is that a slice runs from the start of the range up to, but not including, the end of the range.

>>> monty[6:10]
'Pyth'
>>> monty[-12:-7]
'Monty'
>>> monty[:5]
'Monty'
>>> monty[6:]
'Python'

For example, monty[6:10] extracts the characters from monty[6] to monty[9]; monty[10] is not included. The output is therefore 'Pyth', not 'Pytho', as shown above. If the start of the range is omitted, the slice begins at the first character (index 0). If the end of the range is omitted, the slice runs to the end of the string.
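
One way to remember the end-exclusive rule is that, for indices inside the string, a slice from start to end has exactly end - start characters. A small sketch, with start, end, piece, head and tail as illustrative names:

```python
monty = 'Monty Python'

start, end = 6, 10
piece = monty[start:end]        # indices 6, 7, 8 and 9 only

print(len(piece) == end - start)

# Because the end index is excluded, cutting a string at any index
# gives two parts that fit back together with nothing lost or doubled.
head, tail = monty[:5], monty[5:]
print(head + tail == monty)
```

Here head is 'Monty' and tail starts with the space at index 5.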

The in operator is useful for checking whether a string contains specific characters (or words).

>>> phrase = 'And now for something completely different'
>>> if 'thing' in phrase:
...     print 'found "thing"'
... 
found "thing"

find returns the index at which the given substring starts, or -1 if the substring does not occur.

>>> monty.find('Python')
6
>>> monty[6:12]
'Python'
>>>
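
The in operator and find can be combined: in answers whether the substring is there, find says where. A minimal sketch using the phrase from the earlier example; word, pos, missing and found are names chosen for this illustration, and 'parrot' is just an arbitrary word that does not appear in the phrase.

```python
phrase = 'And now for something completely different'
word = 'thing'

pos = phrase.find(word)             # index where 'thing' starts
missing = phrase.find('parrot')     # -1, because 'parrot' is not in phrase

# The start index plus the length of the word gives the end of the slice.
found = phrase[pos:pos + len(word)]

print(pos)
print(found)
```

Checking for -1 before slicing matters, since -1 is also a valid index (the last character) and would silently give a wrong slice.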

To read the help documentation for strings:

>>> help(str)

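Differences between lists and strings (3.2.6). Both support indexing, slicing and concatenation, but a list can only be concatenated with another list, and a list can be modified in place while a string cannot: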

>>> query 
'Who knows?'
>>> beatles
['john', 'paul', 'george', 'ringo']
>>> query[2]
'o'
>>> beatles[2]
'george'
>>> query[:2]
'Wh'
>>> beatles[:2]
['john', 'paul']
>>> query + "I don't"
"Who knows?I don't"
>>> beatles + 'brian'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "str") to list
>>> beatles + ['brian']
['john', 'paul', 'george', 'ringo', 'brian']
>>>
>>> beatles[0] = "John Lennon"
>>> del beatles[-1]
>>> beatles
['John Lennon', 'paul', 'george']
>>> query[0] = 'F'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>
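
Since a string cannot be changed in place, the usual workaround is to build a new string. The sketch below shows two ways that should work, assuming the same query value as above; new_query, chars and rebuilt are illustrative names.

```python
query = 'Who knows?'

# Option 1: concatenate the new first character with the rest of the string.
new_query = 'F' + query[1:]

# Option 2: convert to a list (which is mutable), edit it, and join it back.
chars = list(query)
chars[0] = 'F'
rebuilt = ''.join(chars)

print(new_query)
print(query)   # the original string is unchanged
```

Both options leave query itself untouched; they only create new strings.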