Coding style (4.3)

This section of the whale book covers the basics of Python coding style. I will just pick out some interesting examples.

The following two pieces of code produce the same result: the first uses an explicit loop, the second a generator expression.

>>> tokens = nltk.corpus.brown.words(categories='news')
>>> count = 0
>>> total = 0
>>> for token in tokens:
...     count += 1
...     total += len(token)
... 
>>> print total / count
4.40154543827
>>> total = sum(len(t) for t in tokens)
>>> print total / len(tokens)
4.40154543827
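
For readers on Python 3, here is a minimal sketch of the same comparison. The token list is a made-up stand-in for the Brown corpus, so the number differs from the transcript above. In Python 3, print is a function and / is always true division; the decimal result in the Python 2 session suggests that "from __future__ import division" was in effect there.

```python
# Hypothetical sample standing in for the Brown news tokens.
tokens = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

# Explicit loop: keep a running count and a running total.
count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
loop_average = total / count

# Generator expression: the same computation in one line.
sum_average = sum(len(t) for t in tokens) / len(tokens)

print(loop_average, sum_average)
```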

Another example:

>>> word_list = sorted(set(tokens))
>>> word_list

Doing the same thing without the built-in functions, the code becomes:

>>> word_list = []
>>> len_word_list = len(word_list)
>>> i = 0
>>> while i < len(tokens):
...     j = 0
...     while j < len_word_list and word_list[j] < tokens[i]:
...             j += 1
...     if j == 0 or tokens[i] != word_list[j]:
...             word_list.insert(j, tokens[i])
...             len_word_list += 1
...     i += 1
... 
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
IndexError: list index out of range

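I got an error anyway... The cause is the if statement (line 5 of the loop): when the token sorts after every word already in the list, j ends up equal to len_word_list, so word_list[j] is out of range.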
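
One way to make the hand-written loop run is sketched below in Python 3. It advances j past every word that is less than or equal to the token, then compares the token with word_list[j - 1] instead of word_list[j]. The short tokens list is made up so the block runs without the corpus.

```python
# Hypothetical sample data standing in for the Brown corpus tokens.
tokens = ['the', 'dog', 'gave', 'the', 'cat', 'a', 'dog']

word_list = []
i = 0
while i < len(tokens):
    j = 0
    # Skip every word that sorts before or equal to the current token.
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    # j - 1 is always a valid index when j > 0, so no IndexError here.
    if j == 0 or tokens[i] != word_list[j - 1]:
        word_list.insert(j, tokens[i])
    i += 1

print(word_list)
```

The result should match sorted(set(tokens)), which is the point of the comparison: the one-liner is both shorter and harder to get wrong.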

>>> fd = nltk.FreqDist(nltk.corpus.brown.words())
>>> cumulative = 0.0
>>> for rank, word in enumerate(fd):
...     cumulative += fd[word] * 100 / fd.N()
...     print "%3d %6.2f%% %s" % (rank+1, cumulative, word)
...     if cumulative > 25:
...             break
... 
  1   5.40% the
  2  10.42% ,
  3  14.67% .
  4  17.78% of
  5  20.19% and
  6  22.40% to
  7  24.29% a
  8  25.97% in
>>> 

Using enumerate() gives the rank of each word directly, instead of keeping a separate counter in the loop.
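
A small Python 3 sketch of the same pattern follows. collections.Counter is used here as a stand-in for nltk.FreqDist so that the block runs without NLTK; I believe newer NLTK versions build FreqDist on Counter, in which case most_common() would be the way to get words in frequency order, but treat that as an assumption. The token list and the threshold of 50 are made up for the tiny sample.

```python
from collections import Counter

# Hypothetical token list standing in for the corpus.
tokens = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'cat', 'ran', 'off']
counts = Counter(tokens)
total = sum(counts.values())
threshold = 50  # percent; chosen to suit this small sample

cumulative = 0.0
rows = []
# most_common() yields (word, count) pairs from most to least frequent;
# start=1 makes enumerate() count ranks from 1 instead of 0.
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    cumulative += count * 100 / total
    rows.append((rank, cumulative, word))
    print("%3d %6.2f%% %s" % (rank, cumulative, word))
    if cumulative > threshold:
        break
```

Passing start=1 removes the need for rank+1 inside the loop body.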

Try to find the longest word in the corpus.

>>> text = nltk.corpus.gutenberg.words('milton-paradise.txt')
>>> longest = ''
>>> for word in text:
...     if len(word) > len(longest):
...             longest = word
... 
>>> longest
'unextinguishable'

The problem with the example above is that it returns only one word even when several words share the maximum length. As written, the first such word is selected. Changing the test to "if len(word) >= len(longest):" selects the last one instead. Either way, we still get only one word.
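
The difference between the two comparisons can be checked on a tiny made-up list in which two words tie for the longest length (a Python 3 sketch; the word list is hypothetical):

```python
# Hypothetical word list with two words tied for the longest length.
text = ['alpha', 'gamma', 'pi', 'mu']

first = ''
last = ''
for word in text:
    if len(word) > len(first):   # strict comparison keeps the first longest word
        first = word
    if len(word) >= len(last):   # non-strict comparison keeps the last longest word
        last = word

print(first)  # alpha
print(last)   # gamma
```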

>>> maxlen = max(len(word) for word in text)
>>> [word for word in text if len(word) == maxlen]
['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']
>>> 

This example finds the maximum length first, then selects every word of that length.
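
The two-step version reads the text twice. As a possible alternative, here is a sketch that collects all the longest words in a single pass. The function name longest_words is my own, not something from the textbook or NLTK.

```python
def longest_words(words):
    """Return every word of maximal length, in order of appearance."""
    longest = []
    maxlen = 0
    for word in words:
        if len(word) > maxlen:
            # A new record length: start the list over.
            maxlen = len(word)
            longest = [word]
        elif len(word) == maxlen:
            # A tie with the current record: keep this word too.
            longest.append(word)
    return longest

print(longest_words(['alpha', 'pi', 'gamma', 'mu']))
```

Like the list comprehension, this keeps repeated words if the same longest word occurs more than once.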

>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> n = 3
>>> [sent[i:i+n] for i in range(len(sent)-n+1)]
[['The', 'dog', 'gave'], ['dog', 'gave', 'John'], ['gave', 'John', 'the'], ['John', 'the', 'newspaper']]
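
The sliding-window comprehension above can be wrapped in a small function so that n is easy to vary. This is only a sketch in Python 3, and ngrams is a name I chose for illustration.

```python
def ngrams(seq, n):
    """Return every run of n consecutive items in seq as a list of lists."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
print(ngrams(sent, 2))
print(ngrams(sent, 7))  # n is longer than the sentence, so the result is empty
```

The upper bound len(seq) - n + 1 is what stops the window from running off the end of the list.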

Creating a two-dimensional array (a list of lists) with nested loops, here m rows by n columns:

>>> m, n = 3, 7
>>> array = [[set() for i in range(n)] for j in range(m)]
>>> array[2][5].add('Alice')
>>> array
[[set([]), set([]), set([]), set([]), set([]), set([]), set([])], [set([]), set([]), set([]), set([]), set([]), set([]), set([])], [set([]), set([]), set([]), set([]), set([]), set(['Alice']), set([])]]
>>> 

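The following example does not work as expected: multiplying a list with * repeats references to the same set object instead of creating new sets, so adding to one cell changes every cell. The reason is explained in an earlier section of the textbook.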

>>> array = [[set()] *n] *m
>>> array[2][5].add(7)
>>> import pprint
>>> pprint.pprint(array)
[[set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])]]
>>>
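
The sharing can be confirmed with the is operator and id(). The following Python 3 sketch builds the array both ways and counts how many distinct set objects each version actually contains; the variable names are mine.

```python
m, n = 3, 7

# Multiplication repeats references to one and the same set object.
shared = [[set()] * n] * m
# Nested comprehensions call set() once per cell.
separate = [[set() for i in range(n)] for j in range(m)]

print(shared[0][0] is shared[2][5])      # True
print(separate[0][0] is separate[2][5])  # False

# Count the distinct objects by identity.
distinct_shared = len({id(cell) for row in shared for cell in row})
distinct_separate = len({id(cell) for row in separate for cell in row})
print(distinct_shared, distinct_separate)  # 1 21
```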