Other corpora - Deutschina's Tech Diary

I am resuming my study at Chapter 2.1.6 of O'Reilly's textbook.

The list of corpora is available at http://nltk.org/nltk_data/. The textbook also introduces the Corpus HOWTO (http://www.nltk.org/howto), but I could not access this link.

Corpora in other languages:

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]
>>> nltk.corpus.indian.words('hindi.pos')
['\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xa3', '\xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xac\xe0\xa4\x82\xe0\xa4\xa7', ...]

Of course, all of them are listed on the nltk_data page.

CESS-ESP Treebank [ download | source ]
id: cess_esp; size: 2220392; author: ; copyright: ; license: If you use these corpora for research, please cite thusly: CESS-Cat project (M. Antonia Martí, Mariona Taulé, Lluís Márquez, Manuel Bertran (2007) "CESS-ECE: A Multilingual and Multilevel Annotated Corpus" in http://www.lsi.upc.edu/~mbertran/cess-ece/publications).;

Portuguese Treebank [ download | source ]
id: floresta; size: 1882021; author: ; copyright: ; license: Non-commercial use only;

Indian Language POS-Tagged Corpus [ download | source ]
id: indian; size: 199214; author: A Kumaran; copyright: ; license: Distributed with permission;
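
The escaped strings in the cess_esp and indian output above are raw bytes, which is how a Python 2 session prints non-ASCII text. As a minimal sketch, the two tokens can be decoded into readable text as follows. The encodings are my assumption, judged from the byte patterns (a single `\xe9` byte looks like Latin-1, and the three-byte `\xe0\xa4...` groups look like UTF-8); they are not something the corpus listing states.

```python
# Byte strings copied from the Python 2 session output above.
spanish_raw = b'Electricit\xe9_de_France'
hindi_raw = b'\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xa3'

# Assumption: the Spanish token is Latin-1 and the Hindi token is UTF-8.
spanish_word = spanish_raw.decode('latin-1')
hindi_word = hindi_raw.decode('utf-8')

# Print the decoded text, plus the number of code points in the Hindi token.
print(spanish_word)
print(hindi_word, len(hindi_word))
```

If a guessed encoding is wrong, `decode` either raises `UnicodeDecodeError` or produces obviously wrong characters, so this is easy to check by eye.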

The next example is udhr, the "Universal Declaration of Human Rights" in more than 300 languages.

[code language="python"]
>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1',
....
'Xhosa-Latin1', 'Yagua-Latin1', 'Yao-Latin1', 'Yapese-Latin1', 'Yoruba-UTF8', 'Zapoteco-Latin1', 'Zapoteco-SanLucasQuiavini-Latin1', 'Zhuang-Latin1', 'Zulu-Latin1']
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]
[/code]
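
The udhr file IDs appear to follow a "Language-Encoding" pattern, with the encoding after the last hyphen. That convention is my reading of the names in the output, not something I confirmed in the documentation. Under that assumption, a small sketch to split the IDs and count encodings could look like this (`split_fileid` and `sample_fileids` are names I made up for illustration):

```python
from collections import Counter

# A few file IDs copied from the fileids() output above.
sample_fileids = [
    'Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1',
    'Achuar-Shiwiar-Latin1', 'Adja-UTF8', 'Yoruba-UTF8',
    'Zapoteco-SanLucasQuiavini-Latin1', 'Zulu-Latin1',
]


def split_fileid(fileid):
    """Split on the LAST hyphen (assumed 'Language-Encoding' convention)."""
    language, encoding = fileid.rsplit('-', 1)
    return language, encoding


# Count how many of the sample files use each encoding suffix.
encoding_counts = Counter(split_fileid(f)[1] for f in sample_fileids)

for fileid in sample_fileids:
    print(split_fileid(fileid))
print(encoding_counts)
```

Splitting from the right matters because some language names contain hyphens themselves, for example 'Achuar-Shiwiar-Latin1'.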

Let's try it in my mother language, Japanese...

[code language="python"]
>>> nltk.corpus.udhr.words('Japanese_Nihongo-UTF8')[11:]
[u'\u3008', u'\u524d\u6587', u'\u3009', ...]
[/code]
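
The `\uXXXX` escapes in the Japanese output are just how the interpreter displays a list of Unicode strings; the text itself is intact. As a rough sketch (written for Python 3, where the `u` prefix is not needed), printing the tokens one by one shows the actual characters. The three tokens are copied from the output above, and `japanese_words` is a name I chose for illustration.

```python
# Tokens copied from the udhr output above, written as Unicode escapes.
japanese_words = ['\u3008', '\u524d\u6587', '\u3009']

# Print each token next to the code points it is made of.
for word in japanese_words:
    print(word, [hex(ord(ch)) for ch in word])

# Join the tokens to see them as running text.
heading = ''.join(japanese_words)
print(heading)
```

The joined result is the word for "preamble" enclosed in angle brackets, which is what one would expect near the start of the declaration.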