[CWB] CWB-Encode and tokenization

Lukas Michelbacher michells at ims.uni-stuttgart.de
Mon Sep 20 13:40:26 CEST 2010


Hello Alberto,

> At the moment I may just split by space characters and pray :)

Have you tried the tokenization methods in NLTK?  NLTK implements the
"Punkt" system described in [1].  It's quite easy to use:

import nltk

# read the raw text from your file
text = open('abc').read()

# sentence segmentation with the pre-trained Punkt model
sent_tknzr = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sent_tknzr.tokenize(text, realign_boundaries=True)

# word tokenization, one token list per sentence
# (the word tokenizer expects a string, so tokenize each sentence)
word_tknzr = nltk.tokenize.punkt.PunktWordTokenizer()
tokens = [word_tknzr.tokenize(s) for s in sentences]
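For comparison, here is what plain whitespace splitting does on a made-up
sample sentence (just an illustration, not from your data) -- punctuation
stays glued to the words, which is why that approach needs prayer:

text = "Results (cf. Fig. 2) improved, i.e. error dropped."
print(text.split())
# -> ['Results', '(cf.', 'Fig.', '2)', 'improved,', 'i.e.', 'error', 'dropped.']

A proper tokenizer separates the parentheses and commas into their own
tokens instead.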

You can also add your own abbreviations for the sentence tokenizer in case it
splits in the wrong place.  I had to add some for common abbreviations in
scientific text that the Punkt model didn't process properly.

# note: Punkt stores abbreviations lowercased, without the trailing period
myabbrevs = set(('eq', 'eqs', 'i.e', '(i.e', '(e.g', 'fig', 'al', 'ref',
                 'refs', 'resp'))
sent_tknzr._params.abbrev_types.update(myabbrevs)
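To see why such abbreviation lists matter, here is a toy illustration
(naive period-based splitting, not Punkt itself, on an invented sentence)
of how "i.e." misleads a splitter that has no notion of abbreviations:

import re

text = "The error dropped, i.e. the model improved. Training continued."
# naive rule: a sentence ends at a period followed by whitespace
naive = re.split(r'(?<=\.)\s+', text)
print(len(naive))  # -> 3 pieces instead of the correct 2

Punkt avoids this by learning (or being told) which period-final tokens
are abbreviations rather than sentence ends.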

NLTK comes with models for a number of languages [2].

Hope this helps.

Lukas

[1] Tibor Kiss and Jan Strunk. 2006. Unsupervised Multilingual Sentence
     Boundary Detection.
     http://www.aclweb.org/anthology-new/J/J06/J06-4003.pdf

[2] Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek,
     Italian, Norwegian, Portuguese, Slovene, Spanish, Swedish and Turkish, to
     be precise ;).

--
Dipl.-Ling. Lukas Michelbacher
Institute for Natural Language Processing
University of Stuttgart

phone: +49 (0)711-685-84587
fax  : +49 (0)711-685-81366
email: michells at ims.uni-stuttgart.de





