[CWB] CWB-Encode and tokenization
Lukas Michelbacher
michells at ims.uni-stuttgart.de
Mon Sep 20 13:40:26 CEST 2010
Hello Alberto,
> At the moment I may just split by space characters and pray :)
Have you tried the tokenization methods in NLTK? They implemented the
"Punkt" system described in [1]. It's quite easy to use:
import nltk

text = open('abc').read()
sent_tknzr = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sent_tknzr.tokenize(text, realign_boundaries=True)
word_tknzr = nltk.tokenize.punkt.PunktWordTokenizer()
tokens = [word_tknzr.tokenize(s) for s in sentences]
You can also add your own abbreviations for the sentence tokenizer in case it
splits in the wrong place. I had to add some for common abbreviations in
scientific text that the Punkt model didn't process properly.
myabbrevs = set(('eq', 'eqs', 'i.e', '(i.e', '(e.g', 'fig', 'al', 'ref',
                 'refs', 'resp'))
for p in myabbrevs:
    sent_tknzr._params.abbrev_types.add(p)
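If NLTK isn't an option, the effect of such an abbreviation list can be sketched in
plain Python. This is only a toy illustration of the idea, not the actual Punkt
algorithm (which learns abbreviations unsupervised); the function names, the regex,
and the ABBREVS set below are my own:

```python
import re

# Known abbreviations: a period after these does not end a sentence.
ABBREVS = {'eq', 'eqs', 'i.e', 'e.g', 'fig', 'al', 'ref', 'refs', 'resp'}

def naive_split(text):
    """Split after '.', '!' or '?' plus whitespace -- no abbreviation check."""
    return re.split(r'(?<=[.!?])\s+', text)

def abbrev_aware_split(text):
    """Like naive_split, but re-join a fragment to its predecessor when the
    predecessor ends in a known abbreviation."""
    out = []
    for part in re.split(r'(?<=[.!?])\s+', text):
        prev_last = out[-1].split()[-1].rstrip('.').lower() if out else ''
        if prev_last in ABBREVS:
            out[-1] += ' ' + part   # false boundary: merge back
        else:
            out.append(part)
    return out
```

For example, `naive_split("See Fig. 3 for details. It works.")` wrongly breaks
after "Fig.", while the abbreviation-aware version keeps that sentence whole.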
NLTK comes with models for a number of languages [2].
Hope this helps.
Lukas
[1] Tibor Kiss and Jan Strunk. 2006. Unsupervised Multilingual Sentence
Boundary Detection.
http://www.aclweb.org/anthology-new/J/J06/J06-4003.pdf
[2] Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek,
Italian, Norwegian, Portuguese, Slovene, Spanish, Swedish and Turkish, to
be precise ;).
--
Dipl.-Ling. Lukas Michelbacher
Institute for Natural Language Processing
University of Stuttgart
phone: +49 (0)711-685-84587
fax : +49 (0)711-685-81366
email: michells at ims.uni-stuttgart.de