[CWB] CWB-Encode and tokenization

Alberto Simões albie at alfarrabio.di.uminho.pt
Mon Sep 20 13:46:15 CEST 2010


Hello

I also have my own tokenizer (Lingua::PT::PLNbase) that, although
written for Portuguese, works pretty well on most western European
languages...

I just didn't want to have to deal with it o:-)

Thanks

On 20/09/2010 12:40, Lukas Michelbacher wrote:
> Hello Alberto,
> 
>> At the moment I may just split by space characters and pray :)
> 
> Have you tried the tokenization methods in NLTK?  It implements the
> "Punkt" system described in [1].  It's quite easy to use:
> 
> import nltk
> 
> text = open('abc').read()
> 
> # sentence splitting with the pre-trained English Punkt model
> sent_tknzr = nltk.data.load('tokenizers/punkt/english.pickle')
> sents = sent_tknzr.tokenize(text, realign_boundaries=True)
> 
> # PunktWordTokenizer works on a single string, so tokenize
> # each sentence separately
> word_tknzr = nltk.tokenize.punkt.PunktWordTokenizer()
> tokens = [word_tknzr.tokenize(s) for s in sents]
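> 
> Since cwb-encode expects verticalized input (one token per line, with
> structural tags like <s> marking sentence boundaries), you can write
> the result out directly; a minimal sketch, with 'corpus.vrt' as a
> hypothetical output filename:
> 
> with open('corpus.vrt', 'w') as out:
>     for sent in tokens:
>         out.write('<s>\n')
>         for tok in sent:
>             out.write(tok + '\n')
>         out.write('</s>\n')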
> 
> You can also add your own abbreviations to the sentence tokenizer in
> case it splits in the wrong place.  I had to add some common
> abbreviations from scientific text that the Punkt model didn't handle
> properly.
> 
> # Punkt stores abbreviations lowercased and without the final period
> myabbrevs = set(('eq', 'eqs', 'i.e', '(i.e', '(e.g', 'fig', 'al',
>                  'ref', 'refs', 'resp'))
> for p in myabbrevs:
>     sent_tknzr._params.abbrev_types.add(p)
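> 
> If you have a lot of domain text, another option is to retrain Punkt
> on it instead of patching the abbreviation list by hand; a sketch,
> assuming domain_text holds your raw training text:
> 
> from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer
> 
> trainer = PunktTrainer()
> trainer.train(domain_text)  # unsupervised learning, as in [1]
> my_tknzr = PunktSentenceTokenizer(trainer.get_params())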
> 
> NLTK comes with models for a number of languages [2].
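> 
> For Portuguese, for instance, loading should work the same way as the
> English model above (a sketch, assuming the model is installed):
> 
> pt_tknzr = nltk.data.load('tokenizers/punkt/portuguese.pickle')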
> 
> Hope this helps.
> 
> Lukas
> 
> [1] Tibor Kiss and Jan Strunk. 2006. Unsupervised Multilingual
>     Sentence Boundary Detection. Computational Linguistics 32(4).
>     http://www.aclweb.org/anthology-new/J/J06/J06-4003.pdf
> 
> [2] Czech, Danish, Dutch, English, Estonian, Finnish, French, German,
>     Greek, Italian, Norwegian, Portuguese, Slovene, Spanish, Swedish
>     and Turkish, to be precise ;).
> 
> -- 
> Dipl.-Ling. Lukas Michelbacher
> Institute for Natural Language Processing
> University of Stuttgart
> 
> phone: +49 (0)711-685-84587
> fax  : +49 (0)711-685-81366
> email: michells at ims.uni-stuttgart.de
> 

-- 
Alberto Simões

