[CWB] CWB-Encode and tokenization
Alberto Simões
albie at alfarrabio.di.uminho.pt
Mon Sep 20 13:46:15 CEST 2010
Hello
I also have my own tokenizer (Lingua::PT::PLNbase) that, although
written for Portuguese, works pretty well on most Western European
languages...
I just wanted not to deal with it o:-)
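For reference, the "split by space characters" baseline mentioned below can be made a little less fragile by also separating punctuation from word tokens. A minimal sketch in plain Python (the function name `naive_tokenize` is made up for illustration; this is not what Lingua::PT::PLNbase or Punkt actually do):

```python
import re

def naive_tokenize(text):
    # Match either a word (allowing internal hyphens/apostrophes,
    # e.g. "don't", "state-of-the-art") or a single punctuation mark.
    # This is still the "pray" baseline: it knows nothing about
    # abbreviations or sentence boundaries, so "i.e." comes apart.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

print(naive_tokenize("Hi, world."))  # ['Hi', ',', 'world', '.']
```

A proper tokenizer such as Punkt improves on this mainly by learning which periods belong to abbreviations rather than ending sentences.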
Thanks
On 20/09/2010 12:40, Lukas Michelbacher wrote:
> Hello Alberto,
>
>> At the moment I may just split by space characters and pray :)
>
> Have you tried the tokenization methods in NLTK? They implemented the
> "Punkt" system described in [1]. It's quite easy to use:
>
> text = open('abc').read()
>
> sent_tknzr = nltk.data.load('tokenizers/punkt/english.pickle')
> X = sent_tknzr.tokenize(text, realign_boundaries=True)
>
> word_tknzr = nltk.tokenize.punkt.PunktWordTokenizer()
> tokens = [word_tknzr.tokenize(sent) for sent in X]
>
> You can also add your own abbreviations for the sentence tokenizer in
> case it
> splits in the wrong place. I had to add some for common abbreviations in
> scientific text that the Punkt model didn't process properly.
>
> myabbrevs = set(('eq', 'eqs', 'i.e', '(i.e', '(e.g', 'fig', 'al',
>                  'ref', 'refs', 'resp'))
> for p in myabbrevs:
>     sent_tknzr._params.abbrev_types.add(p)
>
> NLTK comes with models for a number of languages [2].
>
> Hope this helps.
>
> Lukas
>
> [1] Tibor Kiss; Jan Strunk. 2006. Unsupervised Multilingual Sentence
> Boundary Detection.
> http://www.aclweb.org/anthology-new/J/J06/J06-4003.pdf
>
> [2] Czech, Danish, Dutch, English, Estonian, Finnish, French, German,
> Greek, Italian, Norwegian, Portuguese, Slovene, Spanish, Swedish and
> Turkish, to be precise ;).
>
> --
> Dipl.-Ling. Lukas Michelbacher
> Institute for Natural Language Processing
> University of Stuttgart
>
> phone: +49 (0)711-685-84587
> fax : +49 (0)711-685-81366
> email: michells at ims.uni-stuttgart.de
>
--
Alberto Simões