[CWB] CWB-Encode and tokenization

Serge Sharoff S.Sharoff at leeds.ac.uk
Mon Sep 20 14:39:47 CEST 2010


The easiest solution is to assume TreeTagger is installed.  It contains utf8-tokenizer.perl, which does a fairly reasonable job of tokenising a range of languages (some of its tokenisation rules are hard-coded, others are supplied by word lists).
Serge
________________________________________
From: cwb-bounces at sslmit.unibo.it [cwb-bounces at sslmit.unibo.it] On Behalf Of Hardie, Andrew [a.hardie at lancaster.ac.uk]
Sent: 20 September 2010 12:35
To: Open source development of the Corpus WorkBench
Subject: RE: [CWB] CWB-Encode and tokenization

Alas, tokenisation is a non-trivial problem, and a language-dependent one, so there's no way you can "not deal" with it. Even if a tokeniser were built into CWB (which is obviously not impossible, though nothing of the sort exists at the moment), you would still have to supply (for instance) the tokenisation rules, the exceptions lexicon, etc.
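
To make "tokenisation rules" and "exceptions lexicon" concrete, here is a minimal Python sketch. It is not part of CWB or TreeTagger; the names EXCEPTIONS, TOKEN_RE and tokenise are hypothetical, and the rule and lexicon are toy English-only assumptions that a real language would need to replace.

```python
import re

# Hypothetical exceptions lexicon: strings that must survive as single tokens.
# A real one would be language-specific and much larger.
EXCEPTIONS = {"e.g.", "i.e.", "Dr.", "etc."}

# Hypothetical rule: a token is either a run of word characters (allowing
# internal apostrophes or hyphens) or one single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenise(text):
    """Split text into tokens, consulting the exceptions lexicon first."""
    tokens = []
    for chunk in text.split():
        if chunk in EXCEPTIONS:
            # Listed exceptions are kept whole.
            tokens.append(chunk)
        else:
            # Everything else is split by the regular-expression rule.
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


if __name__ == "__main__":
    # One token per line, as in CWB's verticalised input.
    print("\n".join(tokenise("Dr. Smith isn't here, e.g. today.")))
```

Even this toy version shows the language dependence: the lexicon is English, and an exception followed directly by punctuation (such as "e.g.,") is not recognised without further rules.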

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Alberto Simões
Sent: 20 September 2010 11:56
To: Open source development of the Corpus WorkBench
Subject: [CWB] CWB-Encode and tokenization

Hello

So far, I have supplied all my CWB input files in tokenized form (one
token per line). Are there other formats that can be used, for example,
ones that make tokenization a task of CWB?

I am just asking because I am starting to write a script that encodes a
TMX file directly, and I would love it if I could avoid dealing with
tokenization :)

At the moment I may just split by space characters and pray :)
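
A minimal Python sketch of that whitespace-splitting approach, assuming the segments have already been extracted from the TMX file as plain strings. The function name to_vertical and the choice of an <s> region around each segment are illustrative assumptions, not anything prescribed by CWB.

```python
def to_vertical(segments):
    """Turn plain-text segments into one-token-per-line text.

    Each segment is wrapped in <s> ... </s> lines (an assumed structural
    attribute) and split on whitespace only.
    """
    lines = []
    for seg in segments:
        lines.append("<s>")
        # Naive tokenisation: split on any run of whitespace.
        lines.extend(seg.split())
        lines.append("</s>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_vertical(["Hello, world!", "It's a test."]), end="")
```

Punctuation stays glued to the neighbouring word here ("Hello," and "world!" are each one token), which is the main weakness of splitting on spaces alone.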

Thanks
Alberto
--
Alberto Simões
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

