[CWB] CWB-Encode and tokenization

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Sep 20 13:35:28 CEST 2010


Alas, tokenisation is a non-trivial problem, and a language-dependent one, so there's no way you can "not deal" with it. Even if a tokeniser were built into CWB (which is certainly possible, though nothing of the sort exists at the moment), you would still have to supply, for instance, the tokenisation rules, the exceptions lexicon, etc.

Andrew.
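
To illustrate the point about rules and an exceptions lexicon, here is a minimal sketch in Python of a tokeniser that produces the one-token-per-line form mentioned in the original message. It is not part of CWB: the names EXCEPTIONS, TOKEN_RE, tokenise and to_vertical are hypothetical, and the single regular expression stands in for real, language-specific rules.

```python
import re

# Hypothetical exceptions lexicon: strings that must stay single tokens.
# A real lexicon is language-specific and far larger.
EXCEPTIONS = {"e.g.", "i.e.", "Dr.", "etc."}

# Assumed rule: a word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text, exceptions=EXCEPTIONS):
    """Split text into tokens, keeping listed exceptions intact."""
    tokens = []
    for chunk in text.split():
        if chunk in exceptions:
            # The whole whitespace-delimited chunk is a known exception.
            tokens.append(chunk)
        else:
            # Otherwise separate words from adjacent punctuation.
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


def to_vertical(text):
    """Return the tokens of text, one per line."""
    return "\n".join(tokenise(text)) + "\n"


if __name__ == "__main__":
    print(to_vertical("Dr. Smith doesn't like it, e.g. today."), end="")
```

Even this small sketch shows why the rules cannot be avoided: an exception immediately followed by punctuation (such as "e.g.,") is not recognised and gets split at every full stop, and the word pattern makes assumptions about hyphens and apostrophes that will not hold for every language.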

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Alberto Simões
Sent: 20 September 2010 11:56
To: Open source development of the Corpus WorkBench
Subject: [CWB] CWB-Encode and tokenization

Hello

So far, I used all my CWB input files in a tokenized form (one token per
line). Are there other formats that can be used, for example, making the
tokenization a task of CWB?

I am just asking because I am starting to write a script to encode a
TMX file directly, and I would love it if I could not deal with
tokenization :)

At the moment I may just split by space characters and pray :)

Thanks
Alberto
-- 
Alberto Simões
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

