[CWB] other kind of annotations in cwb corpus

Yannick Versley yversley at gmail.com
Tue Feb 15 13:39:03 CET 2011


Hi Luigi,

The easiest way to get tokenized and tagged data from raw text is to use
an existing toolkit such as TextPro (http://textpro.fbk.eu/) or
Tanl (http://medialab.di.unipi.it/wiki/Tanl), which does tokenization,
POS tagging and lemmatization for you, and then, if necessary, adjust the
output format so that cwb-encode accepts it.
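
To give an idea of what "fixing the format" can look like, here is a
minimal Python sketch. It assumes (and this is only an assumption, the
actual TextPro or Tanl output may well look different) that the tagger
writes one token per line with whitespace-separated columns and a blank
line between sentences. The function name tagger_to_vrt and the three
columns word/pos/lemma are made up for illustration. It produces the
kind of input cwb-encode reads: one token per line, tab-separated
columns, XML tags on lines of their own.

```python
def tagger_to_vrt(lines, n_columns=3):
    """Turn hypothetical tagger output into one-token-per-line input.

    Assumed input: whitespace-separated columns (e.g. word, pos, lemma),
    with a blank line between sentences. Splitting on whitespace means a
    value that itself contains a space would need different handling.
    """
    out = []
    in_sentence = False
    for raw in lines:
        fields = raw.split()
        if not fields:
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                out.append("</s>")
                in_sentence = False
            continue
        if len(fields) != n_columns:
            raise ValueError(f"expected {n_columns} columns, got: {raw!r}")
        if not in_sentence:
            out.append("<s>")
            in_sentence = True
        # Columns are separated by a single tab in the output.
        out.append("\t".join(fields))
    if in_sentence:
        out.append("</s>")
    return out


if __name__ == "__main__":
    sample = ["Il DET il", "gatto NOUN gatto", "", "Dorme VERB dormire"]
    print("\n".join(tagger_to_vrt(sample)))
```

The column check is there so that a malformed line is caught before
encoding rather than silently shifting the attributes.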

You cannot feed raw text directly to cwb-encode, because it needs
tokenized input (one token per line). And once you are running a
preprocessing step anyway, even one as simple as "replace every space
with a line break", you could just as well use one of the existing
linguistic pipelines, since the quality of their output (including POS,
lemma, morph) is almost always good enough to be usable.

If you would rather do everything by hand, then starting with "replace
every space with a newline" and fixing up the rest manually can also
get you somewhere.
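
A minimal sketch of that naive approach in Python, in case it helps.
The function name to_vertical and the "_" placeholder are my own
choices for illustration; the placeholder column is just a slot that
you could later fill in by hand, e.g. with lemmas.

```python
def to_vertical(text, placeholder=None):
    """Split text on whitespace and return it with one token per line.

    This is deliberately crude: punctuation stays attached to the
    neighbouring word ("dorme." is one token), so the result still
    needs manual correction.
    """
    rows = []
    for token in text.split():
        if placeholder is None:
            rows.append(token)
        else:
            # Second, tab-separated column to be filled in later.
            rows.append(f"{token}\t{placeholder}")
    if not rows:
        return ""
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    print(to_vertical("Il gatto dorme.", placeholder="_"), end="")
```

Each output line is one token, optionally followed by a tab and the
placeholder, which is the one-token-per-line shape that cwb-encode
expects.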

Best,
Yannick

On Tue, Feb 15, 2011 at 1:13 PM, luigi.talamo at libero.it
<luigi.talamo at libero.it> wrote:
>  Hi there! :)
>
> yannick wrote:
>> The (conceptually) simpler way to do this would be to dump the whole corpus
>> (using cwb-decode), run your favorite tools on it to get a version with the
>> additional annotations, and then replace the old data directory with the
>> cwb-encode'd version of your new, enriched version of the corpus.
>
> Ok, I forgot to tell you that I'll probably start with a fresh corpus i.e. a
> corpus which is not encoded yet in cwb (and probably lacks any sort of xml
> encoding).
> So, if I begin with, say, a bare txt file, I'll only need to process it
> through cwb-encode, right?
> Which documentation should I read to prepare my collected data? In the 'alpha'
> version of my corpus I just want to have the additional annotation and the
> lemmatization: I can leave the pos tagging to further releases of the corpus.
> Is it possible to do that?
>
> Thank you,
>
> Luigi

