[CWB] WACKy corpora and cwb

Andres Chandia andres at chandia.net
Mon Jan 27 19:23:13 CET 2014



Is there an easy way to transform the metadata format of the WaCky corpora so that they can
be used with the CQPweb interface? We are trying to install a few of these corpora, but I have
problems with some of the headers.

When I try to index (encode) the corpus, I get the following errors:

Malformed tag <source="10178"/>, inserted literally (file /B_NFS_P/resources/corpora/written/data/de/sdewac/sdewac-v3.tagged, line #633867).
Malformed tag <error="0.0185185185185185"/>, inserted literally (file /B_NFS_P/resources/corpora/written/data/de/sdewac/sdewac-v3.tagged, line #633868).
Malformed tag <source="10183"/>, inserted literally (file /B_NFS_P/resources/corpora/written/data/de/sdewac/sdewac-v3.tagged, line #633929).

This apparently has to do with the year, source and error tags, which are not well-formed:
the value follows the tag name directly instead of being given as a named attribute.
For example:

<sentence>
<year>="0"/>
<source="1403"/>
<error="0.00869565217391304"/>
<s>
Sie    PPER    Sie|sie
dürfen    VMFIN    dürfen

I can do a few transformations using Perl, but I'm wondering whether there is something that
would make this easier and faster.
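
To make the kind of transformation I have in mind concrete, here is a minimal sketch in Python (it could just as well be Perl). It assumes that the metadata lines always directly follow an opening tag such as <sentence>, and that folding them into that tag as ordinary attributes is an acceptable target format. The function names and both regular expressions are my own and hypothetical; they only cover the two line shapes shown in the sample above.

```python
import re

# Assumption: metadata lines look like <source="1403"/> or, as in the
# sample above, <year>="0"/> (with a stray ">" after the tag name).
META_LINE = re.compile(r'^<(\w+)>?="([^"]*)"\s*/>$')
# A bare opening tag without attributes, e.g. <sentence> or <s>.
OPEN_TAG = re.compile(r'^<(\w+)>$')


def render(pending):
    """Build an opening tag from a [name, [(attr, value), ...]] pair."""
    name, attrs = pending
    return "<" + name + "".join(f' {k}="{v}"' for k, v in attrs) + ">"


def fold_metadata(lines):
    """Fold <name="value"/> lines into the preceding opening tag.

    Yields the rewritten lines without trailing newlines. Metadata lines
    that do not follow an opening tag are passed through unchanged.
    """
    pending = None  # opening tag that may still receive attributes
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        meta = META_LINE.match(stripped)
        if meta and pending is not None:
            pending[1].append((meta.group(1), meta.group(2)))
            continue
        if pending is not None:
            # The run of metadata lines is over: emit the merged tag.
            yield render(pending)
            pending = None
        opening = OPEN_TAG.match(stripped)
        if opening:
            pending = [opening.group(1), []]
        else:
            yield line
    if pending is not None:
        yield render(pending)
```

Applied to the sample above, this would turn the <sentence> block header into a single line, <sentence year="0" source="1403" error="0.00869565217391304">, and leave the <s> tag and the token lines untouched. The generator can be fed an open file object and its output written line by line to a new file; the file encoding to use is something I would still have to check for each corpus.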

___________________
andrés chandía

administrator of
parles.upf.edu
psicoaching.net
mapuche koyaktu
NGO mapuche koyaktu