[CWB] WACKy corpora and cwb

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Jan 27 19:33:31 CET 2014


>> This obviously has to do with the labels year, source and error, which don't have the necessary closing

No, it's because they are not XML but only pseudo-XML: no attribute name is given. In XML it is illegal to attach a value directly to the tag name with an =; there needs to be a separate attribute name.

The fact that the process gets 600K lines into the corpus before hitting this error suggests that the problem may not affect most of the corpus. So perhaps the earlier texts will give you an example of what this is supposed to look like.
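
One way to check how widespread the problem is would be to scan the input for lines of this pseudo-XML shape and note where the first one occurs. The following is only a rough sketch in Python: the function name survey_pseudo_tags and the regular expression are illustrative assumptions, not part of CWB, and the pattern also tolerates the <year>="0"/> variant seen in the sample.

```python
import re
from collections import Counter

# Assumed shape of the pseudo-XML lines, e.g. <source="10178"/>.
# The optional ">" also tolerates the <year>="0"/> variant from the sample.
PSEUDO_TAG = re.compile(r'^<(\w+)>?="[^"]*"/>\s*$')


def survey_pseudo_tags(lines):
    """Return (first matching line number, Counter of tag names).

    `lines` is any iterable of strings, for instance an open file.
    The line number is 1-based; it is None if nothing matches.
    """
    first = None
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        match = PSEUDO_TAG.match(line)
        if match:
            counts[match.group(1)] += 1
            if first is None:
                first = number
    return first, counts
```

Passing an open file handle for the tagged corpus file to survey_pseudo_tags would then show whether the earlier part of the corpus is really free of such lines.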

Note that even if this is corrected, it does not necessarily mean it will work as you wish in CQPweb, as CWB in general and CQPweb in particular do not have unrestricted XML support.
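
For illustration, the correction could be sketched in Python roughly as follows. This is a minimal sketch under assumptions: the attribute name "value" and the function name fix_pseudo_tag are made up for the example, the pattern is guessed from the lines quoted in the error messages, and, as noted above, well-formed XML is no guarantee that CWB or CQPweb will handle the result as intended.

```python
import re

# Assumed shape of the pseudo-XML lines, e.g. <source="10178"/>.
# The optional ">" also tolerates the <year>="0"/> variant from the sample.
PSEUDO_TAG = re.compile(r'^<(\w+)>?="([^"]*)"/>(\s*)$')


def fix_pseudo_tag(line, attr_name="value"):
    """Rewrite <name="x"/> as <name attr_name="x"/>.

    Lines that do not match the pattern are returned unchanged.
    `attr_name` is a hypothetical attribute name chosen for this sketch.
    """
    match = PSEUDO_TAG.match(line)
    if match is None:
        return line
    name, value, ending = match.groups()
    # Keep the original line ending so the file layout is preserved.
    return f'<{name} {attr_name}="{value}"/>{ending}'
```

Applied line by line to the input file, this leaves token lines and ordinary tags such as <s> untouched and rewrites only the lines of the malformed kind.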

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Andres Chandia
Sent: 27 January 2014 18:23
To: Open source development of the Corpus WorkBench
Subject: [CWB] WACKy corpora and cwb

Is there any easy way to transform the metadata format of the WaCky corpora so that they can be used with the CQPweb interface? We are trying to install a few of these corpora, but I have problems with some of the headings.

When I try to index (encode) I get the following errors:

Malformed tag <source="10178"/>, inserted literally (file /B_NFS_P/resources/corpora/written/data/de/sdewac/sdewac-v3.tagged, line #633867).
Malformed tag <error="0.0185185185185185"/>, inserted literally (file /B_NFS_P/resources/corpora/written/data/de/sdewac/sdewac-v3.tagged, line #633868).
Malformed tag <source="10183"/>, inserted literally (file /B_NFS_P/resources/corpora/written/data/de/sdewac/sdewac-v3.tagged, line #633929).

This obviously has to do with the labels year, source and error, which don't have the necessary closing.

<sentence>
<year>="0"/>
<source="1403"/>
<error="0.00869565217391304"/>
<s>
Sie    PPER    Sie|sie
dürfen    VMFIN    dürfen

I can do a few transformations using Perl, but I'm wondering whether there is something that could make this easier and faster.

___________________
            andrés chandía