[CWB] WACKy corpora and cwb

Stefan Evert stefanML at collocations.de
Mon Jan 27 23:25:29 CET 2014


> Is there any easy way to transform the metadata format of the WaCky corpora so that they can be used with the CQPweb interface? We are trying to install a few of these corpora, but I have problems with some of the headings.

This is not a problem of the WaCky corpora in general.  Most of them are provided in a format that's directly CWB-compatible.  Only sdeWaC uses a different, nonstandard format.

That's also why I happen to have a script named "fix_sdewac_tagged.perl" on my computer. :-)

I'm attaching a ZIP archive with this script as well as the CWB/Perl encoding script (and a second script that annotates sentence lengths).  What you have to do is extract "web_address_list.txt" from the 7z archive (or download it separately), then run "extract_sdewac_tagged.sh".  If you want to keep the corpus in UTF-8 encoding or process an uncompressed version of it, you'll have to edit the scripts accordingly.
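
If it helps, the two steps can be wrapped in a small pre-flight check.  This is only a rough sketch in Python and not part of the attached tools: the function name "run_sdewac_import" and the injectable "runner" argument are made up for illustration, and it assumes that the driver script is a plain shell script started from the directory that holds the URL list.  It doesn't check anything else the scripts may need (such as the location of the corpus archive).

```python
import subprocess
from pathlib import Path

# File names as given in the message; everything else here is illustrative.
URL_LIST = "web_address_list.txt"
DRIVER_SCRIPT = "extract_sdewac_tagged.sh"


def run_sdewac_import(workdir, runner=subprocess.run):
    """Check that the URL list and the driver script exist, then run the script.

    `runner` is a hypothetical hook (defaults to subprocess.run) so that the
    check can be tried out without the real scripts.
    """
    workdir = Path(workdir)
    missing = [name for name in (URL_LIST, DRIVER_SCRIPT)
               if not (workdir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            "missing in %s: %s" % (workdir, ", ".join(missing)))
    # Assumption: the driver is run with sh from the directory containing
    # web_address_list.txt; check=True raises if the script exits non-zero.
    return runner(["sh", DRIVER_SCRIPT], cwd=str(workdir), check=True)
```

Calling run_sdewac_import(".") in the directory where you unpacked the ZIP would then either report which file is still missing or start the extraction.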

Hope this helps,
Stefan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sdewac-tools.zip
Type: application/zip
Size: 3637 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20140127/cf52e5b2/attachment.zip>

