[CWB] WACKy corpora and cwb

Andres Chandia andres at chandia.net
Tue Jan 28 00:36:05 CET 2014



Thanks a lot. I just made a few small adjustments and now it is fixing the corpora. You saved me
a lot of time, thanks again.

On Mon, January 27, 2014 23:25, Stefan Evert wrote:
> 
>> Is there any easy way to transform the metadata format for the Wacky corpora so that they
>> can be used with the cqpWeb interface? We are trying to install a few of these corpora but I
>> have problems with some of the headings.
> 
> This is not a problem of the WaCky corpora in general.  Most of them are provided in a format
> that's directly CWB-compatible.  Only sdeWaC has this different and nonstandard format.
> 
> That's also why I happen to have a script named "fix_sdewac_tagged.perl" on my computer. :-)
> 
> I'm attaching a ZIP archive with this script as well as the CWB/Perl encoding script (and a
> second script that annotates sentence lengths).  What you have to do is extract the
> "web_address_list.txt" from the 7z archive (or download it separately), then run
> "extract_sdewac_tagged.sh".  If you want to keep it in UTF-8 encoding or process an
> uncompressed version of the corpus, you'll have to edit the scripts accordingly.
> 
> Hope this helps,
> Stefan
> 
> 



_______________________
andrés chandía

administrator of
parles.upf.edu
psicoaching.net
mapuche koyaktu
ong mapuche koyaktu
Do not print unnecessarily. Protect the environment!