[CWB] xml files

Roland Schäfer roland.schaefer at fu-berlin.de
Wed Dec 17 14:18:40 CET 2014


Hi Ingrid,

I have a certain feeling that this list is not the perfect place to ask
this question, but:

1. Could you specify which corpora you are trying to process and from
where you downloaded them?

2. Språkbanken's Korp interface is perfectly suited to extract frequency
data. In my opinion, this is what it does best. Unless you need to do
something very exotic or have thousands of nouns to look up, I'd suggest
you try it.

@Ruprecht: To the best of my knowledge, Språkbanken doesn't do TEI.

Best,
Roland


PS: I just saw the snippet which you sent. This looks like the work of
Språkbanken's corpus pipeline. Although I worked with it for a few
months, I'm not 100% sure whether you could just take those files and do
"make cwb" after installing that pipeline. Might be worth a try,
though... Or just ask the guys over at SB.

http://spraakbanken.gu.se/swe/forskning/infrastruktur/korp/distribution/corpuspipeline

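However, the XML can easily be processed with sed: put each token on its
own line, turn the attributes inside the <w> tags into tab-separated
columns, and get rid of the <w> tags themselves.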
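
If sed gets too fiddly, the same idea can be sketched in a few lines of
Python. This is only a rough sketch and not the Språkbanken pipeline: it
assumes that every token looks like <w pos="..." lemma="...">word</w>, and
the attribute names "pos" and "lemma" as well as the function names are my
assumptions for illustration. Check them against your actual files.

```python
import re

# Assumed token markup: <w pos="NN" lemma="hund">hunden</w>
W_TAG = re.compile(r'<w\b([^>]*)>(.*?)</w>', re.DOTALL)
ATTR = re.compile(r'(\w+)="([^"]*)"')


def w_to_vrt_line(match, attrs):
    """Turn one <w ...>word</w> match into a tab-separated token line."""
    found = dict(ATTR.findall(match.group(1)))
    word = match.group(2).strip()
    # Missing or empty attributes become "_" so every line has the same columns.
    cols = [word] + [found.get(name) or "_" for name in attrs]
    return "\n" + "\t".join(cols) + "\n"


def xml_to_vrt(xml_text, attrs=("pos", "lemma")):
    """Convert XML with <w> tokens into one-token-per-line (.vrt style) text."""
    # 1. Replace each <w> element by "word<TAB>attr1<TAB>attr2" on its own line.
    out = W_TAG.sub(lambda m: w_to_vrt_line(m, attrs), xml_text)
    # 2. Put every remaining (structural) tag on its own line.
    out = re.sub(r'(<[^>]+>)', r'\n\1\n', out)
    # 3. Drop empty lines and surrounding whitespace.
    lines = [line.strip() for line in out.splitlines()]
    return "\n".join(line for line in lines if line) + "\n"


if __name__ == "__main__":
    sample = ('<sentence id="s1"><w pos="NN" lemma="hund">hunden</w> '
              '<w pos="VB" lemma="skälla">skäller</w></sentence>')
    print(xml_to_vrt(sample), end="")
```

The order of names in attrs decides the order of the columns, so it has to
match the order in which you declare the p-attributes when encoding.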

More importantly (and even more off-topic for this list [my
apologies!]), I see you want to use the bloggmix corpus. If it is web
data you are interested in, you can also try the 4.8 billion token
SVCOW14AX corpus, available from "Corpora from the Web" (COW) for
download in CWB format. It even comes with the appropriate commands for
CWB import (cf. the README). You need to register in Colibri² at
webcorpora.org, then log in and go to the download section:

http://webcorpora.org/
http://corporafromtheweb.org/svcow14/



On 12/17/2014 11:25 AM, Ingrid Sör wrote:
> Hi,
> 
> I hope this is the right forum for the following questions.
> I am trying to get frequency data of Swedish nouns from certain corpora
> in the Swedish "Språkbanken". They have their files available for
> download in xml format, so I am now trying to make them usable with CWB.
> I read in the CWB encoding tutorial that the files need to be in
> .vrt-format to encode them and that this can be done easily via XSLT.
> 
> Is this the best way to go about things? I am not familiar with XSLT
> really and I think it will take some time to learn how to do it on my
> own, so if XSLT is the solution I would be very grateful if anyone might
> have a "standard" xslt code for me to adapt. Or if there is any other
> way? I have been using sed in my Ubuntu terminal to get each tag or
> word onto a new line, but this seems a complicated way to also make the
> p-attributes tab-separated (as they are now inside <w> tags).
> 
> Sorry if I am probably asking about rudimentary things now - I am very
> new to CWB and corpus work. Thanks for any help!
> Best regards,
> Ingrid