[CWB] xml files

Ingrid Sör ingrid.e.sor at gmail.com
Wed Dec 17 15:38:52 CET 2014


Hi,
Thanks Andrew for the regex. I changed it a bit for sed in my Ubuntu
terminal, and it seems to work fine.
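
For anyone who would rather not do this in sed, here is a minimal Python
sketch of the same idea: turn each <w ...>token</w> line into a
tab-separated .vrt line and leave the structural tags alone. This is not
the regex from Andrew. The attribute names "pos" and "lemma", the sample
lines, and the function name w_to_vrt are assumptions for illustration, and
it assumes one tag or word per line.

```python
import re

# Assumed p-attribute names; adjust to the attributes in the actual files.
P_ATTRS = ("pos", "lemma")

# Matches one <w ...>token</w> element; \b keeps tags such as <word> out.
W_TAG = re.compile(r'<w\b([^>]*)>(.*?)</w>')
# Matches name="value" pairs inside the opening tag.
ATTR = re.compile(r'([\w:.-]+)="([^"]*)"')


def w_to_vrt(line):
    """Convert a line holding a <w> element to a tab-separated VRT line.

    Lines without a <w> element (e.g. <sentence>, </text>) are returned
    stripped but otherwise unchanged, so they stay as s-attribute tags.
    """
    match = W_TAG.search(line)
    if match is None:
        return line.strip()
    attrs = dict(ATTR.findall(match.group(1)))
    token = match.group(2)
    # Token first, then the p-attributes in a fixed order; missing ones
    # become empty columns so every token line has the same width.
    return "\t".join([token] + [attrs.get(name, "") for name in P_ATTRS])


# Hypothetical sample input, one tag or word per line.
sample = [
    '<sentence id="s1">',
    '<w pos="DT" lemma="|en|">En</w>',
    '<w pos="NN" lemma="|hund|">hund</w>',
    '</sentence>',
]
vrt_lines = [w_to_vrt(line) for line in sample]
print("\n".join(vrt_lines))
```

The column order in P_ATTRS would need to match the order in which the
p-attributes are declared when encoding the corpus.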

Thanks Roland also for your email and apologies for perhaps sending my
questions to the wrong place. I could probably do with getting in touch
more with the guys at Språkbanken. I tried the Korp online search function,
but I couldn't seem to get it to find what I wanted. Thought I might be
able to do more from my own computer.

Getting back to your questions, Roland: I want to look at various corpora
to get a fairly representative frequency list of Swedish nouns, possibly
from data that in some way resembles spoken Swedish without being
transcripts of speech, hence my interest in blog texts among other things.
I will probably want to try out various combinations of corpora, from both
the web and newspapers among other sources, to end up with suitable nouns
for a test of aphasia. I downloaded the corpora from Språkbanken as XML
(rearranged).
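
To make the goal concrete, here is a minimal Python sketch of counting noun
lemmas in tab-separated .vrt lines. The column layout (token, pos, lemma),
the noun tag "NN", the sample lines, and the function name noun_frequencies
are assumptions for illustration, not something taken from the Språkbanken
files themselves.

```python
from collections import Counter


def noun_frequencies(vrt_lines, pos_col=1, lemma_col=2, noun_tag="NN"):
    """Count lemmas of tokens whose part-of-speech column equals noun_tag.

    Lines starting with "<" are treated as structural tags and skipped
    (a token that itself starts with "<" would be skipped as well).
    """
    counts = Counter()
    for line in vrt_lines:
        if line.startswith("<"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Ignore lines that do not have enough columns.
        if len(cols) <= max(pos_col, lemma_col):
            continue
        if cols[pos_col] == noun_tag:
            counts[cols[lemma_col]] += 1
    return counts


# Hypothetical sample in the assumed token/pos/lemma layout.
sample_vrt = [
    "<sentence>",
    "En\tDT\ten",
    "hund\tNN\thund",
    "jagar\tVB\tjaga",
    "hunden\tNN\thund",
    "katten\tNN\tkatt",
    "</sentence>",
]
freq = noun_frequencies(sample_vrt)
print(freq.most_common(2))  # [('hund', 2), ('katt', 1)]
```

Counters from several corpora can be added together with "+", which would
make it easy to try out different combinations of corpora.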

Regarding searching in Korp, maybe I should give it another try. There
seemed to be several problems. I tried the *group* command, for
instance, but couldn't get it to work (is this what you would use to get
frequency data for parts of speech?). The errors may well have been due
to my inexperience though, since, as you say, Korp is a good tool for
getting frequencies. Searching through 1G or more tokens also took a long
time (it timed out), which would be an issue.

Thanks again for the help.
Best, Ingrid

On 17 December 2014 at 14:18, Roland Schäfer <roland.schaefer at fu-berlin.de>
wrote:
>
> Hi Ingrid,
>
> I have a certain feeling that this list is not the perfect place to ask
> this question, but:
>
> 1. Could you specify which corpora you are trying to process and from
> where you downloaded them?
>
> 2. Språkbanken's Korp interface is perfectly suited to extract frequency
> data. In my opinion, this is what it does best. Unless you need to do
> something very exotic or have thousands of nouns to look up, I'd suggest
> you try it.
>
> @Ruprecht: To the best of my knowledge, Språkbanken doesn't do TEI.
>
> Best,
> Roland
>
>
> PS: I just saw the snippet which you sent. This looks like the work of
> Språkbanken's corpus pipeline. Although I worked with it for a few
> months, I'm not 100% sure whether you could just take those files and do
> "make cwb" after installing that pipeline. Might be worth a try,
> though... Or just ask the guys over at SB.
>
>
> http://spraakbanken.gu.se/swe/forskning/infrastruktur/korp/distribution/corpuspipeline
>
> However, it can easily be processed with sed. Just get rid of the <w>
> tags as well.
>
> More importantly (and even more off-topic for this list [my
> apologies!]), I see you want to use the bloggmix corpus. If it is web
> data you are interested in, you can also try the 4.8 billion token
> SVCOW14AX corpus, available from "Corpora from the Web" (COW) for
> download in CWB format. It even comes with the appropriate commands for
> CWB import (cf. the README). You need to register in Colibri² at
> webcorpora.org, then log in and go to the download section:
>
> http://webcorpora.org/
> http://corporafromtheweb.org/svcow14/
>
>
>
> On 12/17/2014 11:25 AM, Ingrid Sör wrote:
> > Hi,
> >
> > I hope this is the right forum for my following questions..
> > I am trying to get frequency data of Swedish nouns from certain corpora
> > in the Swedish "Språkbanken". They have their files available for
> > download in xml format, so I am now trying to make them usable with CWB.
> > I read in the CWB encoding tutorial that the files need to be in
> > .vrt-format to encode them and that this can be done easily via XSLT.
> >
> > Is this the best way to go about things? I am not familiar with XSLT
> > really and I think it will take some time to learn how to do it on my
> > own, so if XSLT is the solution I would be very grateful if anyone might
> > have "standard" XSLT code for me to adapt. Or is there any other
> > way? I have been using /sed/ in my Ubuntu terminal to get each tag or
> > word onto a new line, but this seems a complicated way to also make the
> > p-attributes tab-separated (as they are now inside <w> tags).
> >
> > Sorry if I am probably asking about rudimentary things now - I am very
> > new to CWB and corpus work. Thanks for any help!
> > Best regards,
> > Ingrid
> >
> >
> > _______________________________________________
> > CWB mailing list
> > CWB at sslmit.unibo.it
> > http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> >
>