[CWB] xml files

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Dec 17 14:08:11 CET 2014


This data isn’t TEI, but it looks relatively easy to convert because it already has a line-by-line layout.

You can do it with the following regex search-and-replace:

Search pattern:

<w pos="(.*?)" msd="(.*?)" lemma="(.*?)" lex="(.*?)" saldo="(.*?)" prefix="(.*?)" suffix="(.*?)" ref="(.*?)" dephead="(.*?)" deprel="(.*?)">(.*?)</w>

Replace with:

$11\t$1\t$2\t$3\t$4\t$5\t$6\t$7\t$8\t$9\t$10\t

(You can also omit some of the annotation fields if you do not need them. The above pattern is written for PCRE, but other regex flavours should be similar.)
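
For anyone who would rather script this than use an editor's search-and-replace, here is a minimal Python sketch of the same substitution. It assumes every <w> element sits on one line with its attributes in exactly the order of the search pattern above. The function name w_to_vrt and the sample line are made up for illustration and are not taken from the actual corpus. The fields are joined with tabs, so no trailing tab is written.

```python
import re

# Same attribute order as the search pattern above; the token text is the last group.
W_LINE = re.compile(
    r'<w pos="(.*?)" msd="(.*?)" lemma="(.*?)" lex="(.*?)" saldo="(.*?)" '
    r'prefix="(.*?)" suffix="(.*?)" ref="(.*?)" dephead="(.*?)" deprel="(.*?)">(.*?)</w>'
)


def w_to_vrt(line: str) -> str:
    """Rewrite a <w ...>token</w> line as tab-separated fields, token first.

    Lines that do not match the pattern (e.g. structural tags) are returned unchanged.
    """
    def repl(match: re.Match) -> str:
        *annotations, token = match.groups()
        return "\t".join([token, *annotations])

    return W_LINE.sub(repl, line)


# Hypothetical sample line, invented for this sketch.
sample = (
    '<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|bil|" lex="|bil..nn.1|" '
    'saldo="|bil..1|" prefix="|" suffix="|" ref="1" dephead="2" deprel="SS">bil</w>'
)
print(w_to_vrt(sample))
```

To drop annotation fields you do not need, select a subset of `annotations` before joining.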

Hope that helps.

Best,

Andrew

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ingrid Sör
Sent: 17 December 2014 13:00
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] CWB Digest, Vol 95, Issue 6

Thanks for your reply, Ruprecht.
I am sending you a short excerpt from the beginning of one corpus, as I can't find any information on whether the files are TEI and can't tell myself. If you can see that it is TEI, I would be very happy to try your XSLT script. It is very kind of you to share your code.
Best, Ingrid


On 17 December 2014 at 12:21, Ruprecht von Waldenfels <waldenfels at issl.unibe.ch> wrote:
Hi,
if this is TEI, I can send you my XSLT script.
Best,
Ruprecht
On 17.12.2014 at 12:00, cwb-request at sslmit.unibo.it wrote:

Today's Topics:

   1. Bug report-CQPweb 3.1.11 (Umut Demirhan)
   2. Re: Bug report-CQPweb 3.1.11 (Hardie, Andrew)
   3. xml files (Ingrid Sör)

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

Hi,
I hope this is the right forum for the following questions.
I am trying to get frequency data for Swedish nouns from certain corpora in the Swedish "Språkbanken". Their files are available for download in XML format, so I am now trying to make them usable with CWB. I read in the CWB encoding tutorial that the files need to be in .vrt format before they can be encoded, and that the conversion can be done easily via XSLT.
Is this the best way to go about things? I am not really familiar with XSLT and I think it will take some time to learn how to do it on my own, so if XSLT is the solution I would be very grateful if anyone has a "standard" XSLT script for me to adapt. Or is there any other way? I have been using sed in my Ubuntu terminal to get each tag or word onto a new line, but that seems a complicated way to also make the p-attributes tab-separated (as they are currently inside <w> tags).
Sorry if I am asking about rudimentary things; I am very new to CWB and corpus work. Thanks for any help!
Best regards,
Ingrid
