[CWB] Parallel Corpora

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Jun 10 15:43:39 CEST 2016


It looks like your input file has whitespace at the start of the lines containing the <seg> and <s> elements.

Is this the case, or is it just an artefact of the email?

If there really is whitespace there, that is your problem: XML tags should not be preceded by whitespace on the line.
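
If the whitespace is real, one way to remove it before encoding is sketched below. This is only an illustration, not part of CWB: the function name and the sample lines are made up for the example, and it touches only lines whose first non-blank character is "<", leaving token lines alone.

```python
import io
import re

# Leading spaces/tabs that are immediately followed by an XML tag.
LEADING_WS_BEFORE_TAG = re.compile(r"^[ \t]+(?=<)")

def strip_whitespace_before_tags(lines):
    """Yield each line, removing leading whitespace in front of XML tags."""
    for line in lines:
        yield LEADING_WS_BEFORE_TAG.sub("", line)

# Hypothetical sample modelled on the extract in the original question.
sample = io.StringIO(
    '    <s id="1">\n'
    '        <seg lang="fr">\n'
    'La\n'
    'séance\n'
    '</seg>\n'
    '    </s>\n'
)

cleaned = list(strip_whitespace_before_tags(sample))
print("".join(cleaned), end="")
```

For a real file, the same generator could be fed an open file handle and its output written to a new .vrt file, which is then passed to cwb-encode with -f.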

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Philippe Baudrion
Sent: 10 June 2016 14:41
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Parallel Corpora

Thank you Susanne for your quick answer.
Until now I have only tried automatic indexing through CQPweb.
I guess I will need to dig a bit deeper into the CQP encoding options to get it to work.
Thank you for putting me on the right track, Philippe
On 06/10/2016 02:54 PM, Susanne Flach wrote:
Dear Philippe,

Have you tried declaring nested XML elements with :0 as described in Sec 4?
http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/node5.html

I’ve never had your problem, but I have always used the :0.
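
A minimal sketch of what such declarations might look like for the command quoted in the original message, assuming the :0 suffix goes directly after the element name and before the +annotation list, as in the tutorial. The helper function is hypothetical and the resulting command has not been run against this corpus.

```python
import shlex

def s_attribute(name, annotations=(), depth=0):
    """Build a cwb-encode -S value such as 'seg:0+lang'."""
    return "+".join([f"{name}:{depth}", *annotations])

declarations = [
    s_attribute("text", ["id", "organisation", "country", "type", "year", "signature"]),
    s_attribute("s", ["id"]),
    s_attribute("seg", ["lang"]),
]

# Paths and flags are copied from the command in the original question.
command = [
    "cwb-encode", "-xsB", "-c", "utf8",
    "-d", "/export/data/CQPweb_data/corpus/test_pb_fr_es",
    "-f", "/export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt",
    "-R", "/export/data/CQPweb_data/registry/test_pb_fr_es",
]
for declaration in declarations:
    command += ["-S", declaration]

# Print the command line rather than executing it.
print(shlex.join(command))
```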

Best,
Susanne

--
Susanne Flach, M.A.
Arbeitsbereich Linguistik
Institut für Englische Philologie
Freie Universität Berlin
Habelschwerdter Allee 45
14195 Berlin
NEW! Corpus tutorial with CQP: http://userpage.fu-berlin.de/~flach/corpling/

http://userpage.fu-berlin.de/~flach/

Room JK29/223
Phone +49 30 838 72311

On 10 Jun 2016, at 14:39, Philippe Baudrion <Philippe.Baudrion at unige.ch> wrote:

Dear all,
I am trying to index the following corpus structure but it is not working. Here is an extract of the corpus:

<text id="FR_DI_2000_1" organisation="CERD" country="Francia" type="Documento informativo" year="2000" signature="CERD/C/SR.1373">
    <s id="1">
        <seg lang="fr">
La
séance
est
ouverte
à
10h05
.
</seg>
        <seg lang="es">
Se
declara
abierta
la
sesión
a
las
10.05
horas
.
        </seg>
    </s>
...
</text>

The corresponding files on disk remain empty:

> ll /export/data/CQPweb_data/corpus/test_pb_fr_es/
total 120
drwxr-xr-x  2 www-data www-data 4096 Jun  6 12:18 ./
drwxrwxr-x 58 www-data letrint  4096 Jun  6 12:18 ../
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avs
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avx
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.rng
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg.rng
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avs
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avx
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.rng
-rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s.rng
-rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avs
-rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avx
-rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.rng
-rw-r--r--  1 www-data www-data   13 Jun  6 12:18 text_id.avs
-rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.avx
-rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.rng
...

The indexing commands are as follows:

> cwb-encode -xsB -c utf8 -d /export/data/CQPweb_data/corpus/test_pb_fr_es -f /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt -R "/export/data/CQPweb_data/registry/test_pb_fr_es"  -S text+id+organisation+country+type+year+signature -S s+id -S seg+lang 2>&1

> cwb-makeall -r "/export/data/CQPweb_data/registry" -V TEST_PB_FR_ES 2>&1



I guess that, because the <seg> element is repeated, it is impossible to index this corpus correctly, but I would like your opinion on that.

If it is possible, what would the correct indexing command be?



Thank you for your help, greetings,

--
Baudrion Philippe
Correspondant Informatique

UNIVERSITE DE GENEVE
Faculté de traduction et d'interprétation
40, bd. du Pont d'Arve
1211 GENEVE 4

Tél +41 22 379 94 95
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb



