[CWB] Parallel Corpora
Hardie, Andrew
a.hardie at lancaster.ac.uk
Fri Jun 10 15:43:39 CEST 2016
It looks like your input file has whitespace at the start of the lines containing the <seg> and <s> elements.
Is this the case, or is it just an artefact of the email?
If there really is whitespace there, that is your problem: XML tags should not be preceded by whitespace on the line.
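If the whitespace does turn out to be in the file, something along these lines could be used to check for it and to write a cleaned copy before encoding. This is only a sketch: the function names strip_leading_ws and count_indented_tags are made up for illustration, and it assumes a POSIX shell with standard sed and grep.

```shell
# count_indented_tags FILE: print how many lines have whitespace before a "<"
count_indented_tags() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps this usable under "set -e"
    grep -c '^[[:space:]][[:space:]]*<' "$1" || true
}

# strip_leading_ws FILE: print FILE with leading spaces/tabs removed from every line
strip_leading_ws() {
    sed 's/^[[:space:]]*//' "$1"
}
```

For example, `count_indented_tags Test-PB-FR_ES.vrt` reports how many tag lines are affected, and `strip_leading_ws Test-PB-FR_ES.vrt > cleaned.vrt` writes a cleaned copy (cleaned.vrt is a placeholder name).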
best
Andrew.
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Philippe Baudrion
Sent: 10 June 2016 14:41
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Parallel Corpora
Thank you Susanne for your quick answer.
Until now I have only tried automatic indexing through CQPweb.
I guess I will need to dig a bit deeper into the CQP encoding options to get it to work.
Thank you for putting me on the right track, Philippe
On 06/10/2016 02:54 PM, Susanne Flach wrote:
Dear Philippe,
Have you tried declaring nested XML elements with :0 as described in Sec 4?
http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/node5.html
I’ve never had your problem, but I have always used the :0.
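As a rough illustration of the :0 notation, here is the encoding command from the original message with each -S declaration given a :0 suffix, following the pattern used in the Encoding Tutorial. This is a sketch, not a tested fix for this corpus: the wrapper name encode_fr_es and the CWB_ENCODE variable are hypothetical conveniences, and only the options already used in the original command appear here.

```shell
# encode_fr_es DATA_DIR VRT_FILE REGISTRY_FILE
# Hypothetical wrapper; set CWB_ENCODE to point at a non-default cwb-encode binary.
encode_fr_es() {
    "${CWB_ENCODE:-cwb-encode}" -xsB -c utf8 \
        -d "$1" -f "$2" -R "$3" \
        -S text:0+id+organisation+country+type+year+signature \
        -S s:0+id \
        -S seg:0+lang
}
```

It would be called with the three paths from the original command, e.g. `encode_fr_es /export/data/CQPweb_data/corpus/test_pb_fr_es /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt /export/data/CQPweb_data/registry/test_pb_fr_es`.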
Best,
Susanne
--
Susanne Flach, M.A.
Arbeitsbereich Linguistik
Institut für Englische Philologie
Freie Universität Berlin
Habelschwerdter Allee 45
14195 Berlin
NEW! Corpus tutorial with CQP: http://userpage.fu-berlin.de/~flach/corpling/
http://userpage.fu-berlin.de/~flach/
Room JK29/223
Phone +49 30 838 72311
On 10 Jun 2016, at 14:39, Philippe Baudrion <Philippe.Baudrion at unige.ch> wrote:
Dear all,
I am trying to index the following corpus structure but it is not working. Here is an extract of the corpus:
<text id="FR_DI_2000_1" organisation="CERD" country="Francia" type="Documento informativo" year="2000" signature="CERD/C/SR.1373">
<s id="1">
<seg lang="fr">
La
séance
est
ouverte
à
10h05
.
</seg>
<seg lang="es">
Se
declara
abierta
la
sesión
a
las
10.05
horas
.
</seg>
</s>
...
</text>
The corresponding files on disk remain empty:
> ll /export/data/CQPweb_data/corpus/test_pb_fr_es/
total 120
drwxr-xr-x 2 www-data www-data 4096 Jun 6 12:18 ./
drwxrwxr-x 58 www-data letrint 4096 Jun 6 12:18 ../
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.avs
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.avx
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.rng
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg.rng
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.avs
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.avx
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.rng
-rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s.rng
-rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.avs
-rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.avx
-rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.rng
-rw-r--r-- 1 www-data www-data 13 Jun 6 12:18 text_id.avs
-rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_id.avx
-rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_id.rng
...
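A quick way to see which index files came out empty after an encoding run, rather than reading through the full listing, might look like this. It is a sketch assuming a POSIX find; the function name list_empty_files is made up for illustration.

```shell
# list_empty_files DIR: print the regular files under DIR whose size is 0 bytes
list_empty_files() {
    find "$1" -type f -size 0 | sort
}
```

Running `list_empty_files /export/data/CQPweb_data/corpus/test_pb_fr_es` should then print only the zero-length files, which makes it easy to compare results between encoding attempts.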
The indexing commands are as follows:
> cwb-encode -xsB -c utf8 -d /export/data/CQPweb_data/corpus/test_pb_fr_es -f /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt -R "/export/data/CQPweb_data/registry/test_pb_fr_es" -S text+id+organisation+country+type+year+signature -S s+id -S seg+lang 2>&1
> cwb-makeall -r "/export/data/CQPweb_data/registry" -V TEST_PB_FR_ES 2>&1
I suspect that the repetition of the <seg> element makes it impossible to index this corpus correctly, but I would like your opinion on that.
If it is possible, what would the correct indexing command be?
Thank you for your help, greetings,
--
Baudrion Philippe
IT Correspondent
UNIVERSITE DE GENEVE
Faculté de traduction et d'interprétation
40, bd. du Pont d'Arve
1211 GENEVE 4
Tel. +41 22 379 94 95