[CWB] Parallel Corpora

Philippe Baudrion Philippe.Baudrion at unige.ch
Fri Jun 10 14:39:41 CEST 2016


Dear all,
I am trying to index the following corpus structure but it is not 
working. Here is an extract of the corpus:

<text id="FR_DI_2000_1" organisation="CERD" country="Francia" 
type="Documento informativo" year="2000" signature="CERD/C/SR.1373">
     <s id="1">
         <seg lang="fr">
La
séance
est
ouverte
à
10h05
.
</seg>
         <seg lang="es">
Se
declara
abierta
la
sesión
a
las
10.05
horas
.
         </seg>
     </s>
...
</text>

The resulting index files for the <s> and <seg> attributes remain empty (0 bytes), while those for <text> are populated:

> ll /export/data/CQPweb_data/corpus/test_pb_fr_es/
           total 120
           drwxr-xr-x  2 www-data www-data 4096 Jun  6 12:18 ./
           drwxrwxr-x 58 www-data letrint  4096 Jun  6 12:18 ../
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avs
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avx
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.rng
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg.rng
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avs
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avx
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.rng
           -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s.rng
           -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avs
           -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avx
           -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.rng
           -rw-r--r--  1 www-data www-data   13 Jun  6 12:18 text_id.avs
           -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.avx
           -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.rng
           ...
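
To pick out the empty files without reading the whole listing, a small helper such as the following can be used. This is only a sketch: the function name list_empty_files is made up for illustration, and it assumes a find that supports -maxdepth (GNU or BSD).

```shell
# Hypothetical helper: print every zero-byte regular file that sits
# directly inside the given CWB data directory, one path per line.
list_empty_files() {
    find "$1" -maxdepth 1 -type f -size 0 | sort
}

# Example call, using the data directory from the listing above:
# list_empty_files /export/data/CQPweb_data/corpus/test_pb_fr_es
```

For the listing above, this would report the s and seg files but not the text_* files, which have non-zero sizes.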


The indexing commands are as follows:

> cwb-encode -xsB -c utf8 -d /export/data/CQPweb_data/corpus/test_pb_fr_es -f /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt -R "/export/data/CQPweb_data/registry/test_pb_fr_es"  -S text+id+organisation+country+type+year+signature -S s+id -S seg+lang 2>&1
> cwb-makeall -r "/export/data/CQPweb_data/registry" -V TEST_PB_FR_ES 2>&1
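
As a sanity check on the input, the number of opening tags per element in the .vrt file can be counted and compared with what the encoder is expected to index. This is a rough diagnostic sketch, not part of CWB: the function name count_tag is invented here, and it assumes that each tag sits on its own line, possibly indented, as in the extract above.

```shell
# Hypothetical helper: count lines that consist of an opening tag
# <TAG ...> or <TAG>, allowing leading whitespace.
# Usage: count_tag TAG FILE
count_tag() {
    # grep -c prints 0 but exits with status 1 when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c -E "^[[:space:]]*<$1([[:space:]>])" "$2" || true
}

# Example calls, using the upload path from the command above:
# count_tag s   /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt
# count_tag seg /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt
```

The pattern requires whitespace or ">" right after the tag name, so counting s does not also count seg.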

I suspect that the repetition of the <seg> element within each <s> makes 
it impossible to index this corpus correctly, but I would like your 
opinion on that. If it is possible, what would the correct indexing 
command be?
Thank you for your help. Greetings,

-- 
Baudrion Philippe
Correspondant Informatique

UNIVERSITE DE GENEVE
Faculté de traduction et d'interprétation
40, bd. du Pont d'Arve
1211 GENEVE 4

Tél +41 22 379 94 95
