[CWB] Parallel Corpora

Philippe Baudrion Philippe.Baudrion at unige.ch
Fri Jun 10 15:54:40 CEST 2016


You are right: I created the file by hand, so I indented it.
I will try again without the indentation and let you know.
Thank you.
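
For reference, here is a minimal sketch of how the indentation could be removed without editing the file by hand, assuming a POSIX sed is available. The function name strip_leading_ws and the file names in the usage comment are placeholders of mine, not part of CWB. It only touches lines whose first non-blank character is "<", so token lines are left as they are.

```shell
#!/usr/bin/env bash
# Sketch: remove whitespace in front of XML tags in a .vrt file so that
# tags such as <s> and <seg> start at the beginning of their line.
# strip_leading_ws is a hypothetical helper name, not a CWB tool.
strip_leading_ws() {
    # Reads the files given as arguments (or stdin if there are none)
    # and writes the cleaned text to stdout.
    sed -e 's/^[[:space:]]*</</' "$@"
}

# Example usage (file names are placeholders):
#   strip_leading_ws Test-PB-FR_ES.vrt > Test-PB-FR_ES.clean.vrt
```

The cleaned file would then be passed to cwb-encode with -f in place of the original.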

On 06/10/2016 03:43 PM, Hardie, Andrew wrote:
>
> It looks like your input file has whitespace at the start of the lines 
> containing the <seg> and <s> elements.
>
> Is this the case, or is it just an artefact of the email?
>
> If there really is whitespace there, that is your problem. XML tags
> should not be preceded on the line by whitespace.
>
> best
>
> Andrew.
>
> From: cwb-bounces at sslmit.unibo.it
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Philippe Baudrion
> Sent: 10 June 2016 14:41
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Parallel Corpora
>
> Thank you Susanne for your quick answer.
> Until now I have only tried automatic indexing through CQPweb.
> I guess I will need to dig a bit deeper into the CQP encoding options
> to get it to work.
> Thank you for putting me on the right track, Philippe
>
> On 06/10/2016 02:54 PM, Susanne Flach wrote:
>
>     Dear Philippe,
>
>     Have you tried declaring nested XML elements with :0 as described
>     in Sec 4?
>
>     http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/node5.html
>
>     I’ve never had your problem, but I have always used the :0.
>
>     Best,
>
>     Susanne
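
A sketch of what the :0 declarations could look like when applied to the cwb-encode call quoted further down. This is an assumption based on the tutorial section linked above, not a tested fix for this corpus. The helper name print_encode_cmd and its three path parameters are hypothetical, and the function only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: print a cwb-encode call in which every s-attribute is declared
# with ":0" (no recursive nesting) in front of its XML attributes.
# print_encode_cmd and its parameters are hypothetical names.
print_encode_cmd() {
    local data_dir="$1" vrt_file="$2" registry_file="$3"
    # Each argument is printed followed by a single space.
    printf '%s ' cwb-encode -xsB -c utf8 \
        -d "$data_dir" -f "$vrt_file" -R "$registry_file" \
        -S text:0+id+organisation+country+type+year+signature \
        -S s:0+id
    # The last declaration ends the line.
    printf '%s\n' '-S seg:0+lang'
}

# Example usage (paths taken from the command quoted in this thread):
#   print_encode_cmd /export/data/CQPweb_data/corpus/test_pb_fr_es \
#       /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt \
#       /export/data/CQPweb_data/registry/test_pb_fr_es
```

Printing the command first makes it easy to compare against the original call before running anything.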
>
>
>     --
>     Susanne Flach, M.A.
>     Arbeitsbereich Linguistik
>     Institut für Englische Philologie
>     Freie Universität Berlin
>     Habelschwerdter Allee 45
>     14195 Berlin
>
>     NEW! Corpus tutorial with CQP
>     <http://userpage.fu-berlin.de/%7Eflach/corpling/>
>
>     http://userpage.fu-berlin.de/~flach/
>
>     Room JK29/223
>     Phone +49 30 838 72311
>
>         On 10 Jun 2016, at 14:39, Philippe Baudrion
>         <Philippe.Baudrion at unige.ch> wrote:
>
>         Dear all,
>         I am trying to index the following corpus structure but it is
>         not working. Here is an extract of the corpus:
>
>         <text id="FR_DI_2000_1" organisation="CERD" country="Francia"
>         type="Documento informativo" year="2000"
>         signature="CERD/C/SR.1373">
>             <s id="1">
>                 <seg lang="fr">
>         La
>         séance
>         est
>         ouverte
>         à
>         10h05
>         .
>         </seg>
>                 <seg lang="es">
>         Se
>         declara
>         abierta
>         la
>         sesión
>         a
>         las
>         10.05
>         horas
>         .
>                 </seg>
>             </s>
>         ...
>         </text>
>
>         The corresponding files on disk remain empty:
>
>         > ll /export/data/CQPweb_data/corpus/test_pb_fr_es/
>         total 120
>         drwxr-xr-x  2 www-data www-data 4096 Jun  6 12:18 ./
>         drwxrwxr-x 58 www-data letrint  4096 Jun  6 12:18 ../
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avs
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avx
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.rng
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg.rng
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avs
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avx
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.rng
>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s.rng
>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avs
>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avx
>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.rng
>         -rw-r--r--  1 www-data www-data   13 Jun  6 12:18 text_id.avs
>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.avx
>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.rng
>         ...
>
>
>         The indexing command is as follows:
>
>         > cwb-encode -xsB -c utf8 -d /export/data/CQPweb_data/corpus/test_pb_fr_es -f /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt -R "/export/data/CQPweb_data/registry/test_pb_fr_es"  -S text+id+organisation+country+type+year+signature -S s+id -S seg+lang 2>&1
>
>         > cwb-makeall -r "/export/data/CQPweb_data/registry" -V TEST_PB_FR_ES 2>&1
>
>         I guess that the repetition of the <seg> element makes it
>         impossible to index this corpus correctly, but I would like
>         your opinion on that.
>
>         If it is possible, what would the correct indexing command
>         be?
>
>         Thank you for your help, greetings,
>
>         -- 
>
>         Baudrion Philippe
>
>         Correspondant Informatique
>
>         UNIVERSITE DE GENEVE
>
>         Faculté de traduction et d'interprétation
>
>         40, bd. du Pont d'Arve
>
>         1211 GENEVE 4
>
>         Tél +41 22 379 94 95
>
>
>
>

-- 
Baudrion Philippe
Correspondant Informatique

UNIVERSITE DE GENEVE
Faculté de traduction et d'interprétation
40, bd. du Pont d'Arve
1211 GENEVE 4

Tél +41 22 379 94 95
