[CWB] Parallel Corpora

Philippe Baudrion Philippe.Baudrion at unige.ch
Fri Jun 10 16:46:35 CEST 2016


Yes indeed, without the leading whitespace the files are now populated.
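
For the archive, here is a minimal sketch of one way to strip the indentation, assuming a POSIX shell with sed. The function name strip_leading_ws and the output file name are only illustrative, and the sketch assumes that no token line should legitimately begin with whitespace:

```shell
# Hypothetical helper (not part of CWB): print the given file with leading
# spaces and tabs removed from every line, so that XML tags such as <s> and
# <seg> start in the first column.
strip_leading_ws() {
    sed 's/^[[:space:]]*//' "$1"
}

# Example usage (file names are illustrative only):
#   strip_leading_ws Test-PB-FR_ES.vrt > Test-PB-FR_ES.clean.vrt
```

The cleaned file can then be passed to cwb-encode with -f in place of the original.
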
Thank you again Andrew.
Best, Philippe


On 06/10/2016 03:54 PM, Philippe Baudrion wrote:
> You are right: I created the file manually, so I used indentation.
> I will try without it and let you know.
> Thank you.
>
> On 06/10/2016 03:43 PM, Hardie, Andrew wrote:
>>
>> It looks like your input file has whitespace at the start of the 
>> lines containing the <seg> and <s> elements.
>>
>> Is this the case, or is it just an artefact of the email?
>>
>> IF there really is whitespace there, that is your problem. XML tags 
>> should not be preceded on the line by whitespace.
>>
>> best
>>
>> Andrew.
>>
>> From: cwb-bounces at sslmit.unibo.it On Behalf Of Philippe Baudrion
>> Sent: 10 June 2016 14:41
>> To: Open source development of the Corpus WorkBench
>> Subject: Re: [CWB] Parallel Corpora
>>
>> Thank you Susanne for your quick answer.
>> Until now I have only tried automatic indexing through CQPweb.
>> I guess I will need to dig a bit deeper into the CQP encoding options
>> to get it to work.
>> Thank you for putting me on the right track, Philippe
>>
>> On 06/10/2016 02:54 PM, Susanne Flach wrote:
>>
>>     Dear Philippe,
>>
>>     Have you tried declaring nested XML elements with :0 as described
>>     in Sec 4?
>>
>>     http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/node5.html
>>
>>     I’ve never had your problem, but I have always used the :0.
>>
>>     Best,
>>
>>     Susanne
>>
>>
>>     --
>>     Susanne Flach, M.A.
>>     Arbeitsbereich Linguistik
>>     Institut für Englische Philologie
>>     Freie Universität Berlin
>>     Habelschwerdter Allee 45
>>     14195 Berlin
>>
>>     NEW! Corpus tutorial with CQP
>>     <http://userpage.fu-berlin.de/%7Eflach/corpling/>
>>
>>     http://userpage.fu-berlin.de/~flach/
>>
>>     Raum JK29/223
>>     Telefon +49 30 838 72311
>>
>>         On 10 Jun 2016, at 14:39, Philippe Baudrion
>>         <Philippe.Baudrion at unige.ch> wrote:
>>
>>         Dear all,
>>         I am trying to index the following corpus structure but it is
>>         not working. Here is an extract of the corpus:
>>
>>         <text id="FR_DI_2000_1" organisation="CERD" country="Francia"
>>         type="Documento informativo" year="2000"
>>         signature="CERD/C/SR.1373">
>>             <s id="1">
>>                 <seg lang="fr">
>>         La
>>         séance
>>         est
>>         ouverte
>>         à
>>         10h05
>>         .
>>         </seg>
>>                 <seg lang="es">
>>         Se
>>         declara
>>         abierta
>>         la
>>         sesión
>>         a
>>         las
>>         10.05
>>         horas
>>         .
>>                 </seg>
>>             </s>
>>         ...
>>         </text>
>>
>>         The corresponding files on disk remain empty:
>>
>>         > ll /export/data/CQPweb_data/corpus/test_pb_fr_es/
>>         total 120
>>         drwxr-xr-x  2 www-data www-data 4096 Jun  6 12:18 ./
>>         drwxrwxr-x 58 www-data letrint  4096 Jun  6 12:18 ../
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avs
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.avx
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg_lang.rng
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 seg.rng
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avs
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.avx
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s_id.rng
>>         -rw-r--r--  1 www-data www-data    0 Jun  6 12:18 s.rng
>>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avs
>>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.avx
>>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_country.rng
>>         -rw-r--r--  1 www-data www-data   13 Jun  6 12:18 text_id.avs
>>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.avx
>>         -rw-r--r--  1 www-data www-data    8 Jun  6 12:18 text_id.rng
>>         ...
>>
>>         The indexing commands are as follows:
>>
>>         > cwb-encode -xsB -c utf8 -d /export/data/CQPweb_data/corpus/test_pb_fr_es -f /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt -R "/export/data/CQPweb_data/registry/test_pb_fr_es"  -S text+id+organisation+country+type+year+signature -S s+id -S seg+lang 2>&1
>>
>>         > cwb-makeall -r "/export/data/CQPweb_data/registry" -V TEST_PB_FR_ES 2>&1
>>
>>         I guess that, because the <seg> element is repeated, it is
>>         impossible to index this corpus correctly, but I would like
>>         your opinion on that.
>>
>>         If it is possible, what would the correct indexing command be?
>>
>>         Thank you for your help, greetings,
>>
>>         -- 
>>         Baudrion Philippe
>>         Correspondant Informatique
>>         UNIVERSITE DE GENEVE
>>         Faculté de traduction et d'interprétation
>>         40, bd. du Pont d'Arve
>>         1211 GENEVE 4
>>         Tél +41 22 379 94 95
>>
>>         _______________________________________________
>>         CWB mailing list
>>         CWB at sslmit.unibo.it
>>         http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>>

-- 
Baudrion Philippe
Correspondant Informatique

UNIVERSITE DE GENEVE
Faculté de traduction et d'interprétation
40, bd. du Pont d'Arve
1211 GENEVE 4

Tél +41 22 379 94 95
