[CWB] Parallel Corpora
Philippe Baudrion
Philippe.Baudrion at unige.ch
Fri Jun 10 16:46:35 CEST 2016
Yes indeed, the files are now populated...
Thank you again Andrew.
Best, Philippe
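
For anyone hitting the same symptom, the fix confirmed here (removing the indentation in front of the XML tags, as Andrew suggests in the quoted message below) can be sketched in shell. This is only a minimal sketch, not a command taken from this thread: the function name strip_leading_ws is hypothetical, and it assumes a POSIX sed is available.

```shell
#!/bin/sh
# Hypothetical helper (not part of CWB): remove spaces and tabs at the
# start of every line, so that XML tags begin in the first column.
strip_leading_ws() {
    sed 's/^[[:space:]]*//'
}

# Demo on two indented tag lines and one token line resembling the
# corpus extract; lines without leading whitespace pass through unchanged.
printf '  <s id="1">\n\t<seg lang="fr">\nLa\n' | strip_leading_ws
```

Applied to a file, this might look like `strip_leading_ws < indented.vrt > clean.vrt` (both file names are placeholders), with the cleaned file then passed to cwb-encode via -f.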
On 06/10/2016 03:54 PM, Philippe Baudrion wrote:
> You are right: I created the file manually, so I used
> indentation.
> I will try without it and let you know.
> Thank you.
>
> On 06/10/2016 03:43 PM, Hardie, Andrew wrote:
>>
>> It looks like your input file has whitespace at the start of the
>> lines containing the <seg> and <s> elements.
>>
>> Is this the case, or is it just an artefact of the email?
>>
>> If there really is whitespace there, that is your problem. XML tags
>> should not be preceded on the line by whitespace.
>>
>> best
>>
>> Andrew.
>>
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Philippe Baudrion
>> Sent: 10 June 2016 14:41
>> To: Open source development of the Corpus WorkBench
>> Subject: Re: [CWB] Parallel Corpora
>>
>> Thank you Susanne for your quick answer.
>> Until now I have only tried automatic indexing through CQPweb.
>> I guess I will need to dig a bit deeper into the CQP encoding options
>> to get it working.
>> Thank you for putting me on the right track, Philippe
>>
>> On 06/10/2016 02:54 PM, Susanne Flach wrote:
>>
>> Dear Philippe,
>>
>> Have you tried declaring nested XML elements with :0 as described
>> in Sec 4?
>>
>> http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/node5.html
>>
>> I’ve never had your problem, but I have always used the :0.
>>
>> Best,
>>
>> Susanne
>>
>>
>> --
>> Susanne Flach, M.A.
>> Arbeitsbereich Linguistik
>> Institut für Englische Philologie
>> Freie Universität Berlin
>> Habelschwerdter Allee 45
>> 14195 Berlin
>>
>> NEW! Corpus tutorial with CQP
>> http://userpage.fu-berlin.de/~flach/corpling/
>>
>> http://userpage.fu-berlin.de/~flach/
>>
>> Raum JK29/223
>> Telefon +49 30 838 72311
>>
>> On 10 Jun 2016, at 14:39, Philippe Baudrion
>> <Philippe.Baudrion at unige.ch> wrote:
>>
>> Dear all,
>> I am trying to index the following corpus structure but it is
>> not working. Here is an extract of the corpus:
>>
>> <text id="FR_DI_2000_1" organisation="CERD" country="Francia"
>> type="Documento informativo" year="2000"
>> signature="CERD/C/SR.1373">
>> <s id="1">
>> <seg lang="fr">
>> La
>> séance
>> est
>> ouverte
>> à
>> 10h05
>> .
>> </seg>
>> <seg lang="es">
>> Se
>> declara
>> abierta
>> la
>> sesión
>> a
>> las
>> 10.05
>> horas
>> .
>> </seg>
>> </s>
>> ...
>> </text>
>>
>> The corresponding files on disk remain empty:
>>
>> > ll /export/data/CQPweb_data/corpus/test_pb_fr_es/
>> total 120
>> drwxr-xr-x 2 www-data www-data 4096 Jun 6 12:18 ./
>> drwxrwxr-x 58 www-data letrint 4096 Jun 6 12:18 ../
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.avs
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.avx
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.rng
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg.rng
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.avs
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.avx
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.rng
>> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s.rng
>> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.avs
>> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.avx
>> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.rng
>> -rw-r--r-- 1 www-data www-data 13 Jun 6 12:18 text_id.avs
>> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_id.avx
>> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_id.rng
>> ...
>>
>> The indexing command is as follows:
>>
>> > cwb-encode -xsB -c utf8 -d /export/data/CQPweb_data/corpus/test_pb_fr_es -f /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt -R "/export/data/CQPweb_data/registry/test_pb_fr_es" -S text+id+organisation+country+type+year+signature -S s+id -S seg+lang 2>&1
>>
>> > cwb-makeall -r "/export/data/CQPweb_data/registry" -V TEST_PB_FR_ES 2>&1
>>
>> I guess that, because the <seg> element is repeated, it is
>> impossible to index this corpus correctly, but I would like
>> your opinion on that.
>>
>> If it is possible, what would the correct indexing command
>> be?
>>
>> Thank you for your help, greetings,
>>
>> --
>>
>> Baudrion Philippe
>>
>> Correspondant Informatique
>>
>> UNIVERSITE DE GENEVE
>>
>> Faculté de traduction et d'interprétation
>>
>> 40, bd. du Pont d'Arve
>>
>> 1211 GENEVE 4
>>
>> Tél +41 22 379 94 95
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>>
>>
>>
>