[CWB] Parallel Corpora
Philippe Baudrion
Philippe.Baudrion at unige.ch
Fri Jun 10 15:54:40 CEST 2016
You are right: I created the file manually, so I used indentation.
I will try without it and let you know.
Thank you.
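For anyone hitting the same problem, the indentation can be removed mechanically rather than by hand. A minimal sketch, assuming a POSIX shell with sed; the function name strip_indent and the output file name are hypothetical, not part of CWB:

```shell
# Hypothetical helper (not a CWB tool): print the given file with
# leading spaces and tabs removed from every line, so that XML tags
# such as <s> and <seg> start in the first column.
strip_indent() {
    sed 's/^[[:space:]]*//' "$1"
}

# Example call (the output file name is an assumption):
#   strip_indent Test-PB-FR_ES.vrt > Test-PB-FR_ES.noindent.vrt
```

The original file is left untouched, so the result can be compared with the input before re-running the indexing.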
On 06/10/2016 03:43 PM, Hardie, Andrew wrote:
>
> It looks like your input file has whitespace at the start of the lines
> containing the <seg> and <s> elements.
>
> Is this the case, or is it just an artefact of the email?
>
> If there really is whitespace there, that is your problem: XML tags
> must not be preceded by whitespace on their line.
>
> best
>
> Andrew.
>
> From: cwb-bounces at sslmit.unibo.it
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Philippe Baudrion
> Sent: 10 June 2016 14:41
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Parallel Corpora
>
> Thank you Susanne for your quick answer.
> Until now I have only tried automatic indexing through CQPweb.
> I guess I will need to dig a bit deeper into the CQP encoding options
> to get it working.
> Thank you for putting me on the right track, Philippe
>
> On 06/10/2016 02:54 PM, Susanne Flach wrote:
>
> Dear Philippe,
>
> Have you tried declaring nested XML elements with :0 as described
> in Sec 4?
>
> http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/node5.html
>
> I’ve never had your problem, but I have always used the :0.
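A sketch of what the :0 declaration might look like for this corpus, reusing the paths and attribute names from the indexing command quoted further down. The :0 suffix on each -S attribute is the nesting-depth declaration from the tutorial linked above, stating that the element is not expected to nest within itself; whether it changes the outcome here is untested. The CWB_ENCODE and BASE variables and the function name are assumptions made for illustration only:

```shell
# Assumption: cwb-encode is on PATH. CWB_ENCODE can be pointed at
# another command (for example a dry-run wrapper) before calling.
CWB_ENCODE="${CWB_ENCODE:-cwb-encode}"
BASE=/export/data/CQPweb_data

# Hypothetical wrapper: same options as the original command, but
# with :0 added to each structural attribute declaration.
encode_test_corpus() {
    "$CWB_ENCODE" -xsB -c utf8 \
        -d "$BASE/corpus/test_pb_fr_es" \
        -f "$BASE/upload/Test-PB-FR_ES.vrt" \
        -R "$BASE/registry/test_pb_fr_es" \
        -S text:0+id+organisation+country+type+year+signature \
        -S s:0+id \
        -S seg:0+lang
}
```

Calling encode_test_corpus runs the encoding step; cwb-makeall would still have to be run afterwards as in the original message.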
>
> Best,
>
> Susanne
>
>
> --
> Susanne Flach, M.A.
> Arbeitsbereich Linguistik
> Institut für Englische Philologie
> Freie Universität Berlin
> Habelschwerdter Allee 45
> 14195 Berlin
>
> NEW! Corpus tutorial with CQP
> <http://userpage.fu-berlin.de/%7Eflach/corpling/>
>
> http://userpage.fu-berlin.de/~flach/
>
> Room JK29/223
> Phone +49 30 838 72311
>
> On 10 Jun 2016, at 14:39, Philippe Baudrion
> <Philippe.Baudrion at unige.ch> wrote:
>
> Dear all,
> I am trying to index the following corpus structure but it is
> not working. Here is an extract of the corpus:
>
> <text id="FR_DI_2000_1" organisation="CERD" country="Francia"
> type="Documento informativo" year="2000"
> signature="CERD/C/SR.1373">
> <s id="1">
> <seg lang="fr">
> La
> séance
> est
> ouverte
> à
> 10h05
> .
> </seg>
> <seg lang="es">
> Se
> declara
> abierta
> la
> sesión
> a
> las
> 10.05
> horas
> .
> </seg>
> </s>
> ...
> </text>
>
> The corresponding files on disk remain empty:
>
> > ll /export/data/CQPweb_data/corpus/test_pb_fr_es/
> total 120
> drwxr-xr-x 2 www-data www-data 4096 Jun 6 12:18 ./
> drwxrwxr-x 58 www-data letrint 4096 Jun 6 12:18 ../
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.avs
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.avx
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg_lang.rng
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 seg.rng
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.avs
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.avx
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s_id.rng
> -rw-r--r-- 1 www-data www-data 0 Jun 6 12:18 s.rng
> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.avs
> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.avx
> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_country.rng
> -rw-r--r-- 1 www-data www-data 13 Jun 6 12:18 text_id.avs
> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_id.avx
> -rw-r--r-- 1 www-data www-data 8 Jun 6 12:18 text_id.rng
> ...
>
>
> The indexing commands are as follows:
>
> > cwb-encode -xsB -c utf8 -d /export/data/CQPweb_data/corpus/test_pb_fr_es -f /export/data/CQPweb_data/upload/Test-PB-FR_ES.vrt -R "/export/data/CQPweb_data/registry/test_pb_fr_es" -S text+id+organisation+country+type+year+signature -S s+id -S seg+lang 2>&1
>
> > cwb-makeall -r "/export/data/CQPweb_data/registry" -V TEST_PB_FR_ES 2>&1
>
> I guess that, because the <seg> element is repeated, it is
> impossible to index this corpus correctly, but I would like
> your opinion on that.
>
> If it is possible, what would the correct indexing command be?
>
> Thank you for your help, greetings,
>
> --
>
> Baudrion Philippe
>
> Correspondant Informatique
>
> UNIVERSITE DE GENEVE
>
> Faculté de traduction et d'interprétation
>
> 40, bd. du Pont d'Arve
>
> 1211 GENEVE 4
>
> Tél +41 22 379 94 95
>
--
Baudrion Philippe
Correspondant Informatique
UNIVERSITE DE GENEVE
Faculté de traduction et d'interprétation
40, bd. du Pont d'Arve
1211 GENEVE 4
Tél +41 22 379 94 95