[CWB] Re: CWB in Tanzania

Wed Sep 8 10:15:06 CEST 2010

Dear Gabriele,

TMX is relatively easy to convert to a format accepted by CWB, but you do need a bit of programming expertise.  I've created a simple Perl script, but it might be too simple for your data:
http://corpus.leeds.ac.uk/tools/tmx2csar.pl
It creates one file for each language in your TMX file; these files can be processed by TreeTagger and later encoded into CWB with the alignment preserved.
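For readers who want to see the general idea without Perl, here is a minimal Python sketch of that kind of conversion. It is not the tmx2csar.pl script and makes no claim about what that script outputs. It assumes a standard TMX layout (<tu> elements holding one <tuv> per language, each with a <seg>), and the function names and the output file naming are hypothetical. Tokenisation, tagging and the CWB encoding itself are left out.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_segments(tmx_text):
    """Return {language: [segment, ...]}, one list entry per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    languages = []  # language codes in order of first appearance
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files use a plain "lang" attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            lang = lang.lower()
            # Drop inline markup inside <seg> and normalise whitespace.
            unit[lang] = " ".join("".join(seg.itertext()).split())
            if lang not in languages:
                languages.append(lang)
        units.append(unit)
    # A unit lacking some language yields an empty line, so line N in every
    # file still belongs to translation unit N.
    return {lang: [u.get(lang, "") for u in units] for lang in languages}


def write_language_files(segments, prefix):
    """Write <prefix>.<lang>.txt per language (hypothetical naming scheme)."""
    paths = []
    for lang, lines in segments.items():
        path = Path(f"{prefix}.{lang}.txt")
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

A typical call would be write_language_files(tmx_to_segments(Path("my.tmx").read_bytes()), "my"), where "my.tmx" stands for your own file. Keeping the segments line-parallel is one simple way to carry the TMX alignment through later processing steps.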
Good luck,
Serge
________________________________________
From: cwb-bounces at sslmit.unibo.it [cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert [stefan.evert at uos.de]
Sent: 07 September 2010 23:08
To: Gabriele Brandolini
Cc: CWBdev Mailing List
Subject: [CWB] Re: CWB in Tanzania

Dear Gabriele,

thanks for your e-mail.  It's fascinating to see in which parts of the world and for what languages the CWB is now being used!

> I'm an Italian missionary who has been working in Tanzania since 1983.
> I'm doing translation work, mostly with these language pairs: Latin-Swahili, Italian-Swahili, and English-Swahili.
> I've prepared a good TMX for Latin-Swahili, in part from my own translations, and in part created automatically from parallel texts.
>
> I've also collected a medium-sized corpus of Swahili, a good deal of it by using BootCaT.
> I was also in contact with Helmut Schmid, in order to prepare the TreeTagger parameters for Swahili.
>
> Now I'm just trying to see how CWB 3.0 works. So far, I can run simple queries on my Swahili corpus tagged with TreeTagger.

Good.

> I would like to ask you if there is a tutorial explaining how to work with bilingual corpora, possibly starting from my TMX files. Otherwise, could you tell me briefly how to do it?
>
> I also tried to align files by using the cwb-align command, but I didn't manage it. To me, not so expert in programming and in Perl code, the documentation given as "help" for the cwb-align command is too concise (and somewhat cryptic).

I'm afraid that part of the corpus encoding tutorial is still missing -- I should have written it several months ago, but more urgent things keep popping up so that I have to postpone the documentation work.

I hope to find the time to update the tutorial within the next two weeks.  However, some programming skills and experience will probably be required, especially if you want to import an existing alignment rather than run the CWB's own sentence aligner (which isn't extremely good).

I'm not familiar with the acronym TMX, but a Google search indicates that it is an XML-based alignment format.  I'm not aware of an existing utility that would help you to import files in this format into the CWB, but perhaps some other CWB users have already dealt with such files.  It's a good idea to ask such questions on the CWB mailing list (sign up here: http://devel.sslmit.unibo.it/mailman/listinfo/cwb) -- they might also be able to give you some tips on how to work with alignment in CQP and otherwise.  I'm CC:ing this message to the list so they know what we've been talking about.

Best wishes,
Stefan

PS: Here's a mini-tutorial for first steps with basic sentence alignment in the CWB.

Just so you can give it a try, here's a simple example run of the aligner: Let us assume that I have a parallel German-English corpus whose parts are encoded as VMGERMAN and VMENGLISH in the CWB, with sentences marked by <s> tags, i.e. a structural attribute "s" in the CWB.

Then I would start by running the sentence aligner:

        cwb-align -v -o vm_ger_eng.align VMGERMAN VMENGLISH s

This will run for a while (about a minute for a 300,000-word corpus) and then produce a text file vm_ger_eng.align with the sentence alignment information.  To check whether the aligner worked correctly, you can view this file with

        cwb-align-show vm_ger_eng.align

(you can use the "-w" option for a wider display, depending on the size of your terminal window).  Press Return to display the next alignment pair, "h" for other key commands, and "q" to exit the viewer.

If you're satisfied with the alignment quality, the next step is to encode the alignment information in the CWB so that you can use it from CQP (I'm afraid that section of the CQP query tutorial is also still missing, though).  The basic tool you need is "cwb-align-encode", which transforms the alignment text file into the binary format used by the CWB. First, though, you have to declare the new alignment attribute in the registry file.  For the German-English alignment, edit the registry file of VMGERMAN, adding the line

        ALIGNED vmenglish

(note the lowercase spelling of the attribute name!). If you've got the CWB/Perl tools installed, you can easily do this from the command line with

        cwb-regedit VMGERMAN :add :a vmenglish

Finally, encode the alignment attribute:

        cwb-align-encode -D vm_ger_eng.align

Don't be surprised if the command terminates immediately and there is no output -- encoding alignment is a very fast process.

Now you can delete the text file "vm_ger_eng.align".  Note that in this way, you've only generated a German-English alignment that's accessible from the VMGERMAN corpus -- if you also need the English-German alignment, repeat all steps above with the two corpora swapped.
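To make the "swap the two corpora" step concrete, here is a small Python sketch that builds the command lines from the steps above for both directions. It only reuses the commands and options shown in this message. The function names and the naming of the .align file are hypothetical, the manual check with cwb-align-show is left out, and by default nothing is executed (dry run), since running the commands requires the CWB and CWB/Perl tools to be installed.

```python
import subprocess


def alignment_commands(source, target, s_attr="s"):
    """Commands to align `source` to `target` and encode the result."""
    # Hypothetical file naming: <source>_<target>.align in lowercase.
    align_file = f"{source.lower()}_{target.lower()}.align"
    return [
        # 1. run the sentence aligner on the s_attr regions
        ["cwb-align", "-v", "-o", align_file, source, target, s_attr],
        # 2. declare the alignment attribute (lowercase!) in the registry
        ["cwb-regedit", source, ":add", ":a", target.lower()],
        # 3. encode the alignment into the binary CWB format
        ["cwb-align-encode", "-D", align_file],
    ]


def both_directions(corpus_a, corpus_b, s_attr="s"):
    """Same steps for A->B and then for B->A."""
    return (alignment_commands(corpus_a, corpus_b, s_attr)
            + alignment_commands(corpus_b, corpus_a, s_attr))


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run_all(both_directions("VMGERMAN", "VMENGLISH")) prints the six commands so they can be inspected before anything is run.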

A brief note on using alignment information in CQP, for the VMGERMAN-VMENGLISH alignment.  There are basically two reasonable uses of sentence alignment (many other things would be possible, but haven't been implemented in CQP).  The following commands are typed in a CQP session (everything after a "#" character is a comment you don't have to type in):

        VMGERMAN;
        set Context 1 s;  # sentence alignment makes most sense if you're also viewing sentence context
        "Bahn.+";    # some CQP query, here German words starting with "Bahn-"
        show +vmenglish; # activate display of sentence alignment
        cat; # redisplays query result, now giving aligned sentence for every query match
        "Bahn.+" :VMENGLISH "rail.*"; # only those matches where aligned sentence contains "rail" or a similar word
        "Bahn.+" :VMENGLISH ! "rail.*"; # only those matches where aligned sentence does NOT contain "rail"

Hope this helps to give you a first impression.
