[CWB] Aligning parallel corpora
Hardie, Andrew
a.hardie at lancaster.ac.uk
Tue May 7 16:53:34 CEST 2019
Hi Graham,
>> I have not been able to find the English and German Holmes files used in Stefan Evert's tutorial
They aren’t currently in the release package (which I believe is probably in need of an update!) but they are in the SVN tree: https://sourceforge.net/p/cwb/code/HEAD/tree/doc/corpora/encoding_tutorial_data/
>> what exactly is the required input format for the cwb-align command?
cwb-align is an aligner; you don’t need to use it if you are dealing with pre-aligned data. What you need to generate is its output format, which then serves as the input to cwb-align-encode, the program that actually creates the a-attribute. This means that in the tutorial you can skip to sec 8.4.
The format necessary (called “.align file”) is described in the section “OUTPUT FORMAT” of man cwb-align. Pasted below.
You would need to output the begin/end points of the s-attribute regions in the first corpus (using cwb-s-decode), and then combine them with the begin/end points of the s-attribute regions in the second corpus, to get the 4-tuples needed by cwb-align-encode. Then you’d need to add 1:1 at the end of each line. That 5-column file, preceded by the header line described in the man page excerpt below, defines the alignment.
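The combination step above can be sketched in Python. This is a minimal sketch, not a tested pipeline: it assumes cwb-s-decode prints one region per line as start and end cpos separated by a tab (possibly followed by an annotation value), and that the two corpora have the same number of sentences in the same order. The function names and the corpus IDs CORP1 and CORP2 are placeholders for illustration.

```python
def parse_regions(s_decode_output):
    """Parse lines of 'start<TAB>end[<TAB>value]' into (start, end) integer pairs.

    Assumption: this is the layout of the cwb-s-decode output being used.
    """
    regions = []
    for line in s_decode_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        regions.append((int(fields[0]), int(fields[1])))
    return regions


def build_align(source_id, target_id, grid_att, source_regions, target_regions):
    """Return the text of an .align file pairing the n-th regions of both corpora 1:1."""
    if len(source_regions) != len(target_regions):
        raise ValueError(
            "region counts differ: %d vs %d"
            % (len(source_regions), len(target_regions))
        )
    # Header: source corpus, grid attribute, target corpus, grid attribute (repeated).
    lines = ["\t".join([source_id, grid_att, target_id, grid_att])]
    for (s_start, s_end), (t_start, t_end) in zip(source_regions, target_regions):
        # Five tab-separated columns; the optional quality field is left out.
        lines.append("%d\t%d\t%d\t%d\t1:1" % (s_start, s_end, t_start, t_end))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up cpos ranges standing in for real cwb-s-decode output.
    first = "0\t4\n5\t9\n"
    second = "0\t3\n4\t10\n"
    print(build_align("CORP1", "CORP2", "s",
                      parse_regions(first), parse_regions(second)), end="")
```

The length check is there because a single missing or extra sentence in either corpus would silently shift every later pair.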
>> If I have .vrt files created in two languages with treetagger, and if I have prealigned these, in such a way that the first sentence of one file corresponds to the first sentence of the other, the second sentence to the second, etc. then is that enough?
Yes, that is enough if you use the method above to create the input file for cwb-align-encode.
>> Or should my files also include numerical information with all sentences numbered?
If your ranges do actually have unique ID attributes, <s id=".."> or maybe <s n="..">, you can optionally use cwb-align-import instead of cwb-align-encode: see section 8.5. This has a different input format from cwb-align-encode: it identifies matching segments by some ID attribute, rather than by spelling out the corpus token position ranges. This is called a bead file, whereas the other is called an .align file (not the clearest terminology, I know).
Your sentences would have to be numbered consecutively through your whole corpus, in which case your bead file could look like this:
CORP1 CORP2 s {n}
1 1
2 2
3 3
[…]
which might be easier to auto-generate than the cwb-align-encode format based on raw corpus position numbers.
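Auto-generating such a bead file could look roughly like the following Python sketch. It assumes sentences are numbered 1 to N in both corpora via an attribute n on s, as in the example above. The function name is hypothetical, and the tab separator is an assumption, since the layout shown here may have lost its original whitespace; check the cwb-align-import documentation for the exact delimiter.

```python
def build_bead_file(source_id, target_id, s_att, key_att, n_sentences, sep="\t"):
    """Return bead-file text that pairs sentence i with sentence i for i = 1..N.

    Assumption: the field separator is a tab; pass sep=" " to mimic the
    layout shown in the message.
    """
    # Header: source corpus, target corpus, s-attribute, key pattern such as {n}.
    header = sep.join([source_id, target_id, s_att, "{" + key_att + "}"])
    rows = ["%d%s%d" % (i, sep, i) for i in range(1, n_sentences + 1)]
    return "\n".join([header] + rows) + "\n"


if __name__ == "__main__":
    # CORP1 and CORP2 are the placeholder corpus IDs from the example.
    print(build_bead_file("CORP1", "CORP2", "s", "n", 3), end="")
```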
Finally, whatever method you use, don’t forget the need to update the registry file with the new a-attribute.
best
Andrew.
=====================
From man cwb-align:
OUTPUT FORMAT
cwb-align's output file uses CWB's ".align" file format. ".align" files are ASCII text files
(although they may contain characters from another encoding if the corpus IDs include non-ASCII
characters), formatted as follows.
The first line is a header line, which contains the following four elements, separated by tabs:
· The ID of the source corpus
· The ID of the aligned s-attribute (the grid attribute - see above)
· The ID of the target corpus
· The ID of the aligned s-attribute (repeated)
Following the header, each individual line represents a single pair of aligned regions in the
corpus. This is specified by six fields of information, separated by tabs. The six fields are
as follows:
· The beginning of the range in the source corpus (expressed as a cpos, i.e. a token number)
· The end of the range in the source corpus (expressed as a cpos)
· The beginning of the range in the target corpus (expressed as a cpos)
· The end of the range in the target corpus (expressed as a cpos)
· The type of alignment: 1:1, 2:1, 1:2 or 2:2
· The quality of the alignment: a score calculated by the alignment engine
For example,
140 169 137 180 1:2
means that corpus position ranges [140,169] and [137,180] form a 1:2 alignment pair.
(The final field, the quality, is optional in this file format, and is absent in the example
above; however, cwb-align will always provide it.)
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Graham Ranger -- UAPV
Sent: 07 May 2019 13:57
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: [CWB] Aligning parallel corpora
Hello to all,
I have set up a parallel corpus on cqpweb using s-attributes for the visualisation of translations but I would like to be able to do the same thing more cleanly, using alignment attributes. However, try as I might, I cannot seem to follow the instructions in the encoding tutorial. I have not been able to find the English and German Holmes files used in Stefan Evert's tutorial for illustration. Now, what I would like to know is: what exactly is the required input format for the cwb-align command? If I have .vrt files created in two languages with treetagger, and if I have prealigned these, in such a way that the first sentence of one file corresponds to the first sentence of the other, the second sentence to the second, etc. then is that enough? Or should my files also include numerical information with all sentences numbered? I suspect this is a very naive question, but it's one that I do not seem to be able to find my way around without help!
Best,
Graham.