[CWB] parallel German and English corpus encoding

Fri Apr 8 11:08:47 CEST 2016

Hi Anne-Kathrin.

Re 1. To port existing alignments, you have two options:

First, via “beads”. A. Make sure that when you create your corpus, your sentences carry appropriate identifying s-attributes. B. Create a “bead file” which contains the alignment data. The format of this file is described in man cwb-align-import. Depending on what format your alignment data is in, you might need to do some scripting to massage the data into the right format. Once you have this file, use cwb-align-import to create the a-attribute.
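
As a rough illustration of the scripting step, here is a sketch that groups pairwise sentence links (as you might export them from a translation memory) into beads keyed by s-attribute values. The input layout, the sentence IDs, the function names and the tab-separated output are all assumptions made up for the example; the layout that cwb-align-import actually expects is the one in its man page, so adapt format_bead() to that.

```python
def links_to_beads(links):
    """Group (source_id, target_id) links into beads.

    Assumes the links are listed in corpus order. Consecutive links that
    share a source or a target sentence are merged, so that 1:2 and 2:1
    alignments each end up in a single bead.
    """
    beads = []
    for src, tgt in links:
        if beads and (src in beads[-1][0] or tgt in beads[-1][1]):
            src_ids, tgt_ids = beads[-1]
            if src not in src_ids:
                src_ids.append(src)
            if tgt not in tgt_ids:
                tgt_ids.append(tgt)
        else:
            beads.append(([src], [tgt]))
    return beads


def format_bead(bead):
    # Hypothetical output layout (source IDs, a tab, target IDs).
    # Replace with the layout documented in man cwb-align-import.
    src_ids, tgt_ids = bead
    return " ".join(src_ids) + "\t" + " ".join(tgt_ids)


# Made-up example data: the IDs are values of an identifying s-attribute.
links = [
    ("de_s1", "en_s1"),
    ("de_s2", "en_s2"),
    ("de_s2", "en_s3"),  # one German sentence, two English sentences
    ("de_s3", "en_s4"),
    ("de_s4", "en_s4"),  # two German sentences, one English sentence
]
beads = links_to_beads(links)
for bead in beads:
    print(format_bead(bead))
```

Because beads refer to sentences by ID, the two sides do not need the same number of sentences; they only need stable IDs.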

Second, via a “.align” file containing the alignment, as created by cwb-align; its format is described in man cwb-align. The difference from the previous option is that a bead file identifies aligned regions by values of an s-attribute, whereas a “.align” file must use actual corpus position numbers. Depending on the form your alignment data is in, this may be easier. A “.align” file is inserted into the index using cwb-align-encode.
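
To make the difference concrete, this sketch turns a bead expressed as sentence IDs into the corpus position ranges that the second option needs. The lookup tables (sentence ID to first and last token position), the IDs and the function name are hypothetical; in practice you would derive the tables from the encoded corpus, for instance by decoding the s-attribute with cwb-s-decode. The sketch deliberately stops at the numbers: how they are laid out on each line of a “.align” file should be taken from man cwb-align.

```python
def bead_to_positions(bead, src_spans, tgt_spans):
    """Return (src_start, src_end, tgt_start, tgt_end) for one bead.

    bead      -- (list of source sentence IDs, list of target sentence IDs)
    *_spans   -- dict mapping sentence ID to (first, last) corpus position
    """
    src_ids, tgt_ids = bead
    src_start = min(src_spans[i][0] for i in src_ids)
    src_end = max(src_spans[i][1] for i in src_ids)
    tgt_start = min(tgt_spans[i][0] for i in tgt_ids)
    tgt_end = max(tgt_spans[i][1] for i in tgt_ids)
    return (src_start, src_end, tgt_start, tgt_end)


# Made-up positions for illustration only.
src_spans = {"de_s1": (0, 4), "de_s2": (5, 11), "de_s3": (12, 20)}
tgt_spans = {"en_s1": (0, 5), "en_s2": (6, 9), "en_s3": (10, 14)}

bead = (["de_s2"], ["en_s2", "en_s3"])  # a 1:2 alignment
positions = bead_to_positions(bead, src_spans, tgt_spans)
print(positions)
```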

(Note: since you’re on Windows, you will find the man pages as PDFs in a folder within the CWB installation location.)

The CWB tutorial chapter on alignment is currently half written and thus absent from the version on the website. I attach a build of the PDF I’ve just made, which contains further info on all the above.

Re 2. See http://cwb.sourceforge.net/faq.php?hoist=windows_terminal#windows_terminal
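
One plausible cause of that error message, offered as an assumption rather than something confirmed in this thread, is that the console passes your query in a legacy code page instead of UTF-8. The sketch below shows that the same word is valid UTF-8 only when it is actually encoded as UTF-8; the code pages cp850 and cp1252 are just common Windows examples, and is_valid_utf8 is a name made up for the illustration.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


query = "Bär"
for codec in ("utf-8", "cp850", "cp1252"):
    raw = query.encode(codec)
    # Show the bytes a console using this encoding would send.
    print(codec, raw, is_valid_utf8(raw))
```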

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Anne Schumann
Sent: 08 April 2016 08:19
To: cwb at sslmit.unibo.it
Subject: [CWB] parallel German and English corpus encoding

Dear CQP experts,
I would like to set up a parallel German and English corpus and I have two related questions:
1. I understand that the main difficulty here is to align the corpus. Is it possible to port existing alignments (e.g. a translation memory or outputs of other tools) to CWB? So far, I have managed to align and encode a mere 7 sentences with cwb-align and related tools. Beyond that, the difficulty of obtaining the exact same number of sentences on both sides from my sentence splitter made it very hard for me to encode the corpus. Any hints or best practices?
2. Maybe this is a naive question and not entirely related to CWB: Is there a way to handle German characters (ä and the like) properly on the console, that is, to ensure that they can be searched for and displayed properly? Actually, my registry file tells me that "charset = 'utf8'", but searching for Umlauts etc. triggers an error: "Query includes a character ... that is invalid in the encoding specified for this corpus." At the moment, I work on Windows.
Thanks in advance for your advice.
All the best,
Anne-Kathrin Schumann
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CWB_Encoding_Tutorial.pdf
Type: application/pdf
Size: 326842 bytes
Desc: CWB_Encoding_Tutorial.pdf
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20160408/97b4f36f/attachment-0001.pdf>