[CWB] parallel German and English corpus encoding

Anne Schumann ak47schumann at gmail.com
Fri Apr 8 09:18:52 CEST 2016


Dear CQP experts,

I would like to set up a parallel German and English corpus and I have two
related questions:

1. I understand that the main difficulty is aligning the corpus. Is it
possible to import existing alignments (e.g. from a translation memory or
the output of other tools) into CWB? So far, I have managed to align and
encode only 7 sentences with cwb-align and related tools. Beyond that, my
sentence splitter does not produce exactly the same number of sentences on
both sides, which made it very hard for me to encode the corpus. Any hints
or best practices?

2. Maybe this is a naive question and not entirely related to CWB: Is there
a way to handle German characters (ä and the like) properly on the console,
that is, to ensure that they can be searched for and displayed correctly?
My registry file tells me that "charset = 'utf8'", but searching for
umlauts etc. triggers an error: "Query includes a character ... that is
invalid in the encoding specified for this corpus." I am currently working
on Windows.
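
One possible cause of the error in question 2 is that the Windows console
passes the query as bytes in a legacy code page instead of UTF-8. This is
an assumption, not something confirmed for this setup. The minimal Python
sketch below does not call CWB at all. It only shows that an umlaut encoded
in cp850 or cp1252 (both code pages are guesses) is not a valid UTF-8 byte
sequence, which would be consistent with the message quoted above. The
query string and the function names are hypothetical.

```python
# Sketch under stated assumptions: the console code pages listed here are
# guesses, and the query string is a made-up example. No CWB call is made.

QUERY = '"M\u00e4dchen"'  # hypothetical query containing an a-umlaut


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_console_encodings(query, codepages=("utf-8", "cp850", "cp1252")):
    """Encode the query in each candidate console encoding and report
    whether the resulting bytes would be accepted as UTF-8."""
    return {cp: is_valid_utf8(query.encode(cp)) for cp in codepages}


if __name__ == "__main__":
    for codepage, ok in check_console_encodings(QUERY).items():
        status = "valid UTF-8" if ok else "NOT valid UTF-8"
        print(f"{codepage:8s} -> {status}")
```

If the console does turn out to use such a code page, the bytes that reach
the query processor would differ from the UTF-8 bytes stored in the corpus,
even though the character looks the same on screen.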

Thanks in advance for your advice.

All the best,
Anne-Kathrin Schumann