[CWB] Does cwb-align-encode support utf8?

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Jul 13 16:49:42 CEST 2012


Ah, of course, less - I shoulda known.

That suggests the following procedure:

- if the "home" corpus is ASCII or UTF8
--- set LESSCHARSET as UTF8
--- if the "foreign" corpus is ISO_8859
----- recode to UTF8 before printing
--- else
----- print as-is, i.e. as ASCII or UTF8
- else
--- set LESSCHARSET as [whatever the "home" corpus is]
--- if the "foreign" corpus is [the same as the "home" corpus]
----- print as-is
--- else
----- recode to ASCII before printing with substitution ON, i.e. with everything >=0x80 replaced with ?
----- (which will produce a string compatible with anything)

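This will work because any ISO_8859 charset can be mapped safely to UTF8, but not vice versa (and not from one ISO_8859 charset to another): in those directions the source may contain characters that simply have no equivalent in the target charset.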
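For concreteness, here is a minimal Python sketch of the procedure above. It is only an illustration, not CQP code: the function name `prepare_foreign` and the charset labels (Python codec names such as "latin-1" or "iso8859-2") are assumptions made up for the example, and the substitution step emits one "?" per non-ASCII character rather than one per byte.

```python
from typing import Tuple

# Hypothetical labels: Python codec names, chosen for illustration only.
UNICODE_SAFE = {"ascii", "utf-8"}


def prepare_foreign(raw: bytes, home: str, foreign: str) -> Tuple[str, bytes]:
    """Return (charset the pager should assume, bytes to print)."""
    if home in UNICODE_SAFE:
        # Home corpus is ASCII or UTF-8: the pager is told to expect UTF-8.
        if foreign in UNICODE_SAFE:
            return "utf-8", raw  # already valid UTF-8, print as-is
        # Foreign corpus is ISO 8859: recoding to UTF-8 is always possible.
        # Bytes undefined in that charset become U+FFFD instead of raising.
        text = raw.decode(foreign, errors="replace")
        return "utf-8", text.encode("utf-8")

    # Home corpus is some ISO 8859 charset: the pager follows it.
    if foreign == home:
        return home, raw  # same charset, print as-is
    # Different charset: fall back to ASCII with substitution, so every
    # character outside ASCII is printed as "?".
    text = raw.decode(foreign, errors="replace")
    return home, text.encode("ascii", errors="replace")
```

With a Latin1 home corpus and UTF8 aligned data, the foreign text comes out as question marks, which is ugly but compatible with whatever charset the pager has been told to expect.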

Sound about right?

>>> In the long run, wouldn't it make much more sense to switch all corpora to UTF-8?

You'll get no argument from me on that one! But we do have a somewhat established pattern of going to insane lengths in the name of backwards compatibility, e.g. the day or two I spent last year coding up case- and accent-folding tables for the 8859 charsets...

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 12 July 2012 15:27
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Does cwb-align-encode support utf8?


> Hmmm. Ray's initial problem was solved, but there does seem to be an underlying problem - arising from the English charset defaulting to Latin1 when unspecified, and the aligned Chinese data thus being treated as Latin1 for output.  (Though I don't understand why the Chinese data, once output, wasn't treated as UTF8 by the terminal...)

I suspect that this happens when data are viewed interactively in CQP?  Then it's an issue with the "less" pager used as a backend.  CQP needs to set the environment variable LESSCHARSET in order to tell "less" whether the input is in LatinX or UTF-8 encoding.  This is done based on the currently active corpus and cannot be switched across different parts of the output.
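To illustrate the mechanism (this is a rough Python sketch, not how CQP itself is implemented): the charset is passed to "less" through its environment when the pager is started, so it is fixed for the whole output. The helper names `pager_env` and `page` and the corpus charset labels are hypothetical; the LESSCHARSET values used ("ascii", "utf-8", "latin1", "iso8859") are ones listed in the less manual.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional

# Hypothetical corpus charset labels mapped to LESSCHARSET values.
LESS_CHARSETS = {"ascii": "ascii", "utf-8": "utf-8", "latin-1": "latin1"}


def pager_env(corpus_charset: str,
              base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set LESSCHARSET for the active corpus."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: any other 8-bit charset falls back to the generic
    # "iso8859" setting of less.
    env["LESSCHARSET"] = LESS_CHARSETS.get(corpus_charset, "iso8859")
    return env


def page(data: bytes, corpus_charset: str) -> None:
    """Pipe already-encoded output through less with one fixed charset."""
    subprocess.run(["less"], input=data, env=pager_env(corpus_charset),
                   check=False)
```

Because the variable is read once when the pager starts, a single chunk of output cannot mix two encodings.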

> 
> There is some kind of issue here, however I'm not quite sure what the answer is.
>  
> *         Clearly, it would be advantageous to allow alignment to be declared between two corpora that are in different charsets.
> *         However, that creates problems for display, since...
> *         ...it's equally clearly undesirable for CQP to be outputting two charsets in the same chunk of output.

Exactly.  The alignments should already be encoded correctly, but they cannot be displayed sensibly.

The only reasonable solution would be to allow CQP to re-encode text (and queries), so all corpora can be queried in UTF-8, Latin1, ... regardless of which encoding they're actually stored in.  But this probably opens several cans of worms ...

In the long run, wouldn't it make much more sense to switch all corpora to UTF-8?

Cheers,
Stefan


_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

