[CWB] [CQPWeb] diacritics in CQPweb

Stefan Evert stefanML at collocations.de
Mon Mar 31 10:30:56 CEST 2014


On 31 Mar 2014, at 10:19, "Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:

> It isn't, because the MySQL data is always in UTF-8, even if the CWB index is in Latin-1...


And we definitely don't want to go back to the old ISO-8859 days ...

I've been wondering whether it would be possible to generate appropriate sort keys and store those in the MySQL database.  This would give us control over the collation used and should ensure both correct sort order and collapsing of "equivalent" forms in frequency counts.  It might even be possible to allow users to switch between different collations for their analyses on the fly.
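To make the idea concrete, here is a minimal sketch in Python of what such a sort key could look like and how it would collapse "equivalent" forms in a frequency count. The `sort_key` function and its folding rules (strip combining accents, ignore case) are assumptions for illustration only. A real collation would more likely come from a proper Unicode Collation Algorithm implementation, and none of this is existing CQPweb code.

```python
import unicodedata
from collections import Counter


def sort_key(word: str) -> str:
    """Hypothetical accent- and case-insensitive key (not a full collation)."""
    # Decompose precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case so that "Cafe" and "cafe" share one key.
    return stripped.casefold()


def collapsed_frequencies(tokens):
    """Count tokens by sort key and return (key, count) pairs in key order."""
    counts = Counter(sort_key(t) for t in tokens)
    return sorted(counts.items())


tokens = ["café", "Cafe", "cafe", "zebra", "Élan", "elan"]
result = collapsed_frequencies(tokens)
for key, freq in result:
    print(key, freq)
```

Switching collations on the fly would then amount to swapping in a different key function, or a different precomputed key column.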

The only complication that comes to mind is how to recover the original word forms from the sort keys so they can be displayed on screen.  Perhaps this would require an additional table of (surface form, sort key) pairs.
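A rough sketch of such a lookup table follows. It uses Python's built-in `sqlite3` module purely as a stand-in for MySQL so that the example is self-contained. The table name `form_keys`, its columns, and the simplistic `sort_key` function are all hypothetical and not part of CQPweb.

```python
import sqlite3
import unicodedata


def sort_key(word: str) -> str:
    """Hypothetical key: strip combining accents and fold case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# In-memory SQLite database standing in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE form_keys ("
    " surface_form TEXT PRIMARY KEY,"
    " sort_key TEXT NOT NULL)"
)

tokens = ["café", "Cafe", "cafe", "café", "zebra"]
# Each distinct surface form is stored once, together with its key.
conn.executemany(
    "INSERT OR IGNORE INTO form_keys (surface_form, sort_key) VALUES (?, ?)",
    [(t, sort_key(t)) for t in tokens],
)


def surface_forms(key: str) -> list:
    """Return every stored surface form that maps to the given sort key."""
    rows = conn.execute(
        "SELECT surface_form FROM form_keys WHERE sort_key = ?"
        " ORDER BY surface_form",
        (key,),
    )
    return [row[0] for row in rows]


forms = surface_forms("cafe")
print(forms)
```

Since one key can map to several surface forms, the display layer would still have to decide which form to show, for instance the most frequent one.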


Am I overlooking something here?

Cheers,
Stefan





