[CWB] [ cwb-Bugs-2917570 ] CQPweb: inappropriate collation for German & other languages

Wed Dec 23 17:00:28 CET 2009

Bugs item #2917570, was opened at 2009-12-19 10:43
Message generated for change (Comment added) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=2917570&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQPweb
Group: None
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: Stefan Evert (schtepf)
Assigned to: Andrew Hardie (andrewhardie)
Summary: CQPweb: inappropriate collation for German & other languages

Initial Comment:
CQPweb stores all MySQL tables with collation utf8_general_ci, which is inappropriate for German and presumably many other languages because it ignores accents and other diacritics in addition to case-folding strings.  This leads to bogus entries in collocation tables, frequency lists and frequency breakdown.  

Examples: German verbs "fallen" and "fällen" are collapsed into a single entry, which is labelled randomly and may mislead German users to think that the frequent verb "fallen" doesn't occur in a corpus. Also, verbs and nouns are often collapsed, e.g. "treffen" and "Treffen".

For such languages, there should be an option to use binary collation (utf8_bin) in all MySQL tables (specified on a per-corpus basis) in order to get sensible collocations and frequency lists.  In the current state, these options are unusable for German corpora.

(Note: the most appropriate collation for German seems to be latin1_german2_ci, but this would require multi-charset support throughout CQPweb.)

----------------------------------------------------------------------

>Comment By: Andrew Hardie (andrewhardie)
Date: 2009-12-23 16:00

Message:
This is actually a whole lot worse than it seems at first glance once you
do a bit of digging.

Allowing customisation of collations on a per-corpus basis would be
relatively easy to do, thus: the default collation for the database should
remain utf8_general_ci (for tables like corpus_metadata_fixed etc., where
collation isn't really an issue). Some fields have an override to utf8_bin
already (most notably, simple_query and cqp_query in saved_queries, where
matching in a case-insensitive way is NOT what we want). HOWEVER, tables
that belong to the corpus (text_metadata_for_*, freq_*, freq_sc_*, db_*,
etc.) should pull their default collation from a per-corpus variable
that is chosen at setup time. [Fields that need utf8_bin can have a
single-field override.] Call it $corpus_sql_collation, and default it to
utf8_general_ci.
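
To make the per-corpus idea concrete, here is a minimal sketch in Python
(not CQPweb's own code). Everything in it apart from the collation names
and the freq_ table prefix is an assumption made for illustration: the
function name, the whitelist, and the two-column table layout are all
hypothetical.

```python
import re

# Assumption: this whitelist and the table layout are illustrative only;
# they are not taken from CQPweb.
DEFAULT_COLLATION = "utf8_general_ci"
ALLOWED_COLLATIONS = ("utf8_general_ci", "utf8_unicode_ci", "utf8_bin")


def create_freq_table_sql(corpus, corpus_sql_collation=DEFAULT_COLLATION):
    """Build a CREATE TABLE statement whose default collation is per-corpus."""
    if corpus_sql_collation not in ALLOWED_COLLATIONS:
        raise ValueError("unsupported collation: {}".format(corpus_sql_collation))
    # The corpus name ends up in a table name, so only accept safe characters.
    if not re.fullmatch(r"[A-Za-z0-9_]+", corpus):
        raise ValueError("unsafe corpus name: {!r}".format(corpus))
    return (
        "CREATE TABLE freq_{corpus} ("
        "item VARCHAR(255) NOT NULL, "
        "freq INT UNSIGNED NOT NULL"
        ") CHARACTER SET utf8 COLLATE {collation}"
    ).format(corpus=corpus, collation=corpus_sql_collation)


# A German corpus set up with binary collation; other corpora keep the default.
print(create_freq_table_sql("german", "utf8_bin"))
print(create_freq_table_sql("bnc"))
```

Tables that are not corpus-specific would simply never go through this
path and so keep the database-wide default.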

However, there are still two issues which, taken together, make me doubt
whether this is the right way to go. 

First -- German remains problematic, because there is no German-specific
collation for utf8. (The recommended collation is utf8_general_ci or
utf8_unicode_ci, BOTH of which would merge "fallen" and "fällen"; see
http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html
I imagine this is because the collations are primarily intended for
sorting, not merging.) utf8_bin is also inappropriate because it doesn't
fold case. (BTW I don't see that treffen and Treffen being merged is a bad
thing; the former can occur in sentence-initial position and either could
occur in BLOCKCAPS after all... and distinguishing homonymous nouns and
verbs is what the combo-annotation is for).

But there are two possibilities, one easy and unsatisfactory and one
harder and better. EASY -- use one of the other language collations as a
stopgap. HARD -- create a custom German collation, e.g. implementing
latin1_german2_ci for utf8; see
http://dev.mysql.com/doc/refman/5.0/en/adding-collation-unicode-uca.html
In short, you write an XML file specifying how the collation you want
differs from utf8_unicode_ci and then load it.

That said, latin1_german2 is not ideal either, as it folds some things
that CQP would not fold - e.g. 
Ä = AE
Ö = OE
Ü = UE
ß = ss
... causing the same difficulties as fallen~fällen only with
faellen~fällen instead (the one that occurs in the freq-list will happen
to be the one that comes first in the corpus; search links generated from
the freq list will find fewer results than the frequency of that item;
etc.) Granted this isn't as bad as fallen~fällen because faellen doesn't
coincide with a different wordform. But it's still not what we want. 

This is the second issue - ALL the language-specific collations seem to
have at least some accent folding, just like utf8_general_ci and
utf8_unicode_ci -- BUT (a) CQP's accent folding is turned off in CQPweb by
default (and you can only turn it on by using CQP syntax...), and (b) even
if you do turn it on, you will get different accent folding: CQP (according
to current behaviour) will fold Ä to A, making it incompatible with the
german2 collation, which folds Ä to AE.

In short we are going to get mysql~cqp mismatches for any collation except
utf8_bin -- but that doesn't allow case folding, which we DO want.
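
The mismatches described above can be simulated outside MySQL. The Python
sketch below is only an approximation: the three key functions are
hypothetical stand-ins that imitate the kind of folding discussed here
(accent+case, german2-style, case-only), not the real collation tables, and
the token list is made up. Each group is labelled with whichever form
comes first, as in the freq-list.

```python
import unicodedata


def accent_and_case_key(word):
    """Rough stand-in for a collation that folds both accents and case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def german2_like_key(word):
    """Rough stand-in for german2-style folding (ä = ae, ß = ss, ...)."""
    lowered = unicodedata.normalize("NFC", word).lower()
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        lowered = lowered.replace(src, dst)
    return lowered


def case_only_key(word):
    """Fold case and nothing else."""
    return word.lower()


def freq_list(tokens, key):
    """Group tokens by key; each group is labelled with its first-seen form."""
    groups = {}
    for tok in tokens:
        k = key(tok)
        label, count = groups.get(k, (tok, 0))
        groups[k] = (label, count + 1)
    return list(groups.values())


tokens = ["fällen", "fallen", "fallen", "faellen", "Treffen", "treffen"]

for name, key in (
    ("accent+case", accent_and_case_key),
    ("german2-like", german2_like_key),
    ("case only", case_only_key),
):
    print(name, freq_list(tokens, key))
```

With accent+case folding the entry labelled "fällen" gets a frequency of
3, although an exact search for "fällen" would find only one hit; with the
german2-like key the same label absorbs "faellen" instead. Only the
case-only key keeps fallen, fällen and faellen apart while still merging
Treffen~treffen.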

Is there a Unicode collation that only folds case and NOTHING else? If so,
that's what we want, for all languages. I haven't been able to find one, but
perhaps we could autogenerate one from here:
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
But even that includes SOME "accent" folding that would screw up CQP
searches, by converting letters that have no single-character counterpart
in the other case to equivalent multicharacter sequences. Examples:
0149; F; 02BC 006E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
(latter obviously an issue for German).
These are all marked as "F" however, and there are alternatives marked as
S:
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
So this might be a solution.
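
As a rough sketch of that idea in Python: read lines in the
CaseFolding.txt format and keep only the one-to-one entries. In that file
the status field is C (common), S (simple), F (full, possibly
multi-character) or T (Turkic special cases), so keeping C and S gives a
pure single-character fold. The function names and the embedded excerpt
are assumptions for illustration; turning the resulting map into an actual
MySQL collation is not shown.

```python
# Short excerpt in the CaseFolding.txt line format:
# <code>; <status>; <mapping>; # <name>
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0149; F; 02BC 006E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def simple_fold_map(lines):
    """Return {code point: folded code point} using only C and S entries."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code, status, target = fields[0], fields[1], fields[2]
        if status in ("C", "S"):  # skip F (multi-character) and T (Turkic)
            mapping[int(code, 16)] = int(target, 16)
    return mapping


def fold_case_only(text, mapping):
    """Apply the one-to-one fold; unmapped characters pass through unchanged."""
    return text.translate(mapping)


fold = simple_fold_map(SAMPLE.splitlines())
print(fold_case_only("A\u00c4\u00df\u1e9e", fold))  # input: A, Ä, ß, capital sharp S
```

With this excerpt, Ä folds to ä and the capital sharp S folds to ß, while
ß itself is left alone rather than expanded to "ss". For the full table
one would pass an open file handle on CaseFolding.txt instead of the
excerpt.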

It will also be necessary to check how case insensitivity works in
whatever utf8-regex-library we use in CQP. Here's what is said about
caseless matching in the PCRE man page (and thus, presumably, GLib, though
I've not checked):
<blockquote>
       Case-insensitive matching applies only to characters whose values
       are less than 128, unless PCRE is built with Unicode property
       support. Even when Unicode property support is available, PCRE
       still uses its own character tables when checking the case of
       low-valued characters, so as not to degrade performance. The
       Unicode property information is used only for characters with
       higher values. Even when Unicode property support is available,
       PCRE supports case-insensitive matching only when there is a
       one-to-one mapping between a letter's cases. There are a small
       number of many-to-one mappings in Unicode; these are not
       supported by PCRE.
</blockquote>

... which sounds as if PCRE does with CaseFolding.txt the same thing that
I proposed above. In which case all would be excellent.

</stream-of-consciousness>

----------------------------------------------------------------------
