[CWB] [ cwb-Bugs-3046113 ] cqp sorting doesn't respect %c and %d in utf8 corpora

Mon Aug 1 00:56:55 CEST 2011

Bugs item #3046113, was opened at 2010-08-16 09:48
Message generated for change (Settings changed) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3046113&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
>Group: TODO-3.5
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cqp sorting doesn't respect %c and %d in utf8 corpora

Initial Comment:
Almost certainly something to do with how %c and %d interact with the Glib function used for Unicode collation in special-chars.c.

* The Glib function does not seem to be respecting locales, contrary to my expectations from its documentation. Glib collation seems to be always case-sensitive, always accent-sensitive, and always binary (see bug report to list by Peter Ljunglöf, below). A locale might need to be declared in the registry file - and thus imported - for the right locale to be set on a per-corpus basis. Alternatively, is there some other means of string collation we could use?

----------Orig bug report:

2. %c doesn't make any difference for sorting:

MINISUC> X = "hur"%c "ska" "jag" | "har"%c "jag" "inte";
MINISUC> sort X by word;                                
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>
MINISUC> sort X by word%c;
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>

3. %d doesn't make any difference for sorting:

MINISUC> X = "h.r" "jag|och";  
MINISUC> sort X by word
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>
MINISUC> sort X by word%d
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>

Also, the sort order is not localized - in Swedish, z<å<ä<ö, but in CWB z<ä<å<ö. But I guess localization is difficult: in Swedish, åäö are not seen as diacritic variants, but áé... are diacritic. (And I don't think this is captured in the Unicode standard).
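
The Swedish ordering described above can be sketched as a custom sort key. This is only an illustration in Python, not CWB code: the names SWEDISH_ALPHABET, RANK and swedish_key are hypothetical, and the rule "å, ä, ö are letters of their own, every other accent is a variant of its base letter" is an assumption taken from the description in this report rather than from any locale database.

```python
import unicodedata

# Assumption: the Swedish alphabet ends in å, ä, ö (in that order).
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def strip_marks_except_swedish(s):
    """Lowercase s and remove accents, but keep å, ä, ö as distinct letters."""
    out = []
    for ch in unicodedata.normalize("NFC", s.lower()):
        if ch in RANK:
            out.append(ch)
        else:
            # Decompose (é -> e + U+0301) and drop the combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

def swedish_key(s):
    """Hypothetical sort key: alphabet rank first, original string as tie-breaker."""
    primary = [
        (1, RANK[ch]) if ch in RANK else (0, ord(ch))  # non-letters sort first
        for ch in strip_marks_except_swedish(s)
    ]
    return (primary, s)

hits = ["har jag", "hur jag", "här och", "hår och", "hör jag"]
print(sorted(hits, key=swedish_key))
```

With this key, "hår och" sorts before "här och", and "idé" lands directly after "ide" instead of after "idog". A real fix would more likely rely on locale-aware collation than on a hand-written table.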

4. Another example for %d, using é, which in Swedish should come between e and f:

MINISUC> X = "id.*";
MINISUC> sort X by word;  
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>
MINISUC> sort X by word%d;
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>

----------------------------------------------------------------------

Comment By: Andrew Hardie (andrewhardie)
Date: 2010-08-17 08:57

Message:
Not in utf8 mode they don't!  (see cl_string_qsort_compare() in
special-chars.c) That was the big change in my most recent commit.

int result = (int)g_utf8_collate((gchar *)s1, (gchar *)s2);

(which - note to self - should almost certainly be  
int result = (int)g_utf8_collate((gchar *)comp1, (gchar *)comp2); 
in any case...)

We can guarantee case/diac INsensitivity by means of normalisation prior
to this call. But I don't know if we can guarantee sensitivity, if Glib
decides to be insensitive of its own accord. That's the issue I was
concerned about.

In retrospect, this is probably not the same issue as the problem Peter is
having - it looks as if he does not have the most recent version, and in his
version UTF8 collation was done on binary, just as in Latin1, and %c %d just
wouldn't work at all for UTF8 (i.e. enforced insensitivity).

----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2010-08-17 08:41

Message:
Not sure whether this is relevant for the problem at hand, but %c and %d in
sort/count work by normalising strings to case/diacritic-folded form and
then doing a binary comparison on the normalised strings (just the same as
in CQP queries).  So I'm not sure whether collation has anything to do with
this.
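
To make the "normalise, then compare binary" idea concrete, here is a minimal sketch in Python rather than in CWB's C. The functions fold and sort_key are hypothetical stand-ins, and using NFD decomposition plus casefold() as the folding step is an assumption, not a description of what the CL library actually does.

```python
import unicodedata

def fold(s, ignore_case=False, ignore_diacritics=False):
    """Hypothetical stand-in for the %c / %d normalisation step."""
    if ignore_diacritics:
        # Decompose, then drop combining marks (é -> e + U+0301 -> e).
        decomposed = unicodedata.normalize("NFD", s)
        s = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if ignore_case:
        s = s.casefold()
    return unicodedata.normalize("NFC", s)

def sort_key(s, **flags):
    # "Binary comparison" here means comparing the UTF-8 bytes of the folded string.
    return fold(s, **flags).encode("utf-8")

words = ["idag", "ide", "ideer", "iden", "idog", "idé", "idéer", "idén"]

plain = sorted(words, key=sort_key)
folded = sorted(words, key=lambda w: sort_key(w, ignore_diacritics=True))

print(plain)   # same order as "sort X by word;" in the report
print(folded)  # accented forms now sit next to their unaccented counterparts
```

Because the comparison only ever sees the folded strings, the flags make a difference only if the folded copies are what gets compared. Passing the original strings to the comparison function instead would produce exactly the "no difference" symptom shown in the report.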

----------------------------------------------------------------------
