[CWB] [ cwb-Bugs-3046113 ] cqp sorting doesn't respect %c and %d in utf8 corpora

SourceForge.net noreply at sourceforge.net
Tue Aug 17 10:41:54 CEST 2010


Bugs item #3046113, was opened at 2010-08-16 11:48
Message generated for change (Comment added) made by schtepf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3046113&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
Group: None
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cqp sorting doesn't respect %c and %d in utf8 corpora

Initial Comment:
Almost certainly something to do with the interaction of %c and %d with the GLib function used for Unicode collation in special-chars.c.

* The GLib function does not seem to respect locales, contrary to my expectations from its documentation. GLib collation seems to be always case-sensitive, always accent-sensitive, and always binary (see the bug report sent to the list by Peter Ljunglöf, below). A locale might need to be declared in the registry file, and thus imported, so that the right locale can be set on a per-corpus basis. Alternatively, are there other means of string collation?

----------Original bug report:

2. %c doesn't make any difference for sorting:

MINISUC> X = "hur"%c "ska" "jag" | "har"%c "jag" "inte";
MINISUC> sort X by word;                                
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>
MINISUC> sort X by word%c;
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>

3. %d doesn't make any difference for sorting:

MINISUC> X = "h.r" "jag|och";  
MINISUC> sort X by word;
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>
MINISUC> sort X by word%d;
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>

Also, the sort order is not localized - in Swedish, z<å<ä<ö, but in CWB z<ä<å<ö. But I guess localization is difficult: in Swedish, åäö are not seen as diacritic variants, but áé... are diacritic. (And I don't think this is captured in the Unicode standard).

4. Another example for %d, using é, which in Swedish should come between e and f:

MINISUC> X = "id.*";
MINISUC> sort X by word;  
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>
MINISUC> sort X by word%d;
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>
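
The orderings in the report above are consistent with plain code point comparison: ä (U+00E4) < å (U+00E5) < ö (U+00F6), and é (U+00E9) sorts after every ASCII letter. As a rough illustration only, the Python sketch below contrasts that with the Swedish order described in the report (z<å<ä<ö, é between e and f). The alphabet table, the `ACCENTED` map and the `swedish_key` helper are all hypothetical names invented for this example; they are not part of CWB or GLib, and a real fix would presumably rely on a proper locale-aware collation library.

```python
# Hypothetical alphabet table: Swedish letter order as described in the report.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
# Hypothetical map of letters treated as accented variants of a base letter.
ACCENTED = {"é": "e", "á": "a"}


def swedish_key(s):
    """Build a sort key as a list of (letter rank, accent flag) pairs."""
    key = []
    for ch in s.lower():
        base = ACCENTED.get(ch, ch)
        rank = SWEDISH_ALPHABET.find(base)
        if rank < 0:
            # Characters outside the table (e.g. spaces) sort before letters.
            key.append((-1, ord(ch)))
        else:
            # An accented letter sorts directly after its base letter.
            key.append((rank, 1 if ch in ACCENTED else 0))
    return key


words = ["har jag", "hur jag", "här och", "hår och", "hör jag"]
by_code_point = sorted(words)                 # order seen in the report
by_alphabet = sorted(words, key=swedish_key)  # å before ä

ids = ["idag", "ide", "ideer", "iden", "idog", "idé", "idéer", "idén"]
ids_by_alphabet = sorted(ids, key=swedish_key)  # idé... before idog

print(by_code_point)
print(by_alphabet)
print(ids_by_alphabet)
```

Under these assumptions, `by_code_point` reproduces the order shown in the session above, while `by_alphabet` places "hår och" before "här och" and `ids_by_alphabet` places the "idé" forms before "idog".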


----------------------------------------------------------------------

>Comment By: Stefan Evert (schtepf)
Date: 2010-08-17 10:41

Message:
Not sure whether this is relevant for the problem at hand, but %c and %d in
sort/count work by normalising strings to case/diacritic-folded form and
then doing a binary comparison on the normalised strings (just the same as
in CQP queries).  So I'm not sure whether collation has anything to do with
this.
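
To make the mechanism described above concrete, here is a minimal Python sketch of "fold, then compare binary". It is an illustration only, not the CWB implementation: the `fold` helper is a hypothetical name, and it approximates %c with lowercasing and %d with Unicode decomposition plus removal of combining marks, which may well differ from the folding tables CWB actually uses.

```python
import unicodedata


def fold(s, case=False, diacritics=False):
    """Normalise s for comparison (hypothetical helper, not CWB code)."""
    if diacritics:
        # Decompose, drop combining marks, then recompose what is left.
        decomposed = unicodedata.normalize("NFD", s)
        s = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        s = unicodedata.normalize("NFC", s)
    if case:
        s = s.lower()
    return s


matches = ["Har jag inte", "Hur ska jag", "har jag inte", "hur ska jag"]
# Plain binary comparison: uppercase H sorts before lowercase h.
plain = sorted(matches)
# Analogue of %c: compare the case-folded strings (the sort is stable).
folded_c = sorted(matches, key=lambda s: fold(s, case=True))

words = ["idag", "ide", "ideer", "iden", "idog", "idé", "idéer", "idén"]
# Analogue of %d: compare the diacritic-folded strings.
folded_d = sorted(words, key=lambda s: fold(s, diacritics=True))

print(plain)
print(folded_c)
print(folded_d)
```

If folding were applied as described, the %c and %d sorts would be expected to interleave "Har"/"har" and "ide"/"idé" as this sketch does, rather than return the unfolded order shown in the report. Note that this simple approach also folds å, ä and ö to a and o, which is not what Swedish users would want.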

----------------------------------------------------------------------
