[CWB] [ cwb-Bugs-3046113 ] cqp sorting doesn't respect %c and %d in
	utf8 corpora
    SourceForge.net 
    noreply at sourceforge.net
       
    Mon Aug 16 11:48:12 CEST 2010
    
    
  
Bugs item #3046113, was opened at 2010-08-16 09:48
Message generated for change (Tracker Item Submitted) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3046113&group_id=131809
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
Group: None
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cqp sorting doesn't respect %c and %d in utf8 corpora
Initial Comment:
This is almost certainly caused by the interaction between the %c and %d flags and the Glib function used for Unicode collation in special-chars.c.
* The Glib function does not seem to respect locales, contrary to what I expected from its documentation. Glib collation seems to be always case-sensitive, always accent-sensitive, and always binary (see the bug report to the list by Peter Ljunglöf, below). A locale might need to be declared in the registry file, and thus imported, so that the right locale can be set on a per-corpus basis. Alternatively, are there other means of string collation?
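
To illustrate the behaviour one would expect from %c and %d, here is a minimal sketch in Python (not the CWB C code, and not the Glib API). It assumes that the flags should be applied by folding each string before comparison; the function name fold_key and its parameters are hypothetical. Plain code-point sorting reproduces the order reported below, while the folded keys give a case-insensitive or diacritic-insensitive order.

```python
import unicodedata

def fold_key(s, ignore_case=False, ignore_diacritics=False):
    """Hypothetical sort key: apply %c / %d style folding before comparing."""
    if ignore_diacritics:
        # Decompose, then drop combining marks (general category Mn).
        decomposed = unicodedata.normalize("NFD", s)
        s = "".join(ch for ch in decomposed
                    if unicodedata.category(ch) != "Mn")
    if ignore_case:
        s = s.casefold()
    # Recompose so that canonically equivalent strings get equal keys.
    return unicodedata.normalize("NFC", s)

matches = ["Har jag inte", "Hur ska jag", "har jag inte", "hur ska jag"]

# Binary (code point) order: all capitalised forms come first.
binary_order = sorted(matches)

# Case-insensitive order: "Har"/"har" are grouped, then "Hur"/"hur".
case_folded_order = sorted(matches, key=lambda m: fold_key(m, ignore_case=True))

accented = ["har jag", "hur jag", "här och", "hår och", "hör jag"]

# Diacritic-insensitive order: ä and å compare as a, ö compares as o.
accent_folded_order = sorted(
    accented, key=lambda m: fold_key(m, ignore_diacritics=True))
```

Because Python's sort is stable, strings whose folded keys are equal keep their input order. Whether CWB should break such ties by the unfolded form is a separate design decision not addressed here.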
----------Original bug report:
2. %c doesn't make any difference for sorting:
MINISUC> X = "hur"%c "ska" "jag" | "har"%c "jag" "inte";
MINISUC> sort X by word;                                
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>
MINISUC> sort X by word%c;
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>
3. %d doesn't make any difference for sorting:
MINISUC> X = "h.r" "jag|och";  
MINISUC> sort X by word
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>
MINISUC> sort X by word%d
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>
Also, the sort order is not localized - in Swedish, z<å<ä<ö, but in CWB z<ä<å<ö. But I guess localization is difficult: in Swedish, åäö are not seen as diacritic variants, but áé... are diacritic. (And I don't think this is captured in the Unicode standard).
4. Another example for %d, using é, which in Swedish should come between e and f:
MINISUC> X = "id.*";
MINISUC> sort X by word;  
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>
MINISUC> sort X by word%d;
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>
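
The Swedish ordering described in points 3 and 4 can be sketched as a two-level sort key. This is only an illustration under stated assumptions: the alphabet string, the RANK table, and the function swedish_key are hypothetical, and treating é as "e plus a secondary difference" is one possible reading of "between e and f". A real fix would more likely rely on a collation library with locale tailorings than on a hand-written table.

```python
import unicodedata

# Assumed Swedish alphabet: å, ä, ö are separate letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Hypothetical two-level key: base letters first, diacritics second."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word.casefold()):
        if ch in RANK:
            base = ch  # includes å, ä, ö as letters in their own right
        else:
            # Other accented letters (e.g. é) fall back to their base letter.
            base = unicodedata.normalize("NFD", ch)[0]
        if base in RANK:
            primary.append((1, RANK[base]))
        else:
            # Non-alphabet characters sort before letters, by code point.
            primary.append((0, ord(base)))
        secondary.append(0 if base == ch else 1)
    return (primary, secondary)

h_words = ["har jag", "hur jag", "här och", "hår och", "hör jag"]
h_sorted = sorted(h_words, key=swedish_key)

id_words = ["idag", "ide", "ideer", "iden", "idog", "idé", "idéer", "idén"]
id_sorted = sorted(id_words, key=swedish_key)
```

With this key, the h-words come out with u before å, ä, ö, and each é form sorts directly after its unaccented counterpart instead of after all plain-ASCII words.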
----------------------------------------------------------------------