[CWB] [ cwb-Bugs-3046113 ] cqp sorting doesn't respect %c and %d in
utf8 corpora
SourceForge.net
noreply at sourceforge.net
Mon Aug 1 00:56:55 CEST 2011
Bugs item #3046113, was opened at 2010-08-16 09:48
Message generated for change (Settings changed) made by andrewhardie
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3046113&group_id=131809
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
>Group: TODO-3.5
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cqp sorting doesn't respect %c and %d in utf8 corpora
Initial Comment:
Almost certainly caused by the interaction between %c/%d and the GLib function used for Unicode collation in special-chars.c.
* The GLib function does not seem to respect locales, contrary to what I expected from its documentation. GLib collation seems to be always case-sensitive, always accent-sensitive, and always binary (see the bug report sent to the list by Peter Ljunglöf, below). A locale might need to be declared in the registry file, and thus imported, so that the right locale can be set on a per-corpus basis. Alternatively, is there some other means of string collation?
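As a rough illustration of the "set the locale per corpus" idea, here is a minimal sketch in standard C using setlocale() and strcoll() rather than GLib. The function names set_corpus_collation() and collate_compare() are hypothetical and are not part of CWB. The locale name would presumably come from the registry file, and whether a given name (e.g. a Swedish UTF-8 locale) is installed depends on the platform.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: select the collation locale for the current corpus.
 * Returns 1 on success, 0 if the locale is not available on this system
 * (in which case the previous collation locale stays in effect). */
int set_corpus_collation(const char *locale_name)
{
    return setlocale(LC_COLLATE, locale_name) != NULL;
}

/* qsort() comparator over an array of char* that delegates to strcoll(),
 * so the resulting order follows the current LC_COLLATE setting. */
int collate_compare(const void *a, const void *b)
{
    const char *s1 = *(const char *const *)a;
    const char *s2 = *(const char *const *)b;
    return strcoll(s1, s2);
}
```

With the "C" locale, strcoll() orders strings the same way as strcmp(), which gives the binary ordering seen in the report below ("Har" and "Hur" before "har" and "hur"). Any other ordering depends on the locale data installed on the machine, so the caller should check the return value of set_corpus_collation() and fall back if necessary.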
----------Original bug report:
2. %c doesn't make any difference for sorting:
MINISUC> X = "hur"%c "ska" "jag" | "har"%c "jag" "inte";
MINISUC> sort X by word;
27990: <Har jag inte>
9734: <Hur ska jag>
6831: <har jag inte>
14738: <hur ska jag>
MINISUC> sort X by word%c;
27990: <Har jag inte>
9734: <Hur ska jag>
6831: <har jag inte>
14738: <hur ska jag>
3. %d doesn't make any difference for sorting:
MINISUC> X = "h.r" "jag|och";
MINISUC> sort X by word
19666: <har jag>
21298: <hur jag>
34116: <här och>
27715: <hår och>
1112: <hör jag>
MINISUC> sort X by word%d
19666: <har jag>
21298: <hur jag>
34116: <här och>
27715: <hår och>
1112: <hör jag>
Also, the sort order is not localized - in Swedish, z<å<ä<ö, but in CWB z<ä<å<ö. But I guess localization is difficult: in Swedish, åäö are not seen as diacritic variants, but áé... are diacritic. (And I don't think this is captured in the Unicode standard).
4. Another example for %d, using é, which in Swedish should come between e and f:
MINISUC> X = "id.*";
MINISUC> sort X by word;
958: <idag>
961: <ide>
964: <ideer>
967: <iden>
970: <idog>
973: <idé>
976: <idéer>
979: <idén>
MINISUC> sort X by word%d;
958: <idag>
961: <ide>
964: <ideer>
967: <iden>
970: <idog>
973: <idé>
976: <idéer>
979: <idén>
----------------------------------------------------------------------
Comment By: Andrew Hardie (andrewhardie)
Date: 2010-08-17 08:57
Message:
Not in utf8 mode they don't! (see cl_string_qsort_compare() in
special-chars.c) That was the big change in my most recent commit.
int result = (int)g_utf8_collate((gchar *)s1, (gchar *)s2);
(which - note to self - should almost certainly be
int result = (int)g_utf8_collate((gchar *)comp1, (gchar *)comp2);
in any case...)
We can guarantee case/diac INsensitivity by means of normalisation prior
to this call. But I don't know if we can guarantee sensitivity, if Glib
decides to be insensitive of its own accord. That's the issue I was
concerned about.
In retrospect, this is probably not the same issue as the problem Peter is
having - it looks as if he has an older version, where UTF8 collation
was done on binary, just as in Latin1, and %c %d just wouldn't work at all
for UTF8 (i.e. enforced insensitivity).
----------------------------------------------------------------------
Comment By: Stefan Evert (schtepf)
Date: 2010-08-17 08:41
Message:
Not sure whether this is relevant for the problem at hand, but %c and %d in
sort/count work by normalising strings to case/diacritic-folded form and
then doing a binary comparison on the normalised strings (just the same as
in CQP queries). So I'm not sure whether collation has anything to do with
this.
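
The fold-then-compare approach described here can be sketched roughly as follows. This is an illustration only, not the code in special-chars.c: the names FOLD_CASE, FOLD_DIAC, fold_key() and compare_folded() are hypothetical, and the folding table is a toy one that covers only ASCII and the two-byte UTF-8 letters U+00C0 to U+00FF.

```c
#include <stdlib.h>
#include <string.h>

#define FOLD_CASE 1  /* hypothetical flag standing in for %c */
#define FOLD_DIAC 2  /* hypothetical flag standing in for %d */

/* Base letters for U+00E0..U+00FF, indexed by the low five bits of the
 * second UTF-8 byte; '.' means "no base letter, leave unchanged". */
static const char BASE[] = "aaaaaa.ceeeeiiii.nooooo..uuuuy.y";

/* Return a newly allocated, folded copy of s (caller frees), or NULL. */
char *fold_key(const char *s, int flags)
{
    const unsigned char *p = (const unsigned char *)s;
    char *key = malloc(strlen(s) + 1);  /* folding never lengthens s */
    char *out = key;

    if (key == NULL)
        return NULL;
    while (*p) {
        if (p[0] == 0xC3 && p[1] >= 0x80 && p[1] <= 0xBF) {
            unsigned char b2 = p[1];
            /* U+00C0..U+00DE are upper case, except U+00D7 (times sign);
             * U+00DF (sharp s) is already lower case. */
            int upper = (b2 < 0xA0 && b2 != 0x97 && b2 != 0x9F);
            char base = '.';

            if ((flags & FOLD_CASE) && upper) {
                b2 += 0x20;             /* lower-case counterpart */
                upper = 0;
            }
            if ((flags & FOLD_DIAC) && b2 != 0x9F)
                base = BASE[b2 & 0x1F];
            if (base != '.') {
                *out++ = upper ? (char)(base - 'a' + 'A') : base;
            } else {
                *out++ = (char)0xC3;
                *out++ = (char)b2;
            }
            p += 2;
        } else {
            unsigned char c = *p++;
            if ((flags & FOLD_CASE) && c >= 'A' && c <= 'Z')
                c = (unsigned char)(c - 'A' + 'a');
            *out++ = (char)c;
        }
    }
    *out = '\0';
    return key;
}

/* Binary comparison of the folded copies, never of the originals. */
int compare_folded(const char *s1, const char *s2, int flags)
{
    char *k1 = fold_key(s1, flags);
    char *k2 = fold_key(s2, flags);
    int result = (k1 && k2) ? strcmp(k1, k2) : strcmp(s1, s2);

    free(k1);
    free(k2);
    return result;
}
```

With no flags, "idé" sorts after "idog" because the lead byte 0xC3 is larger than any ASCII letter; with FOLD_DIAC it compares equal to "ide" and so sorts before "idog". Note that this toy table also folds å, ä and ö to a and o, which, as the original report points out, is not how Swedish treats those letters.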
----------------------------------------------------------------------