[CWB] Unicode support in CWB version 3.2.b3
Peter Ljunglöf
peter.ljunglof at gu.se
Mon Aug 16 11:23:09 CEST 2010
Hi,
On 15 Aug 2010, at 18:07, Hardie, Andrew wrote:
> Just a quick note to let everyone know that the Unicode support features
> are (as of last weekend) now more-or-less complete with the addition of
> UTF8-aware sorting in CQP, charset-checking in cwb-encode, and proper
> instructions for adding the new external libraries to the build.
I've been testing the Unicode support for a week now, and it works fine. I've had no problems with searching (though I haven't tested it much). I do have two other problems, however. The first is that the "characters" context isn't UTF-8 aware:
MINISUC> "och"
318: d sex-tiden kom artisterna <och> nu började också publik
337: <A5>n jugoslaviska ambassaden <och> till de svenska artister
360: či , mučkalica , potatis <och> sallad , och vid en liten
384: placerades stora skärmar <och> på dem hängdes utställ
392: <A4>llningen , med målningar <och> skisser av Filip Bulatovi
397: kisser av Filip Bulatović <och> andra jugoslaviska konstn
As you can see, the lines vary in length depending on the number of non-ASCII characters. More importantly, on lines 337 and 392 the "å" and "ä" are cut in the middle, leaving the stray bytes <A5> and <A4>, which are not valid UTF-8 on their own.
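To illustrate what I think is going on, here is a minimal Python sketch. It is not CWB code: the function names and the sample string are made up for the example, and I am only assuming that the context width is being counted in bytes rather than in characters. The first function cuts by bytes and can land inside a multi-byte character; the second steps back one whole UTF-8 character at a time by skipping continuation bytes (bit pattern 10xxxxxx).

```python
def left_context_bytes(data: bytes, match_start: int, width: int) -> bytes:
    """Naive cut: the last `width` BYTES before the match."""
    return data[max(0, match_start - width):match_start]


def left_context_chars(data: bytes, match_start: int, width: int) -> bytes:
    """UTF-8 aware cut: the last `width` CHARACTERS before the match."""
    pos = match_start
    remaining = width
    while pos > 0 and remaining > 0:
        pos -= 1
        # Continuation bytes look like 10xxxxxx; keep stepping back
        # until we reach the first byte of the character.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        remaining -= 1
    return data[pos:match_start]


# Hypothetical sample text; "å" is encoded as the two bytes C3 A5.
sample = "från jugoslaviska ambassaden ".encode("utf-8")

naive = left_context_bytes(sample, len(sample), 27)
aware = left_context_chars(sample, len(sample), 27)

print(naive)                  # starts with the orphaned byte 0xA5
print(aware.decode("utf-8"))  # ån jugoslaviska ambassaden
```

With the character-aware version every line would also come out with the same number of characters, which would fix the varying line lengths.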
> Note, of course, that "complete" does not imply "bug-free", and there
> are three things in particular that I am anxious to check are working
> properly.
>
> The first is sorting in UTF8 (it is not clear, in particular, that
> case-sensitive diacritic-sensitive sorting will behave as it should);
1. Searching works fine:
MINISUC> "sas.*"%cd;
827: <Saša>
936: <såsom>
1430: <såsskedar>
3545: <såserna>
20984: <såsom>
2. %c doesn't make any difference for sorting:
MINISUC> X = "hur"%c "ska" "jag" | "har"%c "jag" "inte";
MINISUC> sort X by word;
27990: <Har jag inte>
9734: <Hur ska jag>
6831: <har jag inte>
14738: <hur ska jag>
MINISUC> sort X by word%c;
27990: <Har jag inte>
9734: <Hur ska jag>
6831: <har jag inte>
14738: <hur ska jag>
3. %d doesn't make any difference for sorting:
MINISUC> X = "h.r" "jag|och";
MINISUC> sort X by word;
19666: <har jag>
21298: <hur jag>
34116: <här och>
27715: <hår och>
1112: <hör jag>
MINISUC> sort X by word%d;
19666: <har jag>
21298: <hur jag>
34116: <här och>
27715: <hår och>
1112: <hör jag>
Also, the sort order is not localized: in Swedish, z<å<ä<ö, but in CWB z<ä<å<ö. But I guess localization is difficult: in Swedish, å, ä and ö are not seen as diacritic variants, but á, é, ... are. (And I don't think this is captured in the Unicode standard.)
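For what it's worth, the Swedish order can be approximated with a hand-written sort key. The Python sketch below is only an illustration and has nothing to do with how CWB sorts internally; the names SWEDISH_ALPHABET, fold_char and swedish_key are invented for the example. It ranks letters by their position in the Swedish alphabet, keeps å, ä and ö as letters in their own right, and strips the accents from everything else (so é is treated as e).

```python
import unicodedata

# Swedish alphabet order: å, ä, ö come after z, in that order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def fold_char(ch: str) -> str:
    """Lowercase; keep å/ä/ö, strip accents from other letters."""
    ch = ch.lower()
    if ch in RANK:
        return ch
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def swedish_key(text: str) -> list:
    """Sort key: non-letters first (by code point), then letters by rank."""
    key = []
    for ch in unicodedata.normalize("NFC", text):
        for base in fold_char(ch):
            if base in RANK:
                key.append((1, RANK[base]))
            else:
                key.append((0, ord(base)))
    return key


matches = ["hör jag", "hår och", "här och", "hur jag", "har jag"]

print(sorted(matches))                   # code point order: ä before å
print(sorted(matches, key=swedish_key))  # Swedish order: å before ä
```

A real solution would presumably use proper locale-specific collation rather than a hard-coded alphabet, but the sketch shows that the two requirements (åäö as separate letters, é as a variant of e) are not in conflict.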
4. Another example for %d, using é, which in Swedish should come between e and f:
MINISUC> X = "id.*";
MINISUC> sort X by word;
958: <idag>
961: <ide>
964: <ideer>
967: <iden>
970: <idog>
973: <idé>
976: <idéer>
979: <idén>
MINISUC> sort X by word%d;
958: <idag>
961: <ide>
964: <ideer>
967: <iden>
970: <idog>
973: <idé>
976: <idéer>
979: <idén>
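To make it clear what I expected from sorting with %c and %d, here is a small Python sketch of a folding sort key. It is an assumption about the intended behaviour, not a description of the CWB implementation, and the function fold_key and its parameters are hypothetical. Case is folded with str.casefold, and diacritics are removed by decomposing to NFD and dropping the combining marks.

```python
import unicodedata


def fold_key(text: str, ignore_case: bool = False,
             ignore_diacritics: bool = False) -> str:
    """Build a sort key that optionally ignores case and/or diacritics."""
    if ignore_diacritics:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if ignore_case:
        text = text.casefold()
    return text


# The matches in the order CWB returned them above.
case_matches = ["Har jag inte", "Hur ska jag", "har jag inte", "hur ska jag"]
diac_matches = ["idag", "ide", "ideer", "iden", "idog", "idé", "idéer", "idén"]

by_case = sorted(case_matches, key=lambda s: fold_key(s, ignore_case=True))
by_diac = sorted(diac_matches, key=lambda s: fold_key(s, ignore_diacritics=True))

print(by_case)  # "Har ..." and "har ..." now sort next to each other
print(by_diac)  # "idé" now sorts next to "ide", before "idog"
```

Note that this generic fold would also turn å, ä and ö into a, a and o, which is wrong for Swedish, so a localized version would need to exempt those three letters.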
best,
/Peter Ljunglöf
________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet