[CWB] Unicode support in CWB version 3.2.b3

Peter Ljunglöf peter.ljunglof at gu.se
Mon Aug 16 11:23:09 CEST 2010


Hi,

On 15 Aug 2010, at 18:07, Hardie, Andrew wrote:

> Just a quick note to let everyone know that the Unicode support features
> are (as of last weekend) now more-or-less complete with the addition of
> UTF8-aware sorting in CQP, charset-checking in cwb-encode, and proper
> instructions for adding the new external libraries to the build.

I've been testing the Unicode support for a week now, and it works fine overall. I've had no problems with searching (though I haven't tested that extensively). I do have two other problems, however. The first is that the "characters" context isn't UTF-8 aware:

MINISUC> "och"
      318: d sex-tiden kom artisterna <och> nu började också publik
      337: <A5>n jugoslaviska ambassaden <och> till de svenska artister 
      360: či , mučkalica , potatis <och> sallad , och vid en liten
      384:  placerades stora skärmar <och> på dem hängdes utställ
      392: <A4>llningen , med målningar <och> skisser av Filip Bulatovi
      397: kisser av Filip Bulatović <och> andra jugoslaviska konstn

As you can see, the lines vary in length, depending on the number of non-ASCII characters. More importantly, on lines 337 and 392, "å" and "ä" are cut in the middle, leaving the stray bytes <A5> and <A4>, which are not valid UTF-8 on their own.
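
To illustrate the problem, here is a minimal Python sketch (not CWB's actual code; the function names and the sample string are made up for illustration). It contrasts a context window counted in bytes with one counted in characters. The character-based version steps backwards over UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx, so it never starts in the middle of a character.

```python
def left_context_bytes(data: bytes, end: int, width: int) -> bytes:
    """Naive version: take `width` bytes before offset `end`.

    The result may begin in the middle of a multi-byte character.
    """
    return data[max(0, end - width):end]


def left_context_chars(data: bytes, end: int, width: int) -> bytes:
    """Take `width` characters before byte offset `end`.

    Assumes `data` is valid UTF-8 and `end` is a character boundary.
    """
    start = end
    remaining = width
    while start > 0 and remaining > 0:
        start -= 1
        # Continuation bytes look like 10xxxxxx; back up to the lead byte.
        while start > 0 and (data[start] & 0xC0) == 0x80:
            start -= 1
        remaining -= 1
    return data[start:end]


# Hypothetical sample: "å" is encoded as the two bytes C3 A5.
data = "måla och".encode("utf-8")
match_start = data.index(b"och")  # byte offset 6

broken = left_context_bytes(data, match_start, 4)  # starts with stray A5
intact = left_context_chars(data, match_start, 4)  # starts at the lead byte
print(broken)
print(intact.decode("utf-8"))
```

With a byte-based window the four bytes before "och" begin with the second half of "å", which is the same kind of stray <A5> seen at position 337 above. The character-based window returns five bytes but four characters, so the displayed lines would also come out the same length.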

> Note, of course, that "complete" does not imply "bug-free", and there
> are three things in particular that I am anxious to check are working
> properly.
> 
> The first is sorting in UTF8 (it is not clear, in particular, that
> case-sensitive diacritic-sensitive sorting will behave as it should); 


1. Searching works fine:

MINISUC> "sas.*"%cd;
      827:  <Saša>
      936:  <såsom>
     1430:  <såsskedar>
     3545:  <såserna>
    20984:  <såsom>

2. %c doesn't make any difference for sorting:

MINISUC> X = "hur"%c "ska" "jag" | "har"%c "jag" "inte";
MINISUC> sort X by word;                                
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>
MINISUC> sort X by word%c;
    27990:  <Har jag inte>
     9734:  <Hur ska jag>
     6831:  <har jag inte>
    14738:  <hur ska jag>
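
For comparison, a small Python sketch of the difference one might expect %c to make. This only illustrates case-sensitive versus case-insensitive ordering on the four matches above; it makes no claim about how CQP implements its sort.

```python
matches = ["Har jag inte", "Hur ska jag", "har jag inte", "hur ska jag"]

# Plain code point order: every uppercase "H..." sorts before any "h...".
case_sensitive = sorted(matches)

# Case-folded key: variants differing only in case end up next to each other.
case_insensitive = sorted(matches, key=str.casefold)

for line in case_insensitive:
    print(line)
```

The case-sensitive result matches both CQP outputs above, while the case-folded sort groups "Har jag inte" with "har jag inte" and "Hur ska jag" with "hur ska jag".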

3. %d doesn't make any difference for sorting:

MINISUC> X = "h.r" "jag|och";  
MINISUC> sort X by word
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>
MINISUC> sort X by word%d
    19666:  <har jag>
    21298:  <hur jag>
    34116:  <här och>
    27715:  <hår och>
     1112:  <hör jag>

Also, the sort order is not localized: in Swedish, z<å<ä<ö, but in CWB z<ä<å<ö. But I guess localization is difficult: in Swedish, å, ä and ö are not seen as diacritic variants, whereas á, é, etc. are. (And I don't think this is captured in the Unicode standard.)
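
The z<ä<å<ö order is simply Unicode code point order (ä is U+00E4, å is U+00E5, ö is U+00F6). As a rough sketch of what a Swedish-aware comparison could look like, the Python below ranks letters by their position in a hand-written alphabet string. The names SWEDISH_ALPHABET and swedish_key are hypothetical, and the handling of characters outside the alphabet is an arbitrary assumption, not a full collation scheme.

```python
# Swedish alphabet order: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    """Sort key: alphabet position per letter, ignoring case.

    Characters outside the alphabet are placed after all letters,
    in code point order (an assumption made for this sketch).
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.casefold()]


words = ["hör", "hår", "här", "hur", "har"]

code_point_order = sorted(words)               # ä before å
swedish_order = sorted(words, key=swedish_key)  # å before ä

print(code_point_order)
print(swedish_order)
```

Plain sorting reproduces the order CQP printed for "h.r" above, while the alphabet-based key puts "hår" before "här".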

4. Another example for %d, using é, which in Swedish should come between e and f:

MINISUC> X = "id.*";
MINISUC> sort X by word;  
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>
MINISUC> sort X by word%d;
      958:  <idag>
      961:  <ide>
      964:  <ideer>
      967:  <iden>
      970:  <idog>
      973:  <idé>
      976:  <idéer>
      979:  <idén>
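
A sketch of the ordering one might expect from a diacritic-insensitive sort of these words, written in Python with the standard unicodedata module. The helper name strip_diacritics is made up, and decomposing to NFD and dropping combining marks is only one possible way to fold diacritics, not necessarily what CWB does.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Decompose to NFD and drop combining marks, so "é" becomes "e"."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["idag", "ide", "ideer", "iden", "idog", "idé", "idéer", "idén"]

# Code point order: all "idé..." forms sort after "idog".
plain = sorted(words)

# Folded key: "idé" sorts together with "ide", before "idog".
folded = sorted(words, key=strip_diacritics)

print(folded)
```

Note that this folding also turns å, ä and ö into a, a and o, which is not what a Swedish user would want, so a localized sort would have to exempt those letters.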

best, 
/Peter Ljunglöf

________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet



