[CWB] unicode problems with Greek and OCS

Tue Mar 10 22:18:08 CET 2015

Ruprecht sent me his Greek lexicon files, and I am currently writing a C program that runs three GLib calls on each item and checks the results: g_utf8_validate, then g_utf8_normalize to NFD (decompose), then g_utf8_normalize back to NFC (recompose).

That's what I meant by pondering!

... and I've just finished: there are no errors.

@RUPRECHT - could you send me the lexicons for the other corpus as well? That might help. Alternatively, you could run the test yourself; see below.

> I'm a bit surprised that we have fewer "invalid UTF8 string" messages in the second pass than in the first pass.  

Yes, that is perplexing.

Andrew.

PS. Here's the test code (findout.c):

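#include <stdio.h>
#include <glib.h>

int main(int argc, char *argv[])
{
        FILE *src;
        char buf[1024];
        char *mark, *comp, *decomp;
        int c, i;

        /* the lexicon is a sequence of NUL-terminated strings, so read it as binary */
        src = fopen("word.lexicon", "rb");
        if (NULL == src)
        {
                perror("word.lexicon");
                return 1;
        }
        mark = buf;

        /* i counts complete lexicon items, not bytes read */
        for (i = 0; EOF != (c = fgetc(src)); )
        {
                *mark = (char)c;
                if ('\0' == *mark)
                {
                        comp = decomp = NULL;
                        if (!g_utf8_validate(buf, -1, NULL))
                                printf("Error (validate returns false)     in item %d [%s]\n", i, buf);
                        if (NULL == (decomp = g_utf8_normalize(buf, -1, G_NORMALIZE_NFD)))
                                printf("Error (Decompose NFD returns null) in item %d [%s]\n", i, buf);
                        /* only recompose when decomposition succeeded; never pass NULL on */
                        else if (NULL == (comp = g_utf8_normalize(decomp, -1, G_NORMALIZE_NFC)))
                                printf("Error (Recompose NFC returns null) in item %d [%s]\n", i, buf);
                        /* strings allocated by GLib are released with g_free(), which accepts NULL */
                        g_free(comp);
                        g_free(decomp);
                        mark = buf;
                        if (0 == ++i % 1000)
                                printf("done %d!\n", i);
                }
                else if (mark < buf + sizeof(buf) - 1)
                        ++mark;
                else
                {
                        /* refuse to overrun buf on an unexpectedly long item */
                        fprintf(stderr, "Item %d is longer than %d bytes, giving up\n", i, (int)sizeof(buf) - 1);
                        fclose(src);
                        return 1;
                }
        }
        fclose(src);

        return 0;
}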

Compile with:

gcc findout.c `pkg-config --cflags --libs glib-2.0`
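
If an item does fail, printing it with %s may not help much, since a terminal tends to mangle or hide invalid bytes. As a rough sketch only, a helper along the following lines could print the raw bytes instead. The name hex_bytes is hypothetical (it is not part of findout.c, GLib or CWB), and it uses nothing beyond the C standard library:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of findout.c or GLib): write the bytes of
 * the NUL-terminated string s into out as space-separated lowercase hex.
 * The output is truncated to whole bytes that fit in outsz and is always
 * NUL-terminated. Returns the number of input bytes that were formatted. */
size_t hex_bytes(const char *s, char *out, size_t outsz)
{
        size_t n = 0, used = 0;

        if (0 == outsz)
                return 0;
        out[0] = '\0';
        for (; s[n] != '\0'; n++)
        {
                /* two hex digits per byte, plus a separator after the first */
                size_t need = n ? 3 : 2;
                if (used + need + 1 > outsz)
                        break;
                used += (size_t)snprintf(out + used, outsz - used,
                                         n ? " %02x" : "%02x",
                                         (unsigned char)s[n]);
        }
        return n;
}
```

In the test program this could be called on buf inside any of the error branches, with the resulting hex string printed next to the item number. For locating the problem within an item, the third argument of g_utf8_validate can also be given the address of a const gchar * instead of NULL; GLib then stores there the position of the first invalid byte.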

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 10 March 2015 20:42
To: CWBdev Mailing List
Subject: Re: [CWB] unicode problems with Greek and OCS

> I am continuing to ponder...

Could we get access to the two offending corpora?  It's probably easier if we have the data so we can at least find out which strings are the culprits.  Otherwise Andrew has to continue poking around in the dark.

The error messages didn't show which words failed normalization.  The output that includes word forms is from a different part of the program checking internal consistency of the feature vectors (which shouldn't throw errors, of course, but that may be a side effect of the other failures).

I'm a bit surprised that we have fewer "invalid UTF8 string" messages in the second pass than in the first pass.  Both passes should be doing more or less the same thing.  Another reason why it would be useful to get hold of the corpus data.  The "word" attributes from both corpora would be sufficient.

Cheers,
Stefan
