[CWB] unicode problems with Greek and OCS

Wed Mar 11 10:40:41 CET 2015

I have been thinking, and the number of "invalid UTF8" messages seems to be significant. To quote:

------
FEATURE: 3-grams, weight=3 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[21952]
FEATURE: 4-grams, weight=4 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[614656]
<snip>
PASS 2: Processing 3-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Processing 4-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Creating character counts.
------

- on the first pass, there are THREE messages during the calculation of 3-grams and FOUR during the calculation of 4-grams

This strongly suggests that the call to cl_string_canonical is happening inside a "for (i=0;i<n;i++)" loop. But I spent quite a lot of time last night searching and I can't find such a loop - or rather, I can, but none of those loops calls cl_string_canonical.

- on the second pass there are TWO messages during 3-grams and then another TWO during 4-grams

which suggests that the call is NOT in such a loop.

But I have been staring at the two bits of code for pass 1 and pass 2 with no result. The 4 obvious calls to cl_string_canonical are in the wrong place for this to make any sense (they are in loops across lexical items), and I cannot identify any calls in the right place to functions that might then call cl_string_canonical.

Unless I am reading too much into the timing of the error messages? (since the above c&p is, of course, a mixture of stdout and stderr...)
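
On the stdout/stderr point: in C's stdio, stdout is normally fully buffered when it is redirected to a file or pipe, while stderr is not, so the position of an error message in a captured log need not reflect when it was actually written. A minimal sketch of one way to rule that out is to flush stdout after every progress message. Note that progress_msg is a hypothetical helper name for illustration, not a function in the CWB source:

```c
#include <stdio.h>

/* Hypothetical helper (not part of CWB): write a progress message to stdout
 * and flush immediately, so that anything written to stderr afterwards
 * cannot appear ahead of it in a combined log.
 * Returns 0 on success, -1 if writing or flushing failed. */
int progress_msg(const char *msg)
{
    if (fputs(msg, stdout) == EOF)
        return -1;
    return (fflush(stdout) == 0) ? 0 : -1;
}
```

With something like this in place, a message such as "PASS 2: Processing 3-grams." would reach the log before any error emitted after it, which would make the per-phase counts above easier to trust.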

Stefan, any thoughts?

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hardie, Andrew
Sent: 10 March 2015 21:18
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] unicode problems with Greek and OCS

Ruprecht sent me his Greek lexicon files, and I am currently writing a C program to test the results of 3 GLib calls on each item (validate, normalise decompose, normalise compose).

That's what I meant by pondering!

... and I've just finished and there are no errors. 

@RUPRECHT - could you send me the lexicons for the other corpus, maybe? That might help. Or, alternatively, could you run the test yourself? See below.

> I'm a bit surprised that we have fewer "invalid UTF8 string" messages in the second pass than in the first pass.  

Yes, that is perplexing.

Andrew.

PS. Here's the test code. 
findout.c
#include <stdio.h>
#include <glib.h>

/* Reads NUL-terminated items from word.lexicon and runs three GLib checks on each. */
int main(void)
{
        FILE *src;
        char buf[1024];
        char *mark;
        gchar *comp, *decomp;
        int c;
        int i = 0;      /* index of the current lexicon item */

        src = fopen("word.lexicon", "rb");
        if (NULL == src)
        {
                perror("word.lexicon");
                return 1;
        }
        mark = buf;

        while (EOF != (c = fgetc(src)))
        {
                *mark = (char)c;
                if ('\0' == *mark)
                {
                        /* end of one item: buf now holds a complete string */
                        comp = decomp = NULL;
                        if (!g_utf8_validate(buf, -1, NULL))
                                printf("Error (validate returns false)     in item %d [%s]\n", i, buf);
                        if (NULL == (decomp = g_utf8_normalize(buf, -1, G_NORMALIZE_NFD)))
                                printf("Error (Decompose NFD returns null) in item %d [%s]\n", i, buf);
                        /* skip recomposition if decomposition failed, since decomp is NULL then */
                        else if (NULL == (comp = g_utf8_normalize(decomp, -1, G_NORMALIZE_NFC)))
                                printf("Error (Recompose NFC returns null) in item %d [%s]\n", i, buf);
                        g_free(comp);   /* g_free accepts NULL */
                        g_free(decomp);
                        mark = buf;
                        ++i;
                        if (0 == i % 1000)
                                printf("done %d!\n", i);
                }
                else if (mark < buf + sizeof(buf) - 1)
                        ++mark; /* an item longer than buf is truncated, which may itself trigger an error report */
        }
        fclose(src);

        return 0;
}
gcc findout.c `pkg-config --cflags --libs glib-2.0`

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 10 March 2015 20:42
To: CWBdev Mailing List
Subject: Re: [CWB] unicode problems with Greek and OCS

> I am continuing to ponder...

Could we get access to the two offending corpora?  It's probably easier if we have the data so we can at least find out which strings are the culprits.  Otherwise Andrew has to continue poking around in the dark.

The error messages didn't show which words failed normalization.  The output that includes word forms is from a different part of the program checking internal consistency of the feature vectors (which shouldn't throw errors, of course, but that may be a side effect of the other failures).
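
Since the messages do not name the offending strings, one way to locate them is to report where in each string the UTF-8 first goes wrong. The following is a minimal sketch of a hand-rolled check in plain C. It is not GLib's or CL's implementation, and the function name first_invalid_utf8 is made up for illustration. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates and code points above U+10FFFF:

```c
#include <stddef.h>

/* Hypothetical helper (not part of CWB or GLib).
 * Returns -1 if the NUL-terminated string is well-formed UTF-8,
 * otherwise the byte offset of the first offending sequence. */
long first_invalid_utf8(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t i = 0;

    while (s[i]) {
        unsigned char c = s[i];
        size_t need;             /* continuation bytes expected after c */
        unsigned long cp, min;   /* decoded code point, smallest legal value */

        if (c < 0x80) { i++; continue; }                       /* ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;     /* stray continuation or illegal lead byte */

        for (size_t k = 1; k <= need; k++) {
            /* a NUL here fails the test too, so we never read past the end */
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;      /* overlong, out of range, or surrogate */
        i += need + 1;
    }
    return -1;
}
```

Called on each item of a "word" lexicon, this would give both the item and the byte position to inspect, for instance in a hex dump.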

I'm a bit surprised that we have fewer "invalid UTF8 string" messages in the second pass than in the first pass.  Both passes should be doing more or less the same thing.  Another reason why it would be useful to get hold of the corpus data.  The "word" attributes from both corpora would be sufficient.

Cheers,
Stefan

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb