[CWB] invalid UTF8 string passed to cl_string_canonical...

Thu May 12 03:32:40 CEST 2016

Having had a poke...

I am not so sure about the easy fix. There are only 4 calls to cl_string_canonical in feature_maps.c. Here they are, each with the line that precedes it:

@ 286
              s = (unsigned char *) cl_strdup(cl_id2str(w_attr1, i));
              cl_string_canonical( (char *)s, charset, IGNORE_CASE | IGNORE_DIAC);
@ 295
              s = (unsigned char *) cl_strdup(cl_id2str(w_attr2, i));
              cl_string_canonical( (char *)s, charset, IGNORE_CASE | IGNORE_DIAC);
@ 508
              s_orig = s = (unsigned char *) cl_strdup(cl_id2str(w_attr1, i));
              cl_string_canonical( (char *)s, charset, IGNORE_CASE | IGNORE_DIAC);
@ 526
              s_orig = s = (unsigned char *) cl_strdup(cl_id2str(w_attr2, i));
              cl_string_canonical( (char *)s, charset, IGNORE_CASE | IGNORE_DIAC);

Every time it is the same pattern: the main (1st) argument to cl_string_canonical is a string that was copied out of the lexicon on the line before. There's no possibility that those strings could have been sliced up (i.e. cut in the middle of a multi-byte UTF-8 character) before the call.
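
To make the "sliced up" worry concrete: cutting a UTF-8 string at an arbitrary byte offset can split a multi-byte character, and each fragment is then invalid on its own. Below is a minimal sketch of a check that could be run on a string just before it is handed to cl_string_canonical. Note that is_valid_utf8 is a hypothetical helper written for illustration only; it is not part of the CL API and says nothing about how CL itself validates strings.

```c
#include <stddef.h>

/* Hypothetical helper (NOT a CL function): returns 1 if the first len
 * bytes of s are well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t n;          /* number of continuation bytes expected */
        unsigned long cp;  /* code point decoded so far */
        size_t k;

        if (c < 0x80)                { n = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return 0;     /* stray continuation byte or invalid lead byte */

        if (n > len - i - 1)
            return 0;      /* sequence is cut off by the end of the buffer */

        for (k = 1; k <= n; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;  /* expected a continuation byte (10xxxxxx) */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* reject overlong encodings, surrogates and values above U+10FFFF */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000))
            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        if (cp > 0x10FFFF)
            return 0;

        i += n + 1;
    }
    return 1;
}
```

For example, the five bytes of "caf\xC3\xA9" (the word "café") pass the check, but the first four bytes alone do not, because the two-byte character is cut after its lead byte; the lone trailing byte "\xA9" fails as well. A whole string copied out of the lexicon should never look like either fragment.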

I had a hunt for calls to other CL funcs that might in turn call cl_string_canonical but could not find any...

So, it looks to me like the "quick fix" -- "apply cl_string_canonical to original string" -- turns out to actually be the current situation.

Weird....

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 10 May 2016 08:18
To: CWBdev Mailing List
Subject: Re: [CWB] invalid UTF8 string passed to cl_string_canonical...

> On 9 May 2016, at 17:38, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> 
> Fixing this bug is something that needs to be done but is going to be a right royal pain in the neck because it will mean fairly complex checking of the byte sequences – so not something I am going to have time for in the near future I’m afraid.

I've created a bug ticket for this problem.

There's an easy fix (simply apply cl_string_canonical to original string before extracting n-grams), but it would be better to re-implement the feature extraction so that non-Latin alphabets can also be supported.

Best,
Stefan