[CWB] invalid UTF8 string passed to cl_string_canonical...

Stefan Evert stefanML at collocations.de
Tue May 10 09:18:12 CEST 2016


> On 9 May 2016, at 17:38, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> 
> Fixing this bug is something that needs to be done but is going to be a right royal pain in the neck because it will mean fairly complex checking of the byte sequences – so not something I am going to have time for in the near future I’m afraid.

I've created a bug ticket for this problem.

There's an easy fix (simply apply cl_string_canonical to the original string before extracting n-grams), but it would be better to re-implement the feature extraction so that non-Latin alphabets can also be supported.
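
To illustrate the order of operations, here is a minimal Python sketch, not CWB code. It assumes the n-grams are taken over bytes, which is how a fragment can end up splitting a multi-byte UTF-8 character. The names canonical, byte_ngrams and char_ngrams are hypothetical, and canonical is only a case-folding stand-in for cl_string_canonical.

```python
def canonical(s):
    # Hypothetical stand-in for cl_string_canonical: case folding only.
    return s.casefold()


def byte_ngrams(data, n):
    # All contiguous byte n-grams; these may cut through a multi-byte character.
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_ngrams(s, n):
    # N-grams over code points, so no character is ever split.
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def is_valid_utf8(b):
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "Über"

# Problematic order: extract byte n-grams first, canonicalise each one later.
# Some fragments are not valid UTF-8 and cannot safely be canonicalised.
raw = byte_ngrams(word.encode("utf-8"), 2)
broken = [g for g in raw if not is_valid_utf8(g)]

# Easy fix: canonicalise the whole (valid) string once, then extract n-grams.
# The fragments no longer need to be passed to the canonicalisation step.
fixed = byte_ngrams(canonical(word).encode("utf-8"), 2)

# Possible re-implementation: n-grams over characters instead of bytes.
by_char = char_ngrams(canonical(word), 2)
```

In this sketch, broken contains the fragment that starts in the middle of the two-byte "Ü", while by_char holds only whole characters and would treat a non-Latin script the same way as a Latin one.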

Best,
Stefan

