[CWB] invalid UTF8 string passed to cl_string_canonical...
Stefan Evert
stefanML at collocations.de
Tue May 10 09:18:12 CEST 2016
> On 9 May 2016, at 17:38, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
>
> Fixing this bug needs to be done, but it is going to be a right royal pain in the neck because it will mean fairly complex checking of the byte sequences, so it is not something I am going to have time for in the near future, I’m afraid.
I've created a bug ticket for this problem.
There's an easy fix (simply apply cl_string_canonical to the original string before extracting n-grams), but it would be better to re-implement the feature extraction so that non-Latin alphabets can also be supported.
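
To illustrate the order of operations in the easy fix, here is a minimal sketch in Python rather than CWB's C. It assumes the invalid strings come from n-grams that are cut out first and canonicalised afterwards. The canonical() helper is a hypothetical stand-in that case-folds and strips diacritics; it is not the real cl_string_canonical, and char_ngrams() is likewise only an illustrative name.

```python
import unicodedata
from typing import List


def canonical(s: str) -> str:
    """Hypothetical stand-in for cl_string_canonical (NOT the CWB API):
    case-fold the string and strip combining diacritics."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def char_ngrams(s: str, n: int = 2) -> List[str]:
    """Canonicalise the whole string first, then extract character n-grams.

    The n-grams are taken over code points, so a multibyte character is
    never split in the middle.
    """
    s = canonical(s)
    return [s[i:i + n] for i in range(len(s) - n + 1)]


print(char_ngrams("Über", 2))
```

Because normalisation runs on the complete string, the canonicalisation step only ever sees well-formed input, whatever the n-gram length.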
Best,
Stefan