[CWB] unicode problems with Greek and OCS

Tue Mar 10 21:41:50 CET 2015

> I am continuing to ponder...

Could we get access to the two offending corpora?  It's probably easier if we have the data so we can at least find out which strings are the culprits.  Otherwise Andrew has to continue poking around in the dark.

The error messages didn't show which words failed normalization.  The output that does include word forms comes from a different part of the program, which checks the internal consistency of the feature vectors (that check shouldn't throw errors, of course, but they may be a side effect of the other failures).

I'm a bit surprised that we have fewer "invalid UTF8 string" messages in the second pass than in the first pass.  Both passes should be doing more or less the same thing.  That is another reason why it would be useful to get hold of the corpus data.  The "word" attributes from both corpora would be sufficient.
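
For what it's worth, once the "word" attributes are available as plain text, something along these lines might help to spot the culprits.  This is only a rough sketch and not what CWB does internally: it assumes the word forms have been dumped one per line into a file, and the function names find_invalid_utf8 and scan_file are made up for illustration.  It only checks for byte sequences that are not valid UTF-8; it would not catch strings that are valid UTF-8 but fail normalization for some other reason.

```python
def find_invalid_utf8(lines):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8.

    `lines` is any iterable of byte strings, e.g. a file opened in binary mode.
    Assumption: one word form per line.
    """
    bad = []
    for lineno, raw in enumerate(lines, start=1):
        token = raw.rstrip(b"\r\n")  # drop the line terminator only
        try:
            token.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            bad.append((lineno, token))
    return bad


def scan_file(path):
    """Run the check on a file; `path` is a hypothetical word-list dump."""
    with open(path, "rb") as handle:
        return find_invalid_utf8(handle)
```

Calling scan_file on the dump from each corpus and printing the result would give the line numbers and raw bytes of the offending tokens, which could then be compared against the messages from the two passes.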

Cheers,
Stefan