[CWB] Encoding problem in CWB version 3.2.4
Stefan Evert
stefanML at collocations.de
Tue Sep 28 13:31:01 CEST 2010
On 28 Sep 2010, at 11:21, Peter Ljunglöf wrote:
> Yes, I can confirm that too. But only if I have a Latin-1 corpus, and only on the second query (and later):
>
> $ cqp -D MINISUC
> MINISUC> "xyzzyz";
> 0 matches.
> MINISUC> "xyzzyz";
> CQP Error:
> Query includes a character or character sequence that is invalid
> in the encoding specified for this corpus
Just for the curious: that is because the parser executes the first command as soon as it has read the ";" terminator (you still have to press Enter to send the line to CQP, of course), so it has not seen a newline at that point. The first newline character is therefore part of the second command, and that is the one that fails.
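To illustrate the effect, here is a minimal Python sketch. It is not CQP's actual parser: `split_commands` is a hypothetical helper that simply cuts the input at every ";" (it ignores quoting), which is enough to show where the newline ends up.

```python
def split_commands(stream):
    """Yield each command as soon as its terminating ';' has been read.

    Toy model only: a ';' inside a quoted string is not handled.
    """
    buffer = ""
    for ch in stream:
        buffer += ch
        if ch == ";":
            yield buffer
            buffer = ""
    # Anything left in the buffer (here: the final newline) is still
    # waiting for a later ';' and is never yielded.


# What the user types: the same query twice, each followed by Enter.
typed = '"xyzzyz";\n"xyzzyz";\n'
commands = list(split_commands(typed))

for command in commands:
    print(repr(command))
# '"xyzzyz";'
# '\n"xyzzyz";'
```

The first command contains no newline at all, while the second one starts with the newline that was typed after the first query.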
> Nope, pure unix line endings. I have the cwb-encode latin-1 problems on my Mac with the latest subversion (CWB v3.2.4), but not on a Linux server with CWB 3.2.b1.
Strange, I thought I had tried with both encodings on my Mac; but perhaps I wasn't careful and/or thorough enough with my testing.
Anyway, many thanks to all bug reporters!
>> One negative side-effect is that filenames read and written from within CQP must be in the current corpus encoding. If anything, we should force people to use ASCII filenames, but not e.g. Latin1 on a Mac OS X computer. Any chance of improving the validation method in the mid to long run, or is that just too complicated?
>
> Mac OSX doesn't use Latin-1 filenames anyway, but UTF-8.
That was exactly my point, which I may not have expressed clearly enough: we don't want to force users on Mac OS X to write their filenames in Latin-1, because the OS is going to choke on that -- but this is exactly what the current validation forces on them with a Latin-1 corpus.
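A small Python sketch of the byte-level mismatch, using a made-up filename. It only shows that the Latin-1 encoding of a non-ASCII name is not valid UTF-8; how the filesystem reacts to such a name is not modelled here.

```python
def is_valid_utf8(raw):
    """Return True if the byte string can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical filename containing one non-ASCII character (o-umlaut).
name = "Ljungl\u00f6f.txt"

latin1_bytes = name.encode("latin-1")  # what a Latin-1 corpus would require
utf8_bytes = name.encode("utf-8")      # what a UTF-8 filesystem expects

print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

A pure ASCII filename has the same bytes in both encodings, which is why restricting filenames to ASCII would sidestep the problem.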
Best,
Stefan