[CWB] Encoding problem in CWB version 3.2.4

Tue Sep 28 11:21:29 CEST 2010

On 28 Sep 2010, at 10:54, Stefan Evert wrote:

> Ah, I see the problem now, and can reproduce Tomaz's problem if I run cqp without "-e" (something one should _never_ do, of course ;-).  With command-line editing enabled, the newline is stripped before each command is passed to CQP.

Yes, I can confirm that too. But only if I have a Latin-1 corpus, and only on the second query (and later):

$ cqp -D MINISUC
MINISUC> "xyzzyz";
0 matches.
MINISUC> "xyzzyz";
CQP Error:
	Query includes a character or character sequence that is invalid
in the encoding specified for this corpus

I have no problems on UTF-8 corpora, nor if I use "-e".

> In my book, cl_string_validate_encoding() should only be called on "object" strings and regular expressions passed to the CL, not on complete queries and input lines.
> 
> I recall adding the TAB-exception to INVALID_CTRL specifically so we could run it on the full input lines in cwb-encode, so I'm clearly not true to myself.  However, I don't think I had problems with the -C option like those reported by Peter, so perhaps his files have Windows line endings that don't get stripped properly on a Unix platform?

Nope, pure unix line endings. I have the cwb-encode latin-1 problems on my Mac with the latest subversion (CWB v3.2.4), but not on a Linux server with CWB 3.2.b1.
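
For reference, a minimal sketch of the "strip the line ending before validating" idea raised in the quoted paragraph. It is not CWB code: strip_line_ending() and has_no_invalid_ctrl() are hypothetical helpers written for illustration, and the macro is the one quoted further down in this thread, with the argument parenthesized. The real cl_string_validate_encoding() also checks the corpus encoding, which is left out here.

```c
#include <stddef.h>
#include <string.h>

/* Control-character test as quoted in this thread: everything below 0x20
 * except TAB (0x09) counts as invalid. */
#define INVALID_CTRL(c) ((c) < 0x20 && (c) != 0x09)

/* Hypothetical helper (not a CWB function): remove trailing CR and LF
 * characters in place and return the new string length. */
size_t strip_line_ending(char *line)
{
  size_t len = strlen(line);
  while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
    line[--len] = '\0';
  return len;
}

/* Hypothetical stand-in for the control-character part of the validation:
 * returns 1 if no byte is rejected by INVALID_CTRL, 0 otherwise.
 * Bytes are read as unsigned char so that values >= 0x80 (common in
 * Latin-1 text) are never seen as negative numbers. */
int has_no_invalid_ctrl(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;
  for (; *p != '\0'; p++)
    if (INVALID_CTRL(*p))
      return 0;
  return 1;
}
```

With these helpers, a TAB-separated input line that still carries its "\n" is rejected, and the same line is accepted once strip_line_ending() has been applied. This only illustrates the mechanism; it does not explain why the behaviour differs between the Mac and the Linux installation.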

> The easy solution is to make INVALID_CTRL more lenient so that it allows CR and LF characters as well -- that should cover everything that's allowed to pop up in CQP queries, right?
> 
> I think it would still be better to do the validation only for those strings that are actually used by CL functions.  I'm a little bit queasy about the use of QueryBuffer -- IIRC, that's a bad hack I designed for my macro expansion technique, and I've never been confident that the whole thing works correctly :-).
> 
> One negative side-effect is that filenames read and written from within CQP must be in the current corpus encoding.  If anything, we should force people to use ASCII filenames, but not e.g. Latin1 on a Mac OS X computer.  Any chance of improving the validation method in the mid to long run, or is that just too complicated?
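
A rough sketch of what the "more lenient INVALID_CTRL" proposed above could look like, next to the strict version quoted later in this thread. The name INVALID_CTRL_LENIENT and the count_flagged() helper are assumptions made for this example, not part of the CWB sources.

```c
/* Strict test as quoted in this thread (argument parenthesized):
 * only TAB (0x09) is exempt. */
#define INVALID_CTRL(c) ((c) < 0x20 && (c) != 0x09)

/* Hypothetical lenient variant: additionally exempts LF (0x0A) and CR (0x0D). */
#define INVALID_CTRL_LENIENT(c) \
  ((c) < 0x20 && (c) != 0x09 && (c) != 0x0A && (c) != 0x0D)

/* Hypothetical helper: count the bytes of s that the chosen test flags.
 * lenient == 0 uses INVALID_CTRL, anything else uses INVALID_CTRL_LENIENT. */
int count_flagged(const char *s, int lenient)
{
  const unsigned char *p = (const unsigned char *)s;
  int n = 0;
  for (; *p != '\0'; p++) {
    if (lenient ? INVALID_CTRL_LENIENT(*p) : INVALID_CTRL(*p))
      n++;
  }
  return n;
}
```

For a query line such as "xyzzyz"; followed by a newline, the strict test flags the final newline while the lenient one flags nothing. Other control characters are still rejected by both variants.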

Mac OS X doesn't use Latin-1 filenames anyway; it uses UTF-8. But there are still filename problems between operating systems even when both use UTF-8, because they use different Unicode normalization forms (NFD on Mac, NFC on Linux).
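
To make the normalization point concrete, here is a small sketch with a made-up filename. The same visible name is spelled once with precomposed "ö" (U+00F6, as in NFC) and once as "o" followed by the combining diaeresis U+0308 (as in NFD); both are valid UTF-8, but a plain byte comparison treats them as different names. The filename and the same_bytes() helper are illustrative assumptions only.

```c
#include <string.h>

/* "göteborg.txt" with precomposed U+00F6 (UTF-8 bytes C3 B6). */
static const char NAME_NFC[] = "g" "\xC3\xB6" "teborg.txt";

/* The same name with "o" + combining diaeresis U+0308 (UTF-8 bytes CC 88). */
static const char NAME_NFD[] = "g" "o" "\xCC\x88" "teborg.txt";

/* Hypothetical helper: byte-wise equality, which is what a comparison
 * without Unicode normalization amounts to. */
static int same_bytes(const char *a, const char *b)
{
  return strcmp(a, b) == 0;
}
```

same_bytes(NAME_NFC, NAME_NFD) returns 0 here, and the two strings even differ in length (13 versus 14 bytes). Comparing such names reliably needs a normalization step first, for example via a Unicode library, which this sketch deliberately leaves out.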

/Peter

> @Tomaz: If you get these error messages, that means your corpus encoding is not properly declared as "utf8", but rather as "latin1" (which may be implicitly set for an undeclared encoding?). Can you check your registry file or the output of CQP's "info" command?  Did you automatically generate the makefile with cwb-encode's -R option?  This is very much recommended to ensure consistency of the charset declaration.
> 
> Cheers,
> Stefan
> 
> On 28 Sep 2010, at 10:18, Hardie, Andrew wrote:
> 
>> @Stefan - it's probably the INVALID_CTRL macro in special-chars. It
>> treats anything under 0x20 as invalid, which I think you put in because
>> you can't have \n or \r in CWB tokens. 
>> 
>> #define INVALID_CTRL(c) (c < 0x20 && c != 0x09)
>> 
>> BUT when that same function is used to validate query strings it will
>> fail - they have a \n on the end because the user just pressed ENTER. I
>> *think* that's it, anyway. The parser code is a source of immense and
>> continuing mystery to me.
>> 
>> (Also, I believe I made cwb-encode run the validation check on entire
>> input lines, not on individual tokens - so that may also now be a bug if
>> the line break is not trimmed off before that happens.)
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet