[CWB] INVALID_CTRL marking \n wrongly? (schtepf)

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Jan 20 05:03:52 CET 2011

A follow-up: this is all done and INVALID_CTRL is no more. A separate
function, cl_string_zap_controls(), is now invoked by the -C flag in
cwb-encode; it has the newline/tab configurability Stefan wanted and is
exposed in the CL API.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
On Behalf Of Stefan Evert
Sent: 06 January 2011 07:53
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] INVALID_CTRL marking \n wrongly? (schtepf)

> Yes, known issue. Me & Stefan were actually talking about precisely
> this at the start of Oct when term hit and we suddenly had no more time
> for programming.
> The current situation is clearly wrong BUT there are certain
> implications regarding parity of treatment of C0 control chars in Latin1
> vs utf8 so it's not obvious what the Right Thing is.

The conclusion at the time was that Andrew is right and I've just been
too much of a pedantic bureaucrat when I added that code.  I had been
firmly convinced that the change had been reverted in the meantime --
that's why it didn't occur to me this could be your problem -- but
apparently both of us in fact stopped coding after we'd reached the

@Andrew: I'm in favour of a separate "cleanup" flag for cwb-encode,
which deletes or rewrites all invalid and control characters, so
careless users don't run into nasty surprises.  (The underlying function
should probably take a flag that decides whether TABs and newlines are
invalid or not.)
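
For illustration only, here is a minimal sketch of what such a cleanup
function might look like. The name zap_controls_sketch and its parameters
are hypothetical and are not the CL API. It assumes a NUL-terminated byte
string and only touches C0 control bytes (below 0x20) and DEL (0x7F);
these are single bytes in both Latin1 and UTF-8, so the same loop is safe
for either encoding. C1 controls and invalid byte sequences are
deliberately left out.

```c
#include <stddef.h>

/* Hypothetical sketch, not the CL API.
 * Cleans C0 control bytes (< 0x20) and DEL (0x7F) out of s, in place.
 * If replace is '\0' the offending bytes are deleted, otherwise each one
 * is overwritten with replace. zap_tabs and zap_newlines decide whether
 * '\t' and '\n' count as control characters. '\r' is always treated as a
 * control character here, which is an assumption of this sketch.
 * Returns the number of bytes deleted or replaced. */
size_t zap_controls_sketch(char *s, char replace, int zap_tabs, int zap_newlines)
{
    char *out = s;
    size_t count = 0;

    for (const char *in = s; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        int is_ctrl = (c < 0x20 || c == 0x7F);

        if (c == '\t')
            is_ctrl = zap_tabs;
        else if (c == '\n')
            is_ctrl = zap_newlines;

        if (!is_ctrl) {
            *out++ = *in;          /* keep ordinary bytes unchanged */
        } else {
            count++;
            if (replace != '\0')
                *out++ = replace;  /* rewrite instead of delete */
        }
    }
    *out = '\0';
    return count;
}
```

Calling it with zap_tabs and zap_newlines both set to 0 leaves TABs and
newlines alone, which matches the configurable behaviour suggested above.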

Cheers & thanks for the quick fix, Andrew,

