[CWB] number and <text_id> tag inside a word search

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Feb 22 00:32:48 CET 2016


I’ve just added a check for initial BOM to the “[fix linebreaks]” tool in the CQPweb Admin interface. This might help some other users avoid falling into this issue, though it wouldn’t have helped in this case, obviously.

But note, Daniel, that cwb-encode is actually already programmed to delete the EF-BB-BF sequence if it finds it at the start of a file – but only when the corpus encoding is declared to be UTF-8. You disabled this check by using “-c latin1”.

(Stefan, for reference, this is on line 1625 of cwb-encode.c).
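
For anyone who wants to check their own input files before encoding, the test comes down to looking at the first three bytes. A minimal Python sketch, not the actual code from cwb-encode.c or CQPweb; the function name `has_utf8_bom` is made up for illustration:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    # Open in binary mode so no decoding happens before we look at the bytes.
    with open(path, "rb") as handle:
        # codecs.BOM_UTF8 is the three-byte sequence b"\xef\xbb\xbf".
        return handle.read(3) == codecs.BOM_UTF8
```

Because the file is read as raw bytes, the check works regardless of which encoding is later declared for the corpus.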

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 21 February 2016 21:33
To: Daniel Renau
Cc: CWBdev Mailing List
Subject: Re: [CWB] number and <text_id> tag inside a word search

[CC: to the mailing list in case other people run into the same problem]


On 21 Feb 2016, at 21:38, Daniel Renau <alphak87 at gmail.com> wrote:

Done!
[cid:image002.jpg at 01D16CFF.9737C590]

I erased the first 4 hex pairs... and it works well now :)


EF BB BF is the byte-order mark in UTF-8 … the root of all evil!  BOMs are chronically inserted by Windows editor programs (but by hardly any other software), and they're quite hard to get rid of.

While CWB should really understand and skip the BOM at the start of a UTF-8 file, once you cat together several such files (e.g. feeding them to CWB from stdin), you produce illegal input with BOMs littered throughout the text.
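
One way around the concatenation problem is to strip the BOM from each file before joining them. The following is only a rough Python sketch under that assumption; the helper names `strip_bom` and `cat_without_boms` are hypothetical and not part of CWB:

```python
import codecs


def strip_bom(data):
    """Drop a single leading UTF-8 BOM (EF BB BF) from a bytes object."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def cat_without_boms(paths, out):
    """Write the files in `paths` to the binary stream `out`, one after
    another, removing the BOM at the start of each file if present."""
    for path in paths:
        # Binary mode: the bytes are passed through untouched apart from the BOM.
        with open(path, "rb") as handle:
            out.write(strip_bom(handle.read()))
```

Calling `cat_without_boms(list_of_files, sys.stdout.buffer)` from a small script would then stand in for plain cat when piping input to the encoder. Note that this sketch reads each file into memory in full, which may not be suitable for very large input files.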

Best,
Stefan