[CWB] [cwb:feature-requests] #47 Make cwb-encode handle non-POSIX (win32) linebreaks

Stefan Evert schtepf at users.sf.net
Sat Jul 1 14:55:00 CEST 2017


New suggestion: when reading lines in cwb-encode (as well as cwb-s-encode and cwb-align-encode), strip a trailing CR, and also a BOM at the start of the line (the latter only in utf8 mode).

It would be nice to do this in a function cl_gets (which also cuts off after CL_MAX_LINE_LENGTH characters) so other file input becomes more robust, too. This would still require specifying the charset, or a flag that determines whether a utf8 BOM may be removed at the start of a line.
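
A minimal sketch of what such a helper might look like, not the actual CWB implementation: the names `chomp_line`, `sketch_gets` and `SKETCH_MAX_LINE_LENGTH` are hypothetical, and the real value of CL_MAX_LINE_LENGTH is whatever the CWB sources define. The sketch also assumes that the LF is removed together with the CR, and that an overlong line is simply truncated with the remainder discarded.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for CL_MAX_LINE_LENGTH (assumed value, for illustration). */
#define SKETCH_MAX_LINE_LENGTH 65536

/* Strip a trailing LF, CRLF or CR in place. If strip_bom is non-zero, also
 * remove a UTF-8 BOM (bytes EF BB BF) at the start of the line.
 * Returns the new length of the string. */
size_t chomp_line(char *line, int strip_bom)
{
  size_t len;

  assert(line != NULL);
  len = strlen(line);
  if (len > 0 && line[len - 1] == '\n')
    line[--len] = '\0';
  if (len > 0 && line[len - 1] == '\r')
    line[--len] = '\0';
  if (strip_bom && len >= 3 && memcmp(line, "\xEF\xBB\xBF", 3) == 0) {
    memmove(line, line + 3, len - 3 + 1); /* +1 moves the terminating NUL too */
    len -= 3;
  }
  return len;
}

/* fgets() wrapper: reads one line into buf (at most size - 1 bytes), discards
 * the rest of an overlong line, then normalises it with chomp_line().
 * The utf8 flag decides whether a BOM may be removed.
 * Returns buf, or NULL at end of file or on a read error. */
char *sketch_gets(char *buf, int size, FILE *fp, int utf8)
{
  size_t len;
  int c;

  assert(buf != NULL && fp != NULL && size > 1);
  if (fgets(buf, size, fp) == NULL)
    return NULL;
  len = strlen(buf);
  if (len > 0 && buf[len - 1] != '\n') {
    /* line was longer than the buffer (or the file lacks a final newline):
     * skip ahead to the next line break */
    while ((c = getc(fp)) != EOF && c != '\n')
      ;
  }
  chomp_line(buf, utf8);
  return buf;
}
```

A caller would pass a buffer of `SKETCH_MAX_LINE_LENGTH` bytes and set the `utf8` flag from the corpus charset, which is exactly the extra piece of information mentioned above.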


---

**[feature-requests:#47] Make cwb-encode handle non-POSIX (win32) linebreaks**

**Status:** open
**Group:** TODO-3.5
**Labels:** CWB engine 
**Created:** Thu Nov 08, 2012 02:52 AM UTC by Andrew Hardie
**Last Updated:** Sat Jul 01, 2017 12:48 PM UTC
**Owner:** Andrew Hardie


Moving CWB input text files between Win and *nix can result in CRLF (0x0d, 0x0a) linebreaks being input: if this happens, the CR is encoded as part of the final p-attribute on each line. cwb-encode should be able to spot this and work round it (likewise, in the Win build, be able to cope with POSIX line-breaks; this may already work, but needs checking).

Suggestions for fixing it by Stefan:

- We could extend -B to remove all whitespace characters around tokens, not just blanks.

- We should probably change line #46 of cwb-encode.c to 

	#define FIELDSEPS  "\t\n\r"
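
To illustrate the effect of both suggestions, here is a rough sketch rather than code taken from cwb-encode.c. The helpers `split_fields` and `trim_whitespace` are hypothetical, and standard `strtok()` is used for brevity; it merges adjacent separators, which may differ from how cwb-encode actually tokenises its input lines.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define FIELDSEPS "\t\n\r" /* the definition proposed above */

/* Split line in place at any character contained in seps and store up to
 * max_fields pointers in fields. Returns the number of fields found.
 * Assumption: strtok() semantics, i.e. empty fields are skipped. */
int split_fields(char *line, const char *seps, char **fields, int max_fields)
{
  int n = 0;
  char *tok;

  assert(line != NULL && seps != NULL && fields != NULL);
  for (tok = strtok(line, seps); tok != NULL && n < max_fields;
       tok = strtok(NULL, seps))
    fields[n++] = tok;
  return n;
}

/* Alternative along the lines of the extended -B idea: remove every leading
 * and trailing whitespace character (as classified by isspace()), not just
 * blanks. Modifies s in place and returns a pointer to the trimmed token. */
char *trim_whitespace(char *s)
{
  char *end;

  assert(s != NULL);
  while (isspace((unsigned char)*s))
    s++;
  end = s + strlen(s);
  while (end > s && isspace((unsigned char)end[-1]))
    end--;
  *end = '\0';
  return s;
}
```

With the separators `"\t\n"`, a CRLF-terminated line such as `"the\tDT\r\n"` yields the last field `"DT\r"`; with `FIELDSEPS` as defined here, the CR is treated as a separator and the last field is `"DT"`. Trimming the field with `trim_whitespace` gives the same result without touching the separator set.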

These solutions need evaluating and one or both implementing for v 3.5.

