[CWB] [ cwb-Feature Requests-3585285 ] Make cwb-encode handle non-POSIX (win32) linebreaks

SourceForge.net noreply at sourceforge.net
Thu Nov 8 03:52:39 CET 2012


Feature Requests item #3585285, was opened at 2012-11-07 18:52
Message generated for change (Tracker Item Submitted) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722306&aid=3585285&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CWB engine
Group: TODO-3.5
Status: Open
Priority: 5
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: Make cwb-encode handle non-POSIX (win32) linebreaks

Initial Comment:
Moving CWB input text files between Windows and *nix can leave CRLF (0x0d, 0x0a) linebreaks in the input. When this happens, the CR is encoded as part of the value of the final p-attribute on each line. cwb-encode should be able to spot this and work round it. Likewise, the Windows build should be able to cope with POSIX linebreaks; this may already work, but needs checking.

Suggestions for fixing it by Stefan:

 - We could extend -B to remove all whitespace characters around tokens, not just blanks.

 - We should probably change line #46 of cwb-encode.c to 

	#define FIELDSEPS  "\t\n\r"

These suggestions need evaluating, and one or both should be implemented for v3.5.
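
To make the two suggestions concrete, here is a minimal sketch in C. It is not taken from cwb-encode.c: the helper names split_fields and trim_whitespace are hypothetical, and the real tokenizer in cwb-encode may well be structured differently. The only thing borrowed from the report is the proposed FIELDSEPS value. The first function shows how treating CR as a separator keeps it out of the last field; the second shows what stripping all whitespace around a token (rather than just blanks) might look like.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Proposed separator set from the report: TAB, LF and CR. */
#define FIELDSEPS "\t\n\r"

/*
 * Hypothetical helper (not the actual cwb-encode code).
 * Splits `line` in place into TAB-separated fields and stores pointers
 * to them in `fields`; returns the number of fields found.
 * Empty fields are preserved. Splitting stops at the first LF or CR
 * (or at the end of the string), so a trailing CRLF never becomes
 * part of the last field.
 */
static int split_fields(char *line, char **fields, int max_fields)
{
    int n = 0;
    char *p = line;

    while (n < max_fields) {
        size_t len = strcspn(p, FIELDSEPS); /* length up to next separator */
        char sep = p[len];                  /* which separator ended the field */

        fields[n++] = p;
        p[len] = '\0';                      /* terminate the field in place */

        if (sep != '\t')                    /* LF, CR or NUL: end of the line */
            break;
        p += len + 1;                       /* continue after the TAB */
    }
    return n;
}

/*
 * Hypothetical helper illustrating the "-B" suggestion: strips every
 * leading and trailing whitespace character (as defined by isspace),
 * not just blanks. Modifies `s` in place and returns a pointer to the
 * first non-whitespace character.
 */
static char *trim_whitespace(char *s)
{
    char *end;

    while (isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}
```

With this sketch, a line such as "the\tDT\tthe\r\n" yields three fields and the last one is "the" with no CR attached. Note that the two approaches are not equivalent: the separator change only protects against CR at the end of a line, whereas the whitespace-stripping variant would also alter tokens that legitimately begin or end with other whitespace characters.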

----------------------------------------------------------------------


