[CWB] [ cwb-Bugs-3046107 ] cqp: concordance output breaks utf8 characters

SourceForge.net noreply at sourceforge.net
Mon Aug 1 00:57:28 CEST 2011


Bugs item #3046107, was opened at 2010-08-16 09:42
Message generated for change (Settings changed) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3046107&group_id=131809

Category: CQP interface
>Group: TODO-3.5
Status: Open
Resolution: None
>Priority: 9
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cqp: concordance output breaks utf8 characters 

Initial Comment:
... and counts them incorrectly.

Worth trying to correct this in advance of major changes to the ****-output modules.



----------------------------------------------------------------------

Comment By: Andrew Hardie (andrewhardie)
Date: 2010-08-17 09:57

Message:
Nothing in the tracker, but we have this on the Unicode roadmap:

"* re-implement character context in kwic output (cat command), where the
current implementation counts bytes instead of characters (and may thus
break MBCs in addition to failing to align query matches)
...
* interactive pager (cat, count, etc.) should automatically be configured
for UTF-8 or ISO-8859-X character set"

And later this, which is what I had in mind as the "major" overhaul:

"* Proper handling of fixed-character context in kwic output (cat) will
require a major rewrite
- affects all kwic-formatting code in cqp/output.c, cqp/print-modes.c,
ascii-print.c, html-print.c, latex-print.c, sgml-print.c, etc.
- this code is inefficient and seriously broken anyway (buffer overflow +
segfault for large context sizes), so it should be re-implemented from
scratch
- recommendation: drop HTML, Latex and SGML modes; just offer ASCII for
interactive use and XML as a general-purpose format (which can easily be
transformed to other formats using XSLT, Perl, etc.)"

My instinct was that we could perhaps fix the character-splitting first,
before digging into things like the buffer overflow and getting rid of
LaTeX, HTML, etc.!
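
For illustration, the character-splitting part on its own comes down to
measuring context in characters rather than bytes. Below is a minimal
sketch in C, assuming valid NUL-terminated UTF-8 input. The function names
are hypothetical and are not taken from the CWB code base; this is not how
cqp/output.c currently works, only one way the byte/character distinction
could be handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers, not part of CWB. All assume valid UTF-8. */

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Number of characters (code points) in s, as opposed to strlen(s),
 * which returns the number of bytes. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    return n;
}

/* Right context: byte length of the first max_chars characters of s
 * (or of the whole string if it is shorter). Cutting at this offset
 * never splits a multi-byte sequence. */
size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t bytes = 0, chars = 0;
    assert(s != NULL);
    while (s[bytes] != '\0') {
        if (!utf8_is_continuation((unsigned char)s[bytes])) {
            if (chars == max_chars)
                break;          /* next character would exceed the limit */
            chars++;
        }
        bytes++;
    }
    return bytes;
}

/* Left context: byte offset at which the last max_chars characters of s
 * begin (0 if s has fewer characters than that). */
size_t utf8_suffix_start(const char *s, size_t max_chars)
{
    size_t pos, chars = 0;
    assert(s != NULL);
    pos = strlen(s);
    while (pos > 0 && chars < max_chars) {
        pos--;
        if (!utf8_is_continuation((unsigned char)s[pos]))
            chars++;            /* stepped back onto a lead byte */
    }
    return pos;
}
```

With helpers along these lines, a context of N characters would be cut at
utf8_prefix_bytes(right, N) and utf8_suffix_start(left, N) instead of at a
fixed byte offset. Padding for column alignment would then be based on
utf8_char_count() rather than strlen(); this still ignores combining and
double-width characters.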


----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2010-08-17 08:38

Message:
Nope, this _is_ the major overhaul of the CQP kwic formatting code that is
so urgently needed.  Currently, it also breaks some of the shell escapes
for highlighting/colour, though I've tried hard to work around that.

Is there a bug tracker item for the ****-output overhaul? If so, it should
be set to high priority and merged with this one.

----------------------------------------------------------------------
