[CWB] [ cwb-Bugs-3046107 ] cqp: concordance output breaks utf8 characters
SourceForge.net noreply at sourceforge.net
Tue Aug 17 11:57:27 CEST 2010

Bugs item #3046107, was opened at 2010-08-16 09:42
Message generated for change (Comment added) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3046107&group_id=131809
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQP interface
Group: None
Status: Open
Resolution: None
Priority: 8
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cqp: concordance output breaks utf8 characters 
Initial Comment:
... and counts them incorrectly.
It is worth trying to correct this in advance of major changes to the ****-output modules.
----------------------------------------------------------------------
>Comment By: Andrew Hardie (andrewhardie)
Date: 2010-08-17 09:57
Message:
Nothing in the tracker, but we have this on the Unicode roadmap:
"* re-implement character context in kwic output (cat command), where the
current implementation counts bytes instead of characters (and may thus
break MBCs in addition to failing to align query matches)
...
* interactive pager (cat, count, etc.) should automatically be configured
for UTF-8 or ISO-8859-X character set"
And later this, which is what I had in mind as the "major" overhaul:
"* Proper handling of fixed-character context in kwic output (cat) will
require a major rewrite
- affects all kwic-formatting code in cqp/output.c, cqp/print-modes.c,
ascii-print.c, html-print.c, latex-print.c, sgml-print.c, etc.
- this code is inefficient and seriously broken anyway (buffer overflow +
segfault for large context sizes), so it should be re-implemented from
scratch
- recommendation: drop HTML, Latex and SGML modes; just offer ASCII for
interactive use and XML as a general-purpose format (which can easily be
transformed to other formats using XSLT, Perl, etc.)"
My instinct was that we could perhaps fix the character-splitting first,
before digging into things like the buffer overflow and getting rid of
LaTeX, HTML, etc.!
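
For illustration only, here is a minimal sketch in C of what counting
characters rather than bytes could look like for UTF-8 context strings. It
is not taken from the CWB sources: the function names (utf8_char_count,
utf8_prefix_bytes, utf8_suffix_start) are hypothetical, and the sketch
assumes valid, NUL-terminated UTF-8 input and treats one code point as one
character (so combining sequences are not handled).

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Number of characters (code points) in a valid UTF-8 string:
 * every byte that is not a continuation byte starts a character. */
size_t utf8_char_count(const char *s)
{
  size_t n = 0;
  for (; *s; s++)
    if (!is_continuation((unsigned char)*s))
      n++;
  return n;
}

/* Number of bytes occupied by the first max_chars characters of s
 * (right-hand context). The result never falls inside a multi-byte
 * sequence, so cutting the string there cannot split a character. */
size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
  size_t bytes = 0, chars = 0;
  while (s[bytes]) {
    if (!is_continuation((unsigned char)s[bytes])) {
      if (chars == max_chars)
        break;               /* next character would exceed the limit */
      chars++;
    }
    bytes++;
  }
  return bytes;
}

/* Byte offset at which the last max_chars characters of s[0..len)
 * begin (left-hand context). Walks backwards and only counts a
 * character once its lead byte has been reached. */
size_t utf8_suffix_start(const char *s, size_t len, size_t max_chars)
{
  size_t pos = len, chars = 0;
  while (pos > 0 && chars < max_chars) {
    pos--;
    if (!is_continuation((unsigned char)s[pos]))
      chars++;
  }
  return pos;
}
```

With the 6-byte string "h\xC3\xA9llo" ("héllo"), a byte-based cut after 2
bytes would land in the middle of the "é", whereas utf8_prefix_bytes(s, 2)
gives 3, the end of the second character.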
----------------------------------------------------------------------
Comment By: Stefan Evert (schtepf)
Date: 2010-08-17 08:38
Message:
Nope, this _is_ the major overhaul of the CQP kwic formatting code that is
so urgently needed.  Currently, it also breaks some of the shell escapes
for highlighting/colour, though I've tried hard to work around that.
Is there a bug tracker item for the ****-output overhaul? Should be set to
high priority and merged with this one.
----------------------------------------------------------------------