[CWB] Concordance printing -- opinions?

Tue Sep 21 12:13:04 CEST 2010

Thanks for bringing this topic up, Andrew!

> While the topic has arisen: Stefan and I have been planning out the big
> revision of the concordance printing functions. Basically, in place of
> the multiple modes now available, there will be two: XML mode and text
> mode. (So, no HTML, SGML or Latex!)

Actually, I have put a little thought into this in the meantime.  Basically, I see two possible strategies, both of which involve a considerable amount of work, so we should make sure to implement something that's really useful in the long term.

__________
Version A: Keep the kwic-formatting subsystem mostly as is, and just clean up the implementation and make it as configurable as possible within this framework.  The current kwic formatter (and similar code for "group" output etc.) uses a common algorithm to determine context size and put together kwic lines.  Different output modes are implemented by specifying strings that are inserted in various places in the kwic line, e.g. to separate attributes, separate tokens, before/after XML tags, etc.  By clever use of these strings, you can produce output that looks very much like HTML, SGML or LaTeX.

In the implementation I envision, we would make these "boilerplate" strings completely user configurable (I'm thinking about reading them from a file in YAML-like syntax), and provide a few built-in output modes.  In addition to specifying boilerplate strings, the output mode definition would choose from a range of built-in escaping functions (HTML, XML, LaTeX, ...) to take care of special characters.  I would also like to scrap the print options: they are quite messy in my opinion, I've hardly ever used them myself, and they make it more complicated to implement new print modes.
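
To make the idea a bit more concrete, here is a rough Python sketch of how a boilerplate-driven formatter with a selectable escaping function might work.  Everything in it is an assumption made for illustration: the key names (token_sep, attr_sep, match_open, ...), the mode definitions and the function names are hypothetical and do not correspond to existing CQP options or to any agreed file format.

```python
import html

# Hypothetical escaping functions; an output mode picks one by name.
def escape_none(text):
    return text

_LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
                   "_": r"\_", "{": r"\{", "}": r"\}"}

def escape_latex(text):
    # Deliberately incomplete: backslash, ~ and ^ are not handled here.
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)

ESCAPERS = {"none": escape_none, "html": html.escape, "latex": escape_latex}

# Hypothetical mode definitions, as they might be read from a config file.
MODES = {
    "text": {"escape": "none", "token_sep": " ", "attr_sep": "/",
             "match_open": "<", "match_close": ">",
             "line_open": "", "line_close": ""},
    "html": {"escape": "html", "token_sep": " ", "attr_sep": "/",
             "match_open": "<b>", "match_close": "</b>",
             "line_open": "<li>", "line_close": "</li>"},
}

def format_kwic(left, match, right, mode):
    """Build one kwic line; each token is a tuple of attribute values."""
    esc = ESCAPERS[mode["escape"]]

    def render(tokens):
        # Escape attribute values, join attributes, then join tokens.
        return mode["token_sep"].join(
            mode["attr_sep"].join(esc(value) for value in token)
            for token in tokens
        )

    parts = [render(left),
             mode["match_open"] + render(match) + mode["match_close"],
             render(right)]
    body = mode["token_sep"].join(part for part in parts if part)
    return mode["line_open"] + body + mode["line_close"]
```

With left = [("AT&T", "NP")], match = [("rocks", "VBZ")] and right = [("!", ".")], the "html" mode escapes the ampersand and wraps the match in <b> ... </b>, while the "text" mode leaves the tokens untouched.  Note that the algorithm itself is identical for both modes, which is exactly the limitation discussed below.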

This would probably be the easiest and fastest approach, as we don't have to spend much time thinking about what features we want from an ideal kwic formatter; we'd also be able to keep parts of the existing code at least for now and then gradually refactor the ugly bits.  However, it still suffers from several drawbacks of the old system:

 - fixed-character context is _really_ tricky to get right; this is also the reason why the current "cat" output crashes on very long lines (y'know, collecting data in fixed-length buffers without checking for overflow and all that) and is much slower than would be necessary (I think)

 - limited flexibility of the entire "insert boilerplate here"-approach: basically, you always get a traditional kwic line wrapped in HTML, LaTeX, etc.; it's usually impossible to reformat the output in a more "native" way

 - the shared kwic formatter is rather complex, as it has to deal with all the special requirements of the different output formats

 - highlighting and colours in the console output (text mode) are a bad mess of ugly hacks, in order to work around the quirks of Unix terminals (it isn't like HTML, where you can just write <b> ... </b> to switch bold font on and off)

__________
Version B: Reimplement kwic formatting from scratch, offering only two output modes: XML and plain text.

Plain text output would be targeted exclusively at interactive terminals and would have a special implementation (i) to produce correct and robust highlighting/colours and (ii) to display fixed-character context efficiently.  It wouldn't be intended for further automatic processing, offering no special escaping tricks to produce unambiguous output.  For most automatic processing needs, "tabulate" is a much better starting point than "cat" anyway.
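
As a minimal illustration of point (i), the following Python sketch switches bold on and off with ANSI escape sequences, and only does so when the output really goes to a terminal.  It assumes an ANSI-compatible terminal; the function names are made up for this example and say nothing about how CQP would actually implement highlighting.

```python
import sys

# ANSI SGR sequences: ESC[1m switches bold on, ESC[0m resets all attributes.
BOLD = "\x1b[1m"
RESET = "\x1b[0m"

def highlight(text, enabled):
    """Wrap text in bold on/off sequences if highlighting is enabled."""
    return f"{BOLD}{text}{RESET}" if enabled else text

def print_kwic(left, match, right, stream=sys.stdout):
    # Assumption: highlight only on interactive terminals, so that
    # redirected or piped output stays free of escape sequences.
    use_highlight = stream.isatty()
    stream.write(f"{left} {highlight(match, use_highlight)} {right}\n")
```

Calling print_kwic("the quick brown", "fox", "jumps over") in a terminal shows the match in bold; when the output is piped into a file, the line is written without any escape sequences.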

XML output would be a standardised, precisely defined intermediate format for further processing by a GUI front-end etc.  This format need not be user-configurable, since all necessary transformations can easily be achieved with an XSLT stylesheet, Perl script, etc.  It would be shared between cwb-decode and CQP, possibly using a common implementation within the low-level library.

Importantly, the XML output would _not_ support fixed-character context -- that is just a form of presentation, not a sensible definition of context size.  It's easy enough to produce the required display with a short Perl or Python script, anyway.
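
Just to show what I mean by a short script, here is a Python sketch that turns XML concordance lines into a fixed-character display.  The XML format does not exist yet, so the element names used here (line, left, match, right, t) are pure assumptions for the sake of the example.

```python
import xml.etree.ElementTree as ET

def words(elem):
    """Join the <t> children of an element with blanks (empty if missing)."""
    if elem is None:
        return ""
    return " ".join(t.text or "" for t in elem.findall("t"))

def fixed_width_kwic(xml_text, width=20):
    """Return kwic lines with exactly `width` characters of context on
    each side of the match (width must be greater than zero)."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("line"):
        # Left context: keep the last `width` characters, right-aligned.
        left = words(line.find("left"))[-width:].rjust(width)
        match = words(line.find("match"))
        # Right context: keep the first `width` characters, left-aligned.
        right = words(line.find("right"))[:width].ljust(width)
        lines.append(f"{left} [{match}] {right}")
    return lines

# Hypothetical input in the assumed format.
SAMPLE = """
<concordance>
  <line>
    <left><t>the</t><t>quick</t><t>brown</t></left>
    <match><t>fox</t></match>
    <right><t>jumps</t><t>over</t><t>the</t><t>lazy</t><t>dog</t></right>
  </line>
</concordance>
"""
```

fixed_width_kwic(SAMPLE, width=10) cuts the context in the middle of a word where necessary and pads short contexts with blanks, so the matches line up in a column.  Since strings grow dynamically, there are no fixed-length buffers that could overflow.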

> What we'd like to know is if there are any preferences on these fronts
> amongst users - what should the format be like, what options should be
> available, etc. Obviously, we can't offer infinite complexity of
> configuration - so we'd like to know what people consider to be the most
> useful and important options with regards to the display of a
> concordance.

Thanks to Yannick for a first opinion.  I think we all agree on the well-defined XML output mode, which can be provided by both implementation strategies I've sketched above.  I also agree that cwb-decode -X and CQP's XML output should use exactly the same format.

Concerning his second suggestion ("escaped text mode"), this could possibly be included in Version A above -- although it's less straightforward, since the escaping function has to know the current settings for separator characters (or multi-character strings!) and then dynamically escape all possible ambiguities.  Also, I'm not sure how easy it is to convert the escape sequences back to regular characters in further processing -- perhaps numeric XML entities would be the most widely-supported format here?
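
A conservative way to do this is sketched below in Python: every character that occurs in any of the current separator strings (plus the ampersand itself) is replaced by a decimal numeric entity, so that no separator can ever show up inside a token.  This is only one possible approach, and the function names are invented for the example.

```python
import re

def make_escaper(separators):
    """Build an escaping function for the given separator strings."""
    # "&" must always be escaped, otherwise decoding would be ambiguous.
    unsafe = set("&")
    for sep in separators:
        unsafe.update(sep)  # covers multi-character separators as well

    def escape(text):
        return "".join(f"&#{ord(ch)};" if ch in unsafe else ch
                       for ch in text)

    return escape

def unescape(text):
    """Convert decimal numeric entities back to regular characters."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)
```

With the separators "/" and " ", the token and/or is written as and&#47;or, and unescape() restores the original string.  The price is that tokens containing separator characters become harder to read, and that more characters than strictly necessary are escaped when a multi-character separator is used.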

In Version B, this would not be an issue, as the text format is not intended for further processing anyway.

BTW, an entirely different strategy I toyed with several years ago -- and which I've always found very appealing -- is to implement a single, unambiguous, machine-readable output format (XML, YAML, or some form of TAB-delimited text records), and then embed a suitable interpreter language in CQP (XSLT, Perl, Python, Lua, ...) that can be used to generate the different output formats.  Producing nicely formatted text or HTML output with fixed-character context in Perl is a breeze -- in C it's weeks of pain.  The problem, of course, is to find a sufficiently light-weight interpreter that can be embedded in the CWB; or we would have to require users to install whatever interpreter we use and run it as an external process.
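
For illustration, this is roughly what such a formatting script could look like in Python, run as an external process on TAB-delimited records.  The record layout (left context, match, right context, one record per line) is an assumption made up for this example, not a proposal for the actual format.

```python
import html

def records_to_html(records):
    """Format TAB-delimited kwic records as an HTML table.

    Assumed record layout: left context, match and right context,
    separated by TAB characters, one record per line.
    """
    rows = []
    for record in records:
        fields = record.rstrip("\n").split("\t")
        left, match, right = (html.escape(field) for field in fields)
        rows.append(f'<tr><td align="right">{left}</td>'
                    f"<td><b>{match}</b></td><td>{right}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Fed with the lines of a file or pipe, e.g. records_to_html(sys.stdin) after importing sys, this produces one table row per concordance line with the match in bold.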

> Comments and suggestions welcome!

I can only repeat this!

Best wishes,
Stefan