[CWB] Concordance printing -- opinions?

Yannick Versley yversley at gmail.com
Tue Sep 21 14:00:39 CEST 2010


Hi all,

> Plain text output would be targeted exclusively at interactive terminals and would have a special implementation (i) to produce correct and robust highlighting/colours and (ii) to display fixed-character context efficiently.  It wouldn't be intended for further automatic processing, offering no special escaping tricks to produce unambiguous output.  For most automatic processing needs, "tabulate" is a much better starting point than "cat" anyway.
Agreed. I did not mean the machine-readable text as a replacement for
the default KWIC display (which, IMO, would be just fine without any
configurable tricks).

> XML output would be a standardised, precisely defined intermediate format for further processing by a GUI front-end etc.  This format need not be user-configurable, since all necessary transformations can easily be achieved with an XSLT stylesheet, Perl script, etc.  It would be shared between cwb-decode and CQP, possibly using a common implementation within the low-level library.
>
> Importantly, the XML output would _not_ support fixed-character context -- that is just a form of presentation, not a sensible definition of context size.  It's easy enough to produce the required display with a short Perl or Python script, anyway.

> BTW, an entirely different strategy I toyed with several years ago -- and which I've always found very appealing -- is to implement a single, unambiguous, machine-readable output format (XML, YAML, or some form of TAB-delimited text records), and then embed a suitable interpreter language in CQP (XSLT, Perl, Python, Lua, ...) that can be used to generate the different output formats.  Producing nicely formatted text or HTML output with fixed-character context in Perl is a breeze -- in C it's weeks of pain.  The problem, of course, is to find a sufficiently light-weight interpreter that can be embedded in the CWB; or we would have to require users to install whatever interpreter we use and run it as an external process.

How about having an environment variable CQP_DISPLAY_SCRIPT that can be
set to a program, which is then called and given sufficient information
(say, the corpus name via the command line and the offsets in
dump/undump format, or just some machine-readable XML)? It could then
do anything ranging from writing text to a file to preparing paginated
HTML and displaying it in a browser.
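
To make that concrete, here is a rough sketch in Python of the
formatting half of such a script. It is only an illustration under a
few assumptions: the offsets arrive as whitespace-separated dump-style
rows whose first two columns are the match and matchend positions, and
the corpus tokens are already available as a plain Python list (a
stand-in for real corpus access, e.g. through a CWB::CL wrapper). The
names parse_dump and kwic_line are made up for this example.

```python
def parse_dump(lines):
    """Parse dump-style rows into (match, matchend) pairs.

    Assumption: each row holds whitespace-separated integers and the
    first two columns are the match and matchend corpus positions.
    Any further columns (e.g. target, keyword) are ignored here.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        rows.append((int(fields[0]), int(fields[1])))
    return rows


def kwic_line(tokens, match, matchend, width=30):
    """Format one hit with fixed-character context on both sides.

    `tokens` is a hypothetical in-memory token list; a real script
    would fetch the tokens from the corpus instead.
    """
    left = " ".join(tokens[:match])
    hit = " ".join(tokens[match:matchend + 1])
    right = " ".join(tokens[matchend + 1:])
    # Keep only the last `width` characters on the left and the first
    # `width` characters on the right.
    left_ctx = left[max(0, len(left) - width):]
    right_ctx = right[:width]
    # Right-align the left context so that the hits line up in a column.
    return f"{left_ctx:>{width}}  [{hit}]  {right_ctx}"
```

A complete script would read the rows from standard input (or wherever
CQP ends up passing them), look up the tokens in the corpus, and print
one kwic_line() result per row.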

Implementing complex functionality in a higher-level language would
indeed be tempting, especially for those languages that already have a
CWB::CL wrapper. (My being tempted is severely limited by my lack of
knowledge about CQP's guts, however :-( ).
At least on Linux, it's safe to assume that Python and Perl are
installed, since lots of programs depend on them (mysql, dselect and
memcached depend on Perl, and things like update-manager, samba4 and
selinux-policy-ubuntu depend on Python in Ubuntu), but the outlook may
be bleak on Windows. Besides the additional dependency, doing things
like cat in Perl/Python may also suffer from inferior speed compared
to C/C++.
Lua is quite useful for embedding, being fast and having a small
footprint, but I doubt that many CL people would be more productive
with it than with C++. Using one of the JavaScript engines (MozillaJS,
V8) would give both speed and a well-known language, but I'd guess they
are somewhat difficult to embed.

Best,
Yannick

