[CWB] Concordance printing -- opinions?

Yannick Versley yversley at gmail.com
Mon Sep 20 20:23:40 CEST 2010


Dear Andrew,

here's my two cents on XML mode and text mode:
(a) It would be really cool if the XML output and cwb-encode/cwb-decode
used a vaguely similar format, with attributes for the P- and S-attributes,
such as
<line match="720" matchend="724">
<tok word="Egerland" pos="NE">
instead of
<LINE><MATCHNUM>720</MATCHNUM><CONTENT><TOKEN>Egerland/NE</TOKEN>
(i.e., with a tag for each token instead of tab-separated or /-separated
columns, possibly with the line and tok tags living in their own namespace
to avoid confusion), since this would allow for more expressive treatment
of CQP or cwb-decode output using XQuery/XSLT/whatnot. (Pardon me if CQP
already does this and I missed it.)
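
To make suggestion (a) concrete, here is a minimal Python sketch that builds
one concordance line in the attribute-based format. The function name
concordance_line_to_xml and the input layout (one dict per token, mapping
P-attribute names to values) are my own assumptions for illustration, not
anything CQP or cwb-decode provides.

```python
import xml.etree.ElementTree as ET


def concordance_line_to_xml(match, matchend, tokens):
    """Serialize one concordance line with one <tok> element per token.

    tokens is a list of dicts mapping attribute names (e.g. "word", "pos")
    to string values. This layout is an assumption made for the sketch.
    """
    line = ET.Element("line", {"match": str(match), "matchend": str(matchend)})
    for tok in tokens:
        # Each P-attribute becomes an XML attribute; the serializer takes
        # care of escaping characters such as &, < and quotes.
        ET.SubElement(line, "tok", tok)
    return ET.tostring(line, encoding="unicode")


xml_line = concordance_line_to_xml(
    720,
    724,
    [{"word": "Egerland", "pos": "NE"}, {"word": "a/b", "pos": "$/"}],
)
print(xml_line)
```

Because every token is its own element, a token such as "a/b" needs no
special handling for the "/" separator, and the result can be fed straight
into an XML parser or an XSLT/XQuery processor.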

(b) For the text mode, it would be nice to have a mode where you can have
arbitrary special characters in the tokens which are then escaped.
For example, if you have a token "a/b" with pos "$/" and a separator "/",
you could get something like
a\u002fb/$\u002f  (Python Unicode escape)
or
a\x{2f}b/$\x{2f}  (Perl Unicode escape)
or
a&#x2f;b/$&#x2f;  (XML charref)
as output.
Using such escapes would also make it possible to represent the full
Unicode range in byte-based encodings such as ASCII or the
latin1..15 encodings.
(Basically, each character would be output unchanged unless it is a
separator character (/ or \n in our case), the escape character (\ or &
in this case), or something that is not representable in the charset;
in those cases an escape group would be output instead.)
Being able to select an output encoding globally (i.e., in a config file
or on the command line) that is respected at least for the text output
(but possibly also for the XML output) would be nice.
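
The escaping rule for (b) could be sketched in Python roughly as follows,
using the Python-style \uXXXX escape. The names escape_token and
format_token and their parameters are hypothetical, chosen only for this
illustration; nothing like this exists in CWB as far as I know.

```python
def escape_token(text, separators="/\n", escape_char="\\", encoding="ascii"):
    """Escape separators, the escape character, and unencodable characters.

    Every other character is passed through unchanged. All parameter names
    and defaults are assumptions made for this sketch.
    """
    out = []
    for ch in text:
        needs_escape = ch in separators or ch == escape_char
        if not needs_escape:
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                # Not representable in the target charset.
                needs_escape = True
        if needs_escape:
            cp = ord(ch)
            # \uXXXX inside the BMP, \UXXXXXXXX beyond it.
            out.append("\\u%04x" % cp if cp <= 0xFFFF else "\\U%08x" % cp)
        else:
            out.append(ch)
    return "".join(out)


def format_token(columns, separator="/", encoding="ascii"):
    """Join the escaped attribute values of one token with the separator."""
    return separator.join(
        escape_token(col, separators=separator + "\n", encoding=encoding)
        for col in columns
    )


line = format_token(["a/b", "$/"])
print(line)
```

With the token "a/b" and pos "$/" this reproduces the first variant above,
and a consumer can split each line on unescaped "/" without ambiguity.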

Best,
Yannick

On Mon, Sep 20, 2010 at 7:02 PM, Hardie, Andrew
<a.hardie at lancaster.ac.uk> wrote:
> While the topic has arisen: Stefan and I have been planning out the big
> revision of the concordance printing functions. Basically, in place of
> the multiple modes now available, there will be two: XML mode and text
> mode. (So, no HTML, SGML or LaTeX!)
>
> What we'd like to know is if there are any preferences on these fronts
> amongst users - what should the format be like, what options should be
> available, etc. Obviously, we can't offer infinite complexity of
> configuration - so we'd like to know what people consider to be the most
> useful and important options with regards to the display of a
> concordance.
>
> Comments and suggestions welcome!
>
> best
>
> Andrew.
>
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
> On Behalf Of Lukas Michelbacher
> Sent: 18 September 2010 16:04
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Display options for structural attributes
>
>> For automatic processing, "tabulate" is often more convenient than tweaking
>> the output of "cat".  For instance, you can get exactly the same information
>> in nice TAB-delimited form with
>>       tabulate Last match, match .. matchend word, match .. matchend pos, match story_num;
>
> Perfect, thanks! That's exactly what I was looking for.
>
> Lukas
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>