[CWB] CQP output with PrintMode SGML
Hardie, Andrew
a.hardie at lancaster.ac.uk
Sun Aug 15 12:36:10 CEST 2010
Hi Peter,
Thanks for the patch. You're right: the current situation is messy and needs to change. I didn't touch the print modules at all in the recent updates to CQP, other than to check for Windows incompatibilities.
I know that Stefan wanted to do a major overhaul of ALL the print-output modes (if I recall correctly, the plan is to cut them back to two, "plain text" and SGML), and there is certainly lots of ugliness in the current SGML setup: not just the use of / as a separator, but also the use of HTML-style tags for tables, and the fact that the SGML is not XML-compatible, though it easily could be made so by adding a few end-tags and quoting the attribute values.
But the attribute-divider change can be done on its own, before any major fiddling. I'll add it to the todo list!
Best
Andrew.
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Peter Ljunglöf
> Sent: 14 August 2010 07:15
> To: cwb at sslmit.unibo.it
> Subject: [CWB] CQP output with PrintMode SGML
>
> Hi developers,
>
> I have started to write a Python wrapper to CQP, and it seems
> to me that the best KWIC output format is SGML. Then I can
> extract the KWIC information into a Python object.
>
> However, there are two (or three) problems with SGML output
> which make it difficult:
>
> 1. The attribute separator is "/", which is a problem when
> the word or an attribute contains "/", e.g.:
>
> CORPUS> show +lemma;
> CORPUS> "1/2";
> <CONCORDANCE>
> <attribute type=positional name="word" anr=0> <attribute
> type=positional name="lemma" anr=1>
> <LINE>(...)<CONTENT>(...)
> <MATCH><TOKEN>1/2/1/2</TOKEN></MATCH> (...)</CONTENT></LINE>
> (...)
> </CONCORDANCE>
>
> As you can see, it's impossible to extract the attributes
> from the SGML. My suggestion is to use "<ATTR>" as the
> attribute separator instead, which will work since "<" and
> ">" are SGML escaped.
>
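To make the ambiguity in point 1 concrete, here is a minimal Python sketch. It assumes that attribute values are escaped with the usual entities (&lt;, &gt;, &amp;, &quot;) and that the separator is the proposed "<ATTR>"; the helper names candidate_splits and split_token are hypothetical and not part of CWB or of any existing wrapper.

```python
import html
import re

TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>", re.DOTALL)


def candidate_splits(body, sep="/"):
    """Return every (word, lemma) reading obtained by cutting at one separator."""
    positions = [m.start() for m in re.finditer(re.escape(sep), body)]
    return [(body[:i], body[i + len(sep):]) for i in positions]


def split_token(body, n_attrs, sep="<ATTR>"):
    """Split one <TOKEN> body into attribute values, then unescape each value.

    Splitting happens before unescaping, so a literal "<ATTR>" inside a value
    (which would arrive as "&lt;ATTR&gt;") cannot be mistaken for a separator.
    """
    parts = body.split(sep)
    if len(parts) != n_attrs:
        raise ValueError(f"expected {n_attrs} attributes, got {len(parts)}: {body!r}")
    return [html.unescape(p) for p in parts]


# Current output: three readings, and nothing in the SGML says which is right.
print(candidate_splits("1/2/1/2"))
# [('1', '2/1/2'), ('1/2', '1/2'), ('1/2/1', '2')]

# Output with the proposed separator: exactly one reading.
line = "<MATCH><TOKEN>1/2<ATTR>1/2</TOKEN></MATCH>"
for body in TOKEN_RE.findall(line):
    print(split_token(body, n_attrs=2))
# ['1/2', '1/2']
```

The number of attributes (n_attrs) would come from the <attribute ...> declarations at the top of the concordance, which is why the sketch checks the count instead of guessing.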
> 2. When using an aligned corpus, the SGML in the aligned text
> is escaped:
>
> CORPUS_SWE> show +corpus_nld -lemma;
> CORPUS_SWE> "veranda";
> <CONCORDANCE>
> <attribute type=positional name="word" anr=0>
> <LINE>(...)<CONTENT> (...)
> <MATCH><TOKEN>veranda</TOKEN></MATCH> (...)</CONTENT></LINE>
> <align name="saltnld_nld"><CONTENT> (...)
> <TOKEN>veranda</TOKEN> (...)
> <TOKEN>.</TOKEN> </CONTENT>
> (...)
> </CONCORDANCE>
>
> My suggestion is of course that the aligned text should not
> be escaped. Also, that an "</align>" be printed at the end.
>
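For point 2, a wrapper has to cope with both behaviours until the fix lands. The sketch below is only an illustration under stated assumptions: the aligned line starts with <align name="...">, the escaping uses standard entities that html.unescape can reverse, and the closing </align> may or may not be present. The function name parse_align_line and its body_is_escaped flag are hypothetical.

```python
import html
import re

ALIGN_OPEN_RE = re.compile(r'<align name="([^"]*)">')
TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>", re.DOTALL)
ALIGN_CLOSE = "</align>"


def parse_align_line(line, body_is_escaped):
    """Return (alignment attribute name, list of token strings) for one line."""
    line = line.strip()
    m = ALIGN_OPEN_RE.match(line)
    if m is None:
        raise ValueError(f"not an <align> line: {line!r}")
    body = line[m.end():]
    # Tolerate both outputs: with and without the closing tag.
    if body.endswith(ALIGN_CLOSE):
        body = body[: -len(ALIGN_CLOSE)]
    if body_is_escaped:
        # Current behaviour: the markup itself is escaped, so undo one level
        # of escaping to get the <TOKEN> tags back.
        body = html.unescape(body)
    # Token contents are unescaped like any other token value.
    return m.group(1), [html.unescape(t) for t in TOKEN_RE.findall(body)]


patched = ('<align name="saltnld_nld"><CONTENT> <TOKEN>veranda</TOKEN> '
           '<TOKEN>.</TOKEN> </CONTENT></align>')
print(parse_align_line(patched, body_is_escaped=False))
```

With unescaped output the wrapper needs no special case for aligned lines, and the closing tag makes each line a self-contained element.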
> 3. A smaller problem (and not a bug at all), is that the rows
> in the group output are contextual:
>
> CORPUS> X = "de" [];
> CORPUS> group X matchend lemma by match pos cut 50;
> <TABLE>
> <TR><TD>DT<TD>__UNDEF__<TD>152</TR>
> <TR><TD>PN<TD>vara<TD>146</TR>
> <TR><TD> <TD>ha<TD>117</TR>
> <TR><TD> <TD>skola<TD>100</TR>
> <TR><TD> <TD>inte<TD>89</TR>
> <TR><TD> <TD>komma<TD>80</TR>
> <TR><TD>DT<TD>mången<TD>71</TR>
> <TR><TD> <TD>där<TD>61</TR>
> <TR><TD> <TD>andra,annan,två<TD>52</TR>
> </TABLE>
>
> The 3rd row 1st column contains " ", which is a way of
> saying "the same as above". This is okay for ascii output and
> HTML output, but SGML is designed for computer readability,
> so personally I think that it shouldn't refer to earlier
> rows. This is similar to "PrettyPrint off", which currently only
> takes effect with "PrintMode ascii".
>
> My suggestion is that the group printer only blanks out repeated
> values if PrettyPrint is on, and otherwise prints them in every row.
>
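For point 3, this is roughly what a wrapper has to do with the current group output: carry the last non-blank first column down into the rows that leave it blank. It is a sketch that assumes exactly three <TD> cells per row, as in the example above, and treats a first cell containing only whitespace (including a non-breaking space) as "same as above". The name parse_group_table is hypothetical.

```python
import html
import re

ROW_RE = re.compile(r"<TR>(.*?)</TR>", re.DOTALL)


def parse_group_table(sgml):
    """Return a list of (source, target, count) with blank sources filled in."""
    rows = []
    last_source = None
    for row in ROW_RE.findall(sgml):
        # Cells are opened by <TD> but never closed, so split on the open tag.
        cells = [html.unescape(c) for c in row.split("<TD>")[1:]]
        if len(cells) != 3:
            raise ValueError(f"expected 3 cells, got {len(cells)}: {row!r}")
        source, target, count = cells
        if source.strip() == "":
            if last_source is None:
                raise ValueError("first row has no source value")
            source = last_source  # blank cell: same value as the row above
        else:
            last_source = source
        rows.append((source, target, int(count)))
    return rows


sample = """<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD> <TD>ha<TD>117</TR>
<TR><TD>DT<TD>mången<TD>71</TR>
<TR><TD> <TD>där<TD>61</TR>
</TABLE>"""
for row in parse_group_table(sample):
    print(row)
```

The fill-down cannot tell a blanked cell from a source value that really is empty or whitespace, which is one more reason to print the value in every row when PrettyPrint is off.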
> 4. I did some digging in the source code, and it was pretty
> easy to do the necessary changes. (Kudos to the programmers
> for making the code readable). Only 4 lines are affected,
> here's a diff:
>
> Index: sgml-print.c
> ===================================================================
> --- sgml-print.c (revision 182)
> +++ sgml-print.c (working copy)
> @@ -77,7 +77,7 @@
>
> "<TOKEN>", /* BeforeToken */
> " ", /* TokenSeparator */
> - "/", /* AttributeSeparator */
> + "<ATTR>", /* AttributeSeparator */
> "</TOKEN>", /* AfterToken */
>
> "<CONTENT>", /* BeforeField */
> @@ -213,7 +213,8 @@
> sgml_puts(stream, "<align name=\"", 0);
> sgml_puts(stream, attribute_name, 0);
> sgml_puts(stream, "\">", 0);
> - sgml_puts(stream, line, SUBST_ALL);
> + sgml_puts(stream, line, 0);
> + sgml_puts(stream, "</align>", 0);
>
> fputc('\n', stream);
> }
> @@ -431,7 +432,7 @@
>
> source_id = group->count_cells[cell].s;
>
> - if (source_id != last_source_id) {
> + if (!pretty_print || (source_id != last_source_id)) {
> last_source_id = source_id;
> sgml_puts(fd, Group_id2str(group, source_id, 0), SUBST_ALL);
> nr_targets = 0;
>
>
> best,
> /Peter Ljunglöf
>
> ______________________________________________________________
> __________________
> peter ljunglöf, språkbanken, göteborgs universitet
>
>