[CWB] CQP output with PrintMode SGML

Peter Ljunglöf peter.ljunglof at gu.se
Sat Aug 14 08:15:03 CEST 2010


Hi developers,

I have started to write a Python wrapper to CQP, and it seems to me that the best KWIC output format is SGML. Then I can extract the KWIC information into a Python object.

However, there are two (or three) problems with the SGML output that make this difficult:

1. The attribute separator is "/", which is a problem when the word or an attribute contains "/", e.g.:

CORPUS> show +lemma;
CORPUS> "1/2";
<CONCORDANCE>
<attribute type=positional name="word" anr=0>
<attribute type=positional name="lemma" anr=1>
<LINE>(...)<CONTENT>(...) <MATCH><TOKEN>1/2/1/2</TOKEN></MATCH> (...)</CONTENT></LINE>
(...)
</CONCORDANCE>

As you can see, it's impossible to extract the attributes reliably from the SGML: "1/2/1/2" can be split into word and lemma in more than one way. My suggestion is to use "<ATTR>" as the attribute separator instead, which will work since "<" and ">" are SGML-escaped in the token text.
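
To illustrate from the Python side, here is a minimal sketch of the token splitting I have in mind. The names split_token and ATTR_SEP are made up for this example (they are not part of CQP), and I am assuming that attribute values are escaped with ordinary entity references such as &lt; and &gt;:

```python
import html

ATTR_SEP = "<ATTR>"  # proposed separator; CQP currently emits "/"


def split_token(token_body, n_attrs, sep="/"):
    """Split the text inside <TOKEN>...</TOKEN> into its attribute values.

    n_attrs is the number of positional attributes being shown
    (e.g. 2 after "show +lemma": word and lemma).
    """
    parts = token_body.split(sep)
    if len(parts) != n_attrs:
        # With "/" as separator this happens whenever a value contains "/".
        raise ValueError(
            f"ambiguous token {token_body!r}: "
            f"expected {n_attrs} values, got {len(parts)}"
        )
    # Assumption: values are entity-escaped, so undo that here.
    return [html.unescape(p) for p in parts]
```

With the current separator, split_token("1/2/1/2", 2) has to give up, whereas split_token("1/2<ATTR>1/2", 2, sep=ATTR_SEP) can return the word and the lemma unambiguously.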

2. When using an aligned corpus, the SGML in the aligned text is escaped:

CORPUS_SWE> show +corpus_nld -lemma;
CORPUS_SWE> "veranda";
<CONCORDANCE>
<attribute type=positional name="word" anr=0>
<LINE>(...)<CONTENT> (...) <MATCH><TOKEN>veranda</TOKEN></MATCH> (...)</CONTENT></LINE>
<align name="saltnld_nld">&lt;CONTENT&gt; (...) &lt;TOKEN&gt;veranda&lt;/TOKEN&gt; (...) &lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;
(...)
</CONCORDANCE>

My suggestion is of course that the aligned text should not be escaped. Also, a closing "</align>" should be printed at the end of the line.
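
Until then, a wrapper can work around it by unescaping the aligned line once before parsing it. The following is only a sketch: parse_align_line is a hypothetical helper, the check for a literal "<TOKEN>" is a heuristic for telling the two output variants apart, and I am assuming that the token text itself is entity-escaped in both cases:

```python
import html
import re

ALIGN_RE = re.compile(r'<align name="([^"]*)">(.*)')
TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>")
ALIGN_END = "</align>"


def parse_align_line(line):
    """Return (attribute_name, [token texts]) for one <align ...> line."""
    m = ALIGN_RE.match(line.strip())
    if m is None:
        raise ValueError("not an align line")
    name, body = m.groups()
    # The closing tag is only present in the proposed output format.
    if body.endswith(ALIGN_END):
        body = body[: -len(ALIGN_END)]
    # Heuristic: no literal <TOKEN> means the markup itself was escaped,
    # so undo one level of escaping to get the tags back.
    if "<TOKEN>" not in body:
        body = html.unescape(body)
    # Assumption: the token text is entity-escaped, so unescape it as well.
    tokens = [html.unescape(t) for t in TOKEN_RE.findall(body)]
    return name, tokens
```

The same function accepts both the escaped line that CQP prints today and the unescaped line with a closing "</align>" that I am suggesting.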

3. A smaller problem (and not a bug at all) is that the rows in the group output are contextual, i.e. a row can only be interpreted together with the rows above it:

CORPUS> X = "de" [];
CORPUS> group X matchend lemma by match pos cut 50;
<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD>&nbsp;<TD>ha<TD>117</TR>
<TR><TD>&nbsp;<TD>skola<TD>100</TR>
<TR><TD>&nbsp;<TD>inte<TD>89</TR>
<TR><TD>&nbsp;<TD>komma<TD>80</TR>
<TR><TD>DT<TD>mången<TD>71</TR>
<TR><TD>&nbsp;<TD>där<TD>61</TR>
<TR><TD>&nbsp;<TD>andra,annan,två<TD>52</TR>
</TABLE>

The first column of the third row contains "&nbsp;", which is a way of saying "the same as above". This is okay for ascii output and HTML output, but SGML is designed for computer readability, so personally I think that a row shouldn't refer to earlier rows. This is similar to "PrettyPrint off", which only works for "PrintMode ascii"...

My suggestion is that the group printer only prints &nbsp; if PrettyPrint is on.
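
With the current output, the wrapper has to carry the first column down itself. A minimal sketch of that, where parse_group_table is a hypothetical helper and I am assuming that every data row has exactly the three-cell shape shown above:

```python
import html
import re

ROW_RE = re.compile(r"<TR><TD>(.*?)<TD>(.*?)<TD>(\d+)</TR>")


def parse_group_table(lines):
    """Turn the group output into (source, target, count) tuples,
    filling in the "same as above" cells."""
    rows = []
    last_source = None
    for line in lines:
        m = ROW_RE.match(line.strip())
        if m is None:
            continue  # skips <TABLE>, </TABLE> and anything unexpected
        source, target, count = m.groups()
        if source == "&nbsp;":
            # "Same as above": reuse the source value of the previous row.
            source = last_source
        else:
            source = html.unescape(source)
            last_source = source
        rows.append((source, html.unescape(target), int(count)))
    return rows
```

If the group printer only printed &nbsp; with PrettyPrint on, the fill-down branch would become unnecessary and each row could be parsed on its own.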

4. I did some digging in the source code, and it was pretty easy to make the necessary changes. (Kudos to the programmers for making the code readable.) Only 4 lines are affected; here's a diff:

Index: sgml-print.c
===================================================================
--- sgml-print.c	(revision 182)
+++ sgml-print.c	(working copy)
@@ -77,7 +77,7 @@
 
   "<TOKEN>",                    /* BeforeToken */
   " ",                          /* TokenSeparator */
-  "/",                          /* AttributeSeparator */
+  "<ATTR>",                     /* AttributeSeparator */
   "</TOKEN>",                   /* AfterToken */
 
   "<CONTENT>",                  /* BeforeField */
@@ -213,7 +213,8 @@
   sgml_puts(stream, "<align name=\"", 0);
   sgml_puts(stream, attribute_name, 0);
   sgml_puts(stream, "\">", 0);
-  sgml_puts(stream, line, SUBST_ALL);
+  sgml_puts(stream, line, 0); 
+  sgml_puts(stream, "</align>", 0);
 
   fputc('\n', stream);
 }
@@ -431,7 +432,7 @@
 
     source_id = group->count_cells[cell].s;
     
-    if (source_id != last_source_id) {
+    if (!pretty_print || (source_id != last_source_id)) {
       last_source_id = source_id;
       sgml_puts(fd, Group_id2str(group, source_id, 0), SUBST_ALL);
       nr_targets = 0;


best, 
/Peter Ljunglöf

________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet



