[CWB] CQP output with PrintMode SGML
Peter Ljunglöf
peter.ljunglof at gu.se
Sat Aug 14 08:15:03 CEST 2010
Hi developers,
I have started to write a Python wrapper to CQP, and it seems to me that the best KWIC output format is SGML. Then I can extract the KWIC information into a Python object.
However, there are two (or three) problems with the SGML output that make this difficult:
1. The attribute separator is "/", which is a problem when the word or an attribute contains "/", e.g.:
CORPUS> show +lemma;
CORPUS> "1/2";
<CONCORDANCE>
<attribute type=positional name="word" anr=0>
<attribute type=positional name="lemma" anr=1>
<LINE>(...)<CONTENT>(...) <MATCH><TOKEN>1/2/1/2</TOKEN></MATCH> (...)</CONTENT></LINE>
(...)
</CONCORDANCE>
As you can see, it's impossible to extract the attributes from the SGML. My suggestion is to use "<ATTR>" as the attribute separator instead, which will work since "<" and ">" are SGML escaped.
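A minimal Python sketch of why the separator matters. The helper names split_token and slash_splits are made up for illustration, and the sketch assumes that CQP escapes "<", ">" and "&" inside attribute values as the usual SGML entities:

```python
import html

def split_token(token_body, separator="<ATTR>"):
    """Split the text found between <TOKEN> and </TOKEN> into attribute values.

    Assumes "<", ">" and "&" inside values are escaped as SGML entities,
    so the literal separator can never occur inside a value.
    """
    return [html.unescape(value) for value in token_body.split(separator)]

def slash_splits(token_body):
    """List every way a two-attribute token could be split at a "/"."""
    return [(token_body[:i], token_body[i + 1:])
            for i, char in enumerate(token_body) if char == "/"]

# Current output: three candidate (word, lemma) pairs, no way to pick one.
print(slash_splits("1/2/1/2"))

# Proposed output: exactly one way to split.
print(split_token("1/2<ATTR>1/2"))
```

With "/" the wrapper has to guess between three splits; with "<ATTR>" the split is unique, and escaped characters inside a value are restored afterwards.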
2. When using an aligned corpus, the SGML in the aligned text is escaped:
CORPUS_SWE> show +corpus_nld -lemma;
CORPUS_SWE> "veranda";
<CONCORDANCE>
<attribute type=positional name="word" anr=0>
<LINE>(...)<CONTENT> (...) <MATCH><TOKEN>veranda</TOKEN></MATCH> (...)</CONTENT></LINE>
<align name="saltnld_nld">&lt;CONTENT&gt; (...) &lt;TOKEN&gt;veranda&lt;/TOKEN&gt; (...) &lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;
(...)
</CONCORDANCE>
My suggestion is, of course, that the aligned text should not be escaped. Also, a closing "</align>" should be printed at the end.
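Until that changes, a wrapper can work around the escaping. The following is only a sketch: aligned_tokens is a hypothetical helper, and it assumes that the aligned text arrives with "<" and ">" escaped as "&lt;" and "&gt;", and that token values inside it were already escaped once before the whole line was escaped again:

```python
import html
import re

ALIGN_RE = re.compile(r'<align name="([^"]*)">(.*)', re.DOTALL)
TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>")

def aligned_tokens(align_line):
    """Return (attribute name, list of tokens) for one <align ...> line."""
    match = ALIGN_RE.match(align_line)
    if match is None:
        raise ValueError("not an <align> line")
    name, body = match.groups()
    # First unescape: undo the escaping applied to the whole aligned line.
    body = html.unescape(body)
    # Second unescape: undo the escaping of each token value (assumption).
    return name, [html.unescape(token) for token in TOKEN_RE.findall(body)]

line = ('<align name="saltnld_nld">&lt;CONTENT&gt; '
        '&lt;TOKEN&gt;veranda&lt;/TOKEN&gt; &lt;TOKEN&gt;.&lt;/TOKEN&gt; '
        '&lt;/CONTENT&gt;')
print(aligned_tokens(line))
```

If the aligned text were printed unescaped and closed with "</align>", the first unescape step would not be needed and the line could be parsed like any other <CONTENT> element.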
3. A smaller problem (and not a bug at all) is that the rows in the group output are contextual:
CORPUS> X = "de" [];
CORPUS> group X matchend lemma by match pos cut 50;
<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD> <TD>ha<TD>117</TR>
<TR><TD> <TD>skola<TD>100</TR>
<TR><TD> <TD>inte<TD>89</TR>
<TR><TD> <TD>komma<TD>80</TR>
<TR><TD>DT<TD>mången<TD>71</TR>
<TR><TD> <TD>där<TD>61</TR>
<TR><TD> <TD>andra,annan,två<TD>52</TR>
</TABLE>
The first column of the 3rd row contains " ", which is a way of saying "the same as above". This is okay for ASCII output and HTML output, but SGML is meant to be machine-readable, so personally I think that a row shouldn't refer to earlier rows. This is similar to "PrettyPrint off", which only works for "PrintMode ascii"...
My suggestion is that the group printer only leaves out repeated values if PrettyPrint is on.
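For the current output, a wrapper has to fill in the missing values itself. A minimal sketch, with parse_group_table as a hypothetical helper, assuming the "same as above" cell is either blank or "&nbsp;" and that a real source value is never empty:

```python
import re

ROW_RE = re.compile(r"<TR><TD>(.*?)<TD>(.*?)<TD>(\d+)</TR>")

def parse_group_table(sgml):
    """Return (source, target, count) rows with blank sources filled in."""
    rows = []
    last_source = None
    for source, target, count in ROW_RE.findall(sgml):
        source = source.replace("&nbsp;", " ").strip()
        if source:
            # A new source value starts here; remember it for later rows.
            last_source = source
        rows.append((last_source, target, int(count)))
    return rows

table = """<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD> <TD>ha<TD>117</TR>
<TR><TD>DT<TD>mången<TD>71</TR>
<TR><TD> <TD>där<TD>61</TR>
</TABLE>"""
for row in parse_group_table(table):
    print(row)
```

This only works if the rows are read in order, which is exactly the dependency on earlier rows that non-pretty output would avoid.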
4. I did some digging in the source code, and it was pretty easy to make the necessary changes (kudos to the programmers for making the code readable). Only 4 lines are affected; here's a diff:
Index: sgml-print.c
===================================================================
--- sgml-print.c (revision 182)
+++ sgml-print.c (working copy)
@@ -77,7 +77,7 @@
"<TOKEN>", /* BeforeToken */
" ", /* TokenSeparator */
- "/", /* AttributeSeparator */
+ "<ATTR>", /* AttributeSeparator */
"</TOKEN>", /* AfterToken */
"<CONTENT>", /* BeforeField */
@@ -213,7 +213,8 @@
sgml_puts(stream, "<align name=\"", 0);
sgml_puts(stream, attribute_name, 0);
sgml_puts(stream, "\">", 0);
- sgml_puts(stream, line, SUBST_ALL);
+ sgml_puts(stream, line, 0);
+ sgml_puts(stream, "</align>", 0);
fputc('\n', stream);
}
@@ -431,7 +432,7 @@
source_id = group->count_cells[cell].s;
- if (source_id != last_source_id) {
+ if (!pretty_print || (source_id != last_source_id)) {
last_source_id = source_id;
sgml_puts(fd, Group_id2str(group, source_id, 0), SUBST_ALL);
nr_targets = 0;
best,
/Peter Ljunglöf
________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet