[CWB] CQP output with PrintMode SGML

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Aug 15 12:36:10 CEST 2010


Hi Peter,

Thanks for the patch. You're right, the current situation looks very messy and needs to change; I didn't touch the print modules at all in the recent updates to CQP other than to check for Windows incompatibilities. I know that Stefan wanted to do a major overhaul of ALL the print-output modes (if I recall correctly, the plan is to cut them back to two: "plain text" and SGML), and there is certainly lots of ugliness in the current SGML setup: not just the use of / as a separator, but also the use of HTML-style tags for tables, and the fact that the SGML is not XML-compatible, though it easily could be made so by adding a few end-tags and quotes around attribute values. But the attribute-divider change can be done on its own, ahead of the major overhaul. I'll add it to the todo list!

Best

Andrew.

> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it 
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Peter Ljunglöf
> Sent: 14 August 2010 07:15
> To: cwb at sslmit.unibo.it
> Subject: [CWB] CQP output with PrintMode SGML
> 
> Hi developers,
> 
> I have started to write a Python wrapper to CQP, and it seems 
> to me that the best KWIC output format is SGML. Then I can 
> extract the KWIC information into a Python object.
> 
> However, there are two (or three) problems with SGML output 
> that make this difficult: 
> 
> 1. The attribute separator is "/", which is a problem when 
> the word or an attribute contains "/", e.g.:
> 
> CORPUS> show +lemma;
> CORPUS> "1/2";
> <CONCORDANCE>
> <attribute type=positional name="word" anr=0> <attribute 
> type=positional name="lemma" anr=1>
> <LINE>(...)<CONTENT>(...) 
> <MATCH><TOKEN>1/2/1/2</TOKEN></MATCH> (...)</CONTENT></LINE>
> (...)
> </CONCORDANCE>
> 
> As you can see, it's impossible to extract the attributes 
> from the SGML. My suggestion is to use "<ATTR>" as the 
> attribute separator instead, which will work since "<" and 
> ">" are SGML escaped.
> 
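
A minimal Python sketch of how a wrapper could split tokens once "<ATTR>" is the separator. The helper name parse_tokens and the sample line are made up for illustration (the sample is not real CQP output), and the sketch assumes attribute values are entity-escaped as described above.

```python
import html
import re

# Matches the body of each <TOKEN>...</TOKEN> element on a concordance line.
TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>", re.DOTALL)


def parse_tokens(line, attr_names):
    """Return one dict per token, mapping attribute name to its value.

    attr_names lists the shown attributes in output order,
    e.g. ["word", "lemma"] after "show +lemma".
    """
    tokens = []
    for body in TOKEN_RE.findall(line):
        # "<ATTR>" cannot occur inside a value, because "<" and ">" are
        # escaped there, so a plain split is unambiguous.
        values = [html.unescape(v) for v in body.split("<ATTR>")]
        tokens.append(dict(zip(attr_names, values)))
    return tokens


# Hand-written sample in the proposed format (an assumption).
sample = "<MATCH><TOKEN>1/2<ATTR>1/2</TOKEN></MATCH> <TOKEN>a&lt;b<ATTR>a</TOKEN>"
print(parse_tokens(sample, ["word", "lemma"]))
# [{'word': '1/2', 'lemma': '1/2'}, {'word': 'a<b', 'lemma': 'a'}]
```

With "/" as the separator the same split would yield four values for "1/2/1/2", with no way to tell which belong to the word and which to the lemma.
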
> 2. When using an aligned corpus, the SGML in the aligned text 
> is escaped:
> 
> CORPUS_SWE> show +corpus_nld -lemma;
> CORPUS_SWE> "veranda";
> <CONCORDANCE>
> <attribute type=positional name="word" anr=0> 
> <LINE>(...)<CONTENT> (...) 
> <MATCH><TOKEN>veranda</TOKEN></MATCH> (...)</CONTENT></LINE> 
> <align name="saltnld_nld">&lt;CONTENT&gt; (...) 
> &lt;TOKEN&gt;veranda&lt;/TOKEN&gt; (...) 
> &lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;
> (...)
> </CONCORDANCE>
> 
> My suggestion is of course that the aligned text should not 
> be escaped. Also, an "</align>" should be printed at the end.
> 
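
Until the output changes, a wrapper could work around the escaping along these lines. This is only a sketch: parse_align is a hypothetical helper, the sample lines are hand-written, and it assumes that in the current output the whole aligned region has been escaped exactly once more than the tokens inside it.

```python
import html
import re

# The closing </align> is optional because the current output omits it.
ALIGN_RE = re.compile(r'<align name="([^"]*)">(.*?)(?:</align>)?\s*$', re.DOTALL)
TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>", re.DOTALL)


def parse_align(line):
    """Return (attribute_name, [words]) for one <align ...> line, or None."""
    m = ALIGN_RE.match(line.strip())
    if m is None:
        return None
    name, body = m.groups()
    if "<TOKEN>" not in body:
        # Current behaviour: the markup itself is escaped, so undo one level.
        body = html.unescape(body)
    # Token text is assumed to carry one remaining level of escaping.
    words = [html.unescape(t) for t in TOKEN_RE.findall(body)]
    return name, words


escaped = ('<align name="saltnld_nld">&lt;CONTENT&gt;&lt;TOKEN&gt;veranda'
           '&lt;/TOKEN&gt; &lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;')
patched = ('<align name="saltnld_nld"><CONTENT><TOKEN>veranda</TOKEN> '
           '<TOKEN>.</TOKEN> </CONTENT></align>')
print(parse_align(escaped))
print(parse_align(patched))
```

Both calls are meant to give the same result, so the wrapper would keep working if the escaping is dropped and "</align>" is added.
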
> 3. A smaller problem (and not a bug at all), is that the rows 
> in the group output are contextual:
> 
> CORPUS> X = "de" [];
> CORPUS> group X matchend lemma by match pos cut 50;
> <TABLE>
> <TR><TD>DT<TD>__UNDEF__<TD>152</TR>
> <TR><TD>PN<TD>vara<TD>146</TR>
> <TR><TD>&nbsp;<TD>ha<TD>117</TR>
> <TR><TD>&nbsp;<TD>skola<TD>100</TR>
> <TR><TD>&nbsp;<TD>inte<TD>89</TR>
> <TR><TD>&nbsp;<TD>komma<TD>80</TR>
> <TR><TD>DT<TD>mången<TD>71</TR>
> <TR><TD>&nbsp;<TD>där<TD>61</TR>
> <TR><TD>&nbsp;<TD>andra,annan,två<TD>52</TR>
> </TABLE>
> 
> The 3rd row 1st column contains "&nbsp;", which is a way of 
> saying "the same as above". This is okay for ascii output and 
> HTML output, but SGML is designed for computer readability, 
> so personally I think that it shouldn't refer to earlier 
> rows. Similar to "PrettyPrint off", which only works for 
> "PrintMode ascii"...
> 
> My suggestion is that the group printer only prints &nbsp; if 
> PrettyPrint is on.
> 
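
On the consumer side, the contextual rows can be expanded by carrying the last seen source value forward. A minimal sketch, assuming the three-column row layout shown above; parse_group is a hypothetical helper name and the sample reuses the first rows of the table in this message.

```python
import html
import re

ROW_RE = re.compile(r"<TR>(.*?)</TR>")


def parse_group(table):
    """Return (source, target, count) tuples with "&nbsp;" cells filled in."""
    rows = []
    last_source = None
    for body in ROW_RE.findall(table):
        # Cells have no end tags, so each one starts at a "<TD>".
        source, target, count = body.split("<TD>")[1:]
        if source == "&nbsp;":
            # "Same as above": reuse the source from the previous row.
            source = last_source
        else:
            source = html.unescape(source)
        last_source = source
        rows.append((source, html.unescape(target), int(count)))
    return rows


sample = """<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD>&nbsp;<TD>ha<TD>117</TR>
</TABLE>"""
print(parse_group(sample))
# [('DT', '__UNDEF__', 152), ('PN', 'vara', 146), ('PN', 'ha', 117)]
```

If the printer repeats the source value on every row, as suggested, the "&nbsp;" branch is simply never taken.
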
> 4. I did some digging in the source code, and it was pretty 
> easy to make the necessary changes. (Kudos to the programmers 
> for making the code readable.) Only 4 lines are affected; 
> here's a diff:
> 
> Index: sgml-print.c
> ===================================================================
> --- sgml-print.c	(revision 182)
> +++ sgml-print.c	(working copy)
> @@ -77,7 +77,7 @@
>  
>    "<TOKEN>",                    /* BeforeToken */
>    " ",                          /* TokenSeparator */
> -  "/",                          /* AttributeSeparator */
> +  "<ATTR>",                     /* AttributeSeparator */
>    "</TOKEN>",                   /* AfterToken */
>  
>    "<CONTENT>",                  /* BeforeField */
> @@ -213,7 +213,8 @@
>    sgml_puts(stream, "<align name=\"", 0);
>    sgml_puts(stream, attribute_name, 0);
>    sgml_puts(stream, "\">", 0);
> -  sgml_puts(stream, line, SUBST_ALL);
> +  sgml_puts(stream, line, 0);
> +  sgml_puts(stream, "</align>", 0);
>  
>    fputc('\n', stream);
>  }
> @@ -431,7 +432,7 @@
>  
>      source_id = group->count_cells[cell].s;
>      
> -    if (source_id != last_source_id) {
> +    if (!pretty_print || (source_id != last_source_id)) {
>        last_source_id = source_id;
>        sgml_puts(fd, Group_id2str(group, source_id, 0), SUBST_ALL);
>        nr_targets = 0;
> 
> 
> best,
> /Peter Ljunglöf
> 
> ________________________________________________________________________________
> peter ljunglöf, språkbanken, göteborgs universitet
> 
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> 