[CWB] CQP output with PrintMode SGML

Peter Ljunglöf peter.ljunglof at gu.se
Thu Aug 19 11:41:53 CEST 2010


Hi Andrew (and others),

On 15 Aug 2010, at 12:36, Hardie, Andrew wrote:

> Thanks for the patch. You're right, the current situation looks very messy and needs to change; I didn't touch the print modules at all in the recent updates to CQP other than to check for Windows incompatibilities. I know that Stefan wanted to do a major overhaul of ALL the print-output modes (if I recall correctly, the plan is to cut them back to two: "plain text" and SGML)

Sounds like a good idea to me. It would also be good to make the SGML output XML-compatible.
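
To illustrate what XML compatibility would mean for the header lines, here is a minimal Python sketch that quotes the bare attribute values and closes the tag, so that a standard XML parser accepts it. The function name to_xml_empty_tag is made up for this example and is not part of CWB or of my wrapper; the regular expression is deliberately naive and would misfire on a quoted value that itself contains "name=value" text.

```python
import re
import xml.etree.ElementTree as ET

# Matches name=value where the value is NOT already quoted.
UNQUOTED_ATTR = re.compile(r"""(\w+)=([^\s"'>]+)""")

def to_xml_empty_tag(tag):
    """Turn an SGML-style start tag into an XML empty-element tag.

    Hypothetical helper for illustration only: it quotes bare attribute
    values and rewrites the closing ">" as "/>".
    """
    quoted = UNQUOTED_ATTR.sub(r'\1="\2"', tag.strip())
    if not quoted.endswith("/>"):
        quoted = quoted[:-1] + "/>"
    return quoted

header = '<attribute type=positional name="word" anr=0>'
fixed = to_xml_empty_tag(header)
element = ET.fromstring(fixed)  # parses only because the tag is now well-formed XML
```

If CQP printed the quotes and the end-tags itself, none of this patching would be needed on the client side.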

Also, if you make ALL commands output SGML/XML (if that is the current print-mode), you can get rid of the PrettyPrint flag. It will be much easier to write a new front-end to CWB then.
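
As an example of the extra work a front-end has to do today, here is a rough Python sketch that reads the current group output (the TABLE example quoted further down) and fills in the "&nbsp;" cells from the row above. The function name parse_group_table is invented for this illustration, and the sketch assumes exactly three cells per row, as in that example; it does not undo any SGML escaping in the cell values.

```python
import re

def parse_group_table(sgml):
    """Parse <TR><TD>source<TD>target<TD>count</TR> rows into tuples.

    Hypothetical helper: a first cell of "&nbsp;" is taken to mean
    "same source value as the previous row" and is filled in.
    """
    rows = []
    last_source = None
    for row in re.findall(r"<TR>(.*?)</TR>", sgml, flags=re.S):
        # The text before the first <TD> is empty, so drop it.
        cells = [cell.strip() for cell in row.split("<TD>")[1:]]
        if len(cells) != 3:
            continue  # not a data row in the assumed layout
        source, target, count = cells
        if source == "&nbsp;":
            source = last_source
        else:
            last_source = source
        rows.append((source, target, int(count)))
    return rows

sample = """<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD>&nbsp;<TD>ha<TD>117</TR>
<TR><TD>DT<TD>mången<TD>71</TR>
<TR><TD>&nbsp;<TD>där<TD>61</TR>
</TABLE>"""
rows = parse_group_table(sample)
```

With context-free rows in the SGML output, the last_source bookkeeping would simply disappear.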

> and there is certainly lots of ugliness in the current SGML setup - not just the use of / as a separator, but also the use of HTML-style tags for tables, and the fact that the SGML is not XML-compatible though it easily could be by adding a few end-tags and quotes around att-vals. But the attribute-divider change can be done on its own prior to major fiddling. I'll add it to the todo list!

I'd be happy if you could do this, since my Python wrapper depends on these (or similar) fixes.
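
To make the dependency concrete, this is roughly how a wrapper could split tokens once "<ATTR>" is the separator. It is only a sketch: parse_tokens and its arguments are hypothetical names, not the actual wrapper code, and it assumes that escaped characters inside values appear as ordinary entity references such as "&lt;".

```python
import re
from html import unescape

SEPARATOR = "<ATTR>"  # the proposed attribute separator
TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>", re.S)

def parse_tokens(line, attr_names):
    """Return one dict per <TOKEN>, mapping attribute names to values.

    Hypothetical helper: attr_names lists the shown attributes in
    output order, e.g. ["word", "lemma"].
    """
    tokens = []
    for body in TOKEN_RE.findall(line):
        # "<" cannot occur unescaped inside a value, so this split is unambiguous.
        values = [unescape(value) for value in body.split(SEPARATOR)]
        tokens.append(dict(zip(attr_names, values)))
    return tokens

new_style = "<MATCH><TOKEN>1/2<ATTR>1/2</TOKEN></MATCH>"
tokens = parse_tokens(new_style, ["word", "lemma"])

# With "/" as the separator the same token is ambiguous:
old_pieces = "1/2/1/2".split("/")
```

With the old separator, old_pieces has four elements and there is no way to tell where the word ends and the lemma begins.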

/Peter


>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it 
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Peter Ljunglöf
>> Sent: 14 August 2010 07:15
>> To: cwb at sslmit.unibo.it
>> Subject: [CWB] CQP output with PrintMode SGML
>> 
>> Hi developers,
>> 
>> I have started to write a Python wrapper to CQP, and it seems 
>> to me that the best KWIC output format is SGML. Then I can 
>> extract the KWIC information into a Python object.
>> 
>> However, there are two (or three) problems with the SGML 
>> output that make this difficult: 
>> 
>> 1. The attribute separator is "/", which is a problem when 
>> the word or an attribute contains "/", e.g.:
>> 
>> CORPUS> show +lemma;
>> CORPUS> "1/2";
>> <CONCORDANCE>
>> <attribute type=positional name="word" anr=0> <attribute 
>> type=positional name="lemma" anr=1>
>> <LINE>(...)<CONTENT>(...) 
>> <MATCH><TOKEN>1/2/1/2</TOKEN></MATCH> (...)</CONTENT></LINE>
>> (...)
>> </CONCORDANCE>
>> 
>> As you can see, it's impossible to extract the attributes 
>> from the SGML. My suggestion is to use "<ATTR>" as the 
>> attribute separator instead, which will work since "<" and 
>> ">" are SGML escaped.
>> 
>> 2. When using an aligned corpus, the SGML in the aligned text 
>> is escaped:
>> 
>> CORPUS_SWE> show +corpus_nld -lemma;
>> CORPUS_SWE> "veranda";
>> <CONCORDANCE>
>> <attribute type=positional name="word" anr=0> 
>> <LINE>(...)<CONTENT> (...) 
>> <MATCH><TOKEN>veranda</TOKEN></MATCH> (...)</CONTENT></LINE> 
>> <align name="saltnld_nld">&lt;CONTENT&gt; (...) 
>> &lt;TOKEN&gt;veranda&lt;/TOKEN&gt; (...) 
>> &lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;
>> (...)
>> </CONCORDANCE>
>> 
>> My suggestion is of course that the aligned text should not 
>> be escaped. Also, that an "</align>" be printed at the end.
>> 
>> 3. A smaller problem (and not a bug at all) is that the rows 
>> in the group output are contextual:
>> 
>> CORPUS> X = "de" [];
>> CORPUS> group X matchend lemma by match pos cut 50;
>> <TABLE>
>> <TR><TD>DT<TD>__UNDEF__<TD>152</TR>
>> <TR><TD>PN<TD>vara<TD>146</TR>
>> <TR><TD>&nbsp;<TD>ha<TD>117</TR>
>> <TR><TD>&nbsp;<TD>skola<TD>100</TR>
>> <TR><TD>&nbsp;<TD>inte<TD>89</TR>
>> <TR><TD>&nbsp;<TD>komma<TD>80</TR>
>> <TR><TD>DT<TD>mången<TD>71</TR>
>> <TR><TD>&nbsp;<TD>där<TD>61</TR>
>> <TR><TD>&nbsp;<TD>andra,annan,två<TD>52</TR>
>> </TABLE>
>> 
>> The 3rd row 1st column contains "&nbsp;", which is a way of 
>> saying "the same as above". This is okay for ascii output and 
>> HTML output, but SGML is designed for computer readability, 
>> so personally I think that it shouldn't refer to earlier 
>> rows. Similar to "PrettyPrint off", which only works for 
>> "PrintMode ascii"...
>> 
>> My suggestion is that the group printer only prints &nbsp; if 
>> PrettyPrint is on.
>> 
>> 4. I did some digging in the source code, and it was pretty 
>> easy to make the necessary changes. (Kudos to the programmers 
>> for making the code readable.) Only 4 lines are affected; 
>> here's a diff:
>> 
>> Index: sgml-print.c
>> ===================================================================
>> --- sgml-print.c	(revision 182)
>> +++ sgml-print.c	(working copy)
>> @@ -77,7 +77,7 @@
>> 
>>   "<TOKEN>",                    /* BeforeToken */
>>   " ",                          /* TokenSeparator */
>> -  "/",                          /* AttributeSeparator */
>> +  "<ATTR>",                     /* AttributeSeparator */
>>   "</TOKEN>",                   /* AfterToken */
>> 
>>   "<CONTENT>",                  /* BeforeField */
>> @@ -213,7 +213,8 @@
>>   sgml_puts(stream, "<align name=\"", 0);
>>   sgml_puts(stream, attribute_name, 0);
>>   sgml_puts(stream, "\">", 0);
>> -  sgml_puts(stream, line, SUBST_ALL);
>> +  sgml_puts(stream, line, 0);
>> +  sgml_puts(stream, "</align>", 0);
>> 
>>   fputc('\n', stream);
>> }
>> @@ -431,7 +432,7 @@
>> 
>>     source_id = group->count_cells[cell].s;
>> 
>> -    if (source_id != last_source_id) {
>> +    if (!pretty_print || (source_id != last_source_id)) {
>>       last_source_id = source_id;
>>       sgml_puts(fd, Group_id2str(group, source_id, 0), SUBST_ALL);
>>       nr_targets = 0;
>> 
>> 
>> best,
>> /Peter Ljunglöf
>> 
>> ________________________________________________________________________________
>> peter ljunglöf, språkbanken, göteborgs universitet
>> 
>> 
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>> 

________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet



