[CWB] CQP output with PrintMode SGML

Thu Aug 19 13:04:04 CEST 2010

Hi Yannick,

Thanks, I didn't know about your wrapper. I'll definitely look into it someday.

I also didn't know about Jørg Asmussen's CQP wrapper. It's similar to my approach, but mine has some higher-level methods for extracting concordances.

best, 
/Peter

On 19 Aug 2010, at 12:02, Yannick Versley wrote:

> Hi Peter,
> 
> just as an aside: did you try the Python wrapper at
> http://bitbucket.org/yannick/cwb-python/overview ?
> It talks directly to the low-level CWB code, without any pipe in between,
> and would allow you to retrieve tokens directly.
> This means that you have to do some work yourself (i.e., getting
> the offset information from the CQP search subprocess and deciding
> what amount of context to display) but it's more flexible overall.
> 
> Best,
> Yannick
> 
> On Thu, Aug 19, 2010 at 11:41 AM, Peter Ljunglöf <peter.ljunglof at gu.se> wrote:
> Hi Andrew (and others),
> 
> On 15 Aug 2010, at 12:36, Hardie, Andrew wrote:
> 
> > Thanks for the patch. You're right, the current situation looks very messy and needs to change; I didn't touch the print modules at all in the recent updates to CQP other than to check for Windows incompatibilities. I know that Stefan wanted to do a major overhaul of ALL the print-output modes (if I recall correctly, the plan is to cut them back to two: "plain text" and SGML)
> 
> Sounds like a good idea to me. And make SGML output XML-compatible.
> 
> Also, if you make ALL commands output SGML/XML (if that is the current print-mode), you can get rid of the PrettyPrint flag. It will be much easier to write a new front-end to CWB then.
> 
> > and there is certainly lots of ugliness in the current SGML setup - not just the use of / as a separator, but also the use of HTML-style tags for tables, and the fact that the SGML is not XML-compatible though it easily could be by adding a few end-tags and quotes around att-vals. But the attribute-divider change can be done on its own prior to major fiddling. I'll add it to the todo list!
> 
> I'd be happy if you could do this, since my Python wrapper depends on these (or similar) fixes.
> 
> /Peter
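
To make the "quotes around att-vals" part concrete, here is a minimal Python sketch of what a client could do in the meantime. The function name `quote_attribute_values` is my own invention, and the sketch assumes that already-quoted values never contain whitespace followed by something that looks like `key=value`. It only adds the missing quotes; it does not add the missing end-tags.

```python
import re

# Anything that looks like a tag: "<", then no angle brackets, then ">".
TAG_RE = re.compile(r"<[^<>]+>")
# An unquoted attribute inside a tag: whitespace, name, "=", bare value.
BARE_ATTR_RE = re.compile(r'(\s[\w-]+)=([^\s"\'<>]+)')


def quote_attribute_values(sgml):
    """Put double quotes around unquoted attribute values inside tags."""
    def fix_tag(match):
        # Only rewrite text inside the tag, never the text between tags.
        return BARE_ATTR_RE.sub(r'\1="\2"', match.group(0))
    return TAG_RE.sub(fix_tag, sgml)


header = '<attribute type=positional name="word" anr=0>'
print(quote_attribute_values(header))
```

Text between tags (such as a token containing "=") is left alone, because the substitution is applied only to the matched tags.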
> 
> 
> >> -----Original Message-----
> >> From: cwb-bounces at sslmit.unibo.it
> >> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Peter Ljunglöf
> >> Sent: 14 August 2010 07:15
> >> To: cwb at sslmit.unibo.it
> >> Subject: [CWB] CQP output with PrintMode SGML
> >>
> >> Hi developers,
> >>
> >> I have started to write a Python wrapper to CQP, and it seems
> >> to me that the best KWIC output format is SGML. Then I can
> >> extract the KWIC information into a Python object.
> >>
> >> However, there are two (or three) problems with SGML output
> >> which make this difficult:
> >>
> >> 1. The attribute separator is "/", which is a problem when
> >> the word or an attribute contains "/", e.g.:
> >>
> >> CORPUS> show +lemma;
> >> CORPUS> "1/2";
> >> <CONCORDANCE>
> >> <attribute type=positional name="word" anr=0> <attribute
> >> type=positional name="lemma" anr=1>
> >> <LINE>(...)<CONTENT>(...)
> >> <MATCH><TOKEN>1/2/1/2</TOKEN></MATCH> (...)</CONTENT></LINE>
> >> (...)
> >> </CONCORDANCE>
> >>
> >> As you can see, it's impossible to extract the attributes
> >> from the SGML. My suggestion is to use "<ATTR>" as the
> >> attribute separator instead, which will work since "<" and
> >> ">" are SGML escaped.
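
The ambiguity in point 1 can be made concrete with a short Python sketch. The helper name `candidate_splits` and the two-attribute setup (word plus lemma) are assumptions for illustration; only the "/" separator and the proposed "<ATTR>" separator come from the message above.

```python
def candidate_splits(token_text, n_attrs, sep="/"):
    """Return every way to split token_text into n_attrs fields on sep."""
    if n_attrs == 1:
        return [[token_text]]
    parts = token_text.split(sep)
    results = []
    # Choose where the first field ends, then recurse on the remainder.
    for i in range(1, len(parts) - n_attrs + 2):
        head = sep.join(parts[:i])
        rest = sep.join(parts[i:])
        for tail in candidate_splits(rest, n_attrs - 1, sep):
            results.append([head] + tail)
    return results


# With "/" there are several equally plausible (word, lemma) readings.
for reading in candidate_splits("1/2/1/2", 2):
    print(reading)

# With "<ATTR>" only one reading remains.
print(candidate_splits("1/2<ATTR>1/2", 2, sep="<ATTR>"))
```

With "/" the client has no way to pick the right reading. With "<ATTR>" the split is unique as long as "<" and ">" inside attribute values are escaped, as described above.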
> >>
> >> 2. When using an aligned corpus, the SGML in the aligned text
> >> is escaped:
> >>
> >> CORPUS_SWE> show +corpus_nld -lemma;
> >> CORPUS_SWE> "veranda";
> >> <CONCORDANCE>
> >> <attribute type=positional name="word" anr=0>
> >> <LINE>(...)<CONTENT> (...)
> >> <MATCH><TOKEN>veranda</TOKEN></MATCH> (...)</CONTENT></LINE>
> >> <align name="saltnld_nld">&lt;CONTENT&gt; (...)
> >> &lt;TOKEN&gt;veranda&lt;/TOKEN&gt; (...)
> >> &lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;
> >> (...)
> >> </CONCORDANCE>
> >>
> >> My suggestion is of course that the aligned text should not
> >> be escaped. Also, an "</align>" should be printed at the end.
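
Until that is changed, a client can work around point 2 by undoing the escaping itself. The following is only a sketch: the name `aligned_tokens` is hypothetical, and it assumes the aligned line was escaped exactly once with the standard entities (&lt;, &gt;, &amp;, &quot;).

```python
import html
import re

# Token markup as it appears in the unescaped CONTENT element.
TOKEN_RE = re.compile(r"<TOKEN>(.*?)</TOKEN>", re.DOTALL)


def aligned_tokens(escaped_line):
    """Undo one level of entity escaping, then extract the token texts."""
    markup = html.unescape(escaped_line)
    return TOKEN_RE.findall(markup)


line = ("&lt;CONTENT&gt; &lt;TOKEN&gt;veranda&lt;/TOKEN&gt; "
        "&lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;")
print(aligned_tokens(line))
```

This breaks down if a token itself contains text that looks like markup once unescaped, which is one more reason to fix the output at the source rather than in every client.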
> >>
> >> 3. A smaller problem (and not a bug at all) is that the rows
> >> in the group output are contextual:
> >>
> >> CORPUS> X = "de" [];
> >> CORPUS> group X matchend lemma by match pos cut 50;
> >> <TABLE>
> >> <TR><TD>DT<TD>__UNDEF__<TD>152</TR>
> >> <TR><TD>PN<TD>vara<TD>146</TR>
> >> <TR><TD>&nbsp;<TD>ha<TD>117</TR>
> >> <TR><TD>&nbsp;<TD>skola<TD>100</TR>
> >> <TR><TD>&nbsp;<TD>inte<TD>89</TR>
> >> <TR><TD>&nbsp;<TD>komma<TD>80</TR>
> >> <TR><TD>DT<TD>mången<TD>71</TR>
> >> <TR><TD>&nbsp;<TD>där<TD>61</TR>
> >> <TR><TD>&nbsp;<TD>andra,annan,två<TD>52</TR>
> >> </TABLE>
> >>
> >> The 3rd row 1st column contains "&nbsp;", which is a way of
> >> saying "the same as above". This is okay for ascii output and
> >> HTML output, but SGML is designed for computer readability,
> >> so personally I think that it shouldn't refer to earlier
> >> rows. This is similar to "PrettyPrint off", which only works for
> >> "PrintMode ascii"...
> >>
> >> My suggestion is that the group printer only prints &nbsp; if
> >> PrettyPrint is on.
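
For comparison, this is roughly what a client has to do today to read the group table in point 3. It is a sketch under assumptions: `parse_group_table` is a made-up name, each row is taken to have exactly three cells as in the example above, and only the first column is assumed to use "&nbsp;" as a "same as above" marker.

```python
import re

ROW_RE = re.compile(r"<TR>(.*?)</TR>", re.DOTALL)


def parse_group_table(sgml):
    """Return (source, target, count) rows with '&nbsp;' cells filled in."""
    rows = []
    last_source = None
    for row in ROW_RE.findall(sgml):
        # Cells are introduced by <TD> and have no closing tag.
        source, target, count = row.split("<TD>")[1:]
        if source == "&nbsp;":
            # "&nbsp;" means "same source value as the previous row".
            source = last_source
        last_source = source
        rows.append((source, target, int(count)))
    return rows


table = """<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD>&nbsp;<TD>ha<TD>117</TR>
</TABLE>"""
for row in parse_group_table(table):
    print(row)
```

The parser has to carry state from one row to the next, and it cannot tell a "same as above" cell from a value that is genuinely a non-breaking space. Printing the value on every row would remove both issues.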
> >>
> >> 4. I did some digging in the source code, and it was pretty
> >> easy to make the necessary changes. (Kudos to the programmers
> >> for making the code readable.) Only 4 lines are affected;
> >> here's a diff:
> >>
> >> Index: sgml-print.c
> >> ===================================================================
> >> --- sgml-print.c     (revision 182)
> >> +++ sgml-print.c     (working copy)
> >> @@ -77,7 +77,7 @@
> >>
> >>   "<TOKEN>",                    /* BeforeToken */
> >>   " ",                          /* TokenSeparator */
> >> -  "/",                          /* AttributeSeparator */
> >> +  "<ATTR>",                     /* AttributeSeparator */
> >>   "</TOKEN>",                   /* AfterToken */
> >>
> >>   "<CONTENT>",                  /* BeforeField */
> >> @@ -213,7 +213,8 @@
> >>   sgml_puts(stream, "<align name=\"", 0);
> >>   sgml_puts(stream, attribute_name, 0);
> >>   sgml_puts(stream, "\">", 0);
> >> -  sgml_puts(stream, line, SUBST_ALL);
> >> +  sgml_puts(stream, line, 0);
> >> +  sgml_puts(stream, "</align>", 0);
> >>
> >>   fputc('\n', stream);
> >> }
> >> @@ -431,7 +432,7 @@
> >>
> >>     source_id = group->count_cells[cell].s;
> >>
> >> -    if (source_id != last_source_id) {
> >> +    if (!pretty_print || (source_id != last_source_id)) {
> >>       last_source_id = source_id;
> >>       sgml_puts(fd, Group_id2str(group, source_id, 0), SUBST_ALL);
> >>       nr_targets = 0;
> >>
> >>
> >> best,
> >> /Peter Ljunglöf
> >>
> >> ________________________________________________________________________________
> >> peter ljunglöf, språkbanken, göteborgs universitet
> >>
> >>
> >> _______________________________________________
> >> CWB mailing list
> >> CWB at sslmit.unibo.it
> >> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> >>

________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet