Hi Peter,<div><br></div><div>Just as an aside: have you tried the Python wrapper at</div><div><a href="http://bitbucket.org/yannick/cwb-python/overview">http://bitbucket.org/yannick/cwb-python/overview</a>?</div><div>It talks directly to the low-level CWB code, without any pipe in between,</div>
<div>and would let you retrieve tokens directly.</div><div>This means that you have to do some work yourself (i.e., getting</div><div>the offset information from the CQP search subprocess and deciding</div><div>how much context to display), but it's more flexible overall.</div>
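<div><br></div><div>A minimal sketch of the "deciding how much context to display" step, in plain Python. It assumes the tokens are already available as a list and that the match offsets are inclusive corpus positions; the function name kwic and the sample sentence are made up for illustration and are not part of the wrapper.</div>
<pre>
```python
from typing import List, Tuple


def kwic(tokens: List[str], start: int, end: int,
         context: int = 5) -> Tuple[List[str], List[str], List[str]]:
    """Split a token list into (left context, match, right context).

    start and end are assumed to be inclusive positions of the match;
    context is the number of tokens to keep on each side.
    """
    left = tokens[max(0, start - context):start]
    match = tokens[start:end + 1]
    right = tokens[end + 1:end + 1 + context]
    return left, match, right


# Hypothetical data: in practice the tokens would come from the corpus
# and (start, end) from the CQP query result.
tokens = "we sat on the veranda all evening".split()
left, match, right = kwic(tokens, 4, 4, context=2)
print(" ".join(left), "[" + " ".join(match) + "]", " ".join(right))
```
</pre>
<div>The context width is an ordinary parameter here, which is the flexibility mentioned above: the caller decides per query how much to show.</div>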
<div><br></div><div>Best,</div><div>Yannick<br><br><div class="gmail_quote">On Thu, Aug 19, 2010 at 11:41 AM, Peter Ljunglöf <span dir="ltr"><<a href="mailto:peter.ljunglof@gu.se">peter.ljunglof@gu.se</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi Andrew (and others),<br>
<br>
On 15 Aug 2010, at 12:36, Hardie, Andrew wrote:<br>
<div class="im"><br>
> Thanks for the patch. You're right, the current situation looks very messy and needs to change; I didn't touch the print modules at all in the recent updates to CQP other than to check for Windows incompatibilities. I know that Stefan wanted to do a major overhaul of ALL the print-output modes (if I recall correctly, the plan is cutting them back to two - "plain text" and SGML)<br>
<br>
</div>Sounds like a good idea to me. And make SGML output XML-compatible.<br>
<br>
Also, if you make ALL commands output SGML/XML (if that is the current print-mode), you can get rid of the PrettyPrint flag. It will be much easier to write a new front-end to CWB then.<br>
<div class="im"><br>
> and there is certainly lots of ugliness in the current SGML setup - not just the use of / as a separator, but also the use of HTML-style tags for tables, and the fact that the SGML is not XML-compatible though it easily could be by adding a few end-tags and quotes around att-vals. But the attribute-divider change can be done on its own prior to major fiddling. I'll add it to the todo list!<br>
<br>
</div>I'd be happy if you could do this, since my Python wrapper depends on these (or similar) fixes.<br>
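<br>
To illustrate why the wrapper depends on the separator change, here is a minimal sketch. The helper name possible_splits is made up for this example; it lists every way a printed token could be divided into a given number of attribute values.<br>
<pre>
```python
from itertools import combinations


def possible_splits(token: str, sep: str, n_attrs: int):
    """Return every way to cut token into n_attrs fields at occurrences of sep."""
    parts = token.split(sep)
    gaps = range(1, len(parts))  # positions between parts where a cut may go
    results = []
    for cuts in combinations(gaps, n_attrs - 1):
        bounds = (0,) + cuts + (len(parts),)
        fields = tuple(sep.join(parts[a:b]) for a, b in zip(bounds, bounds[1:]))
        results.append(fields)
    return results


# Word "1/2" with lemma "1/2", printed with "/" as the separator:
# three candidate readings, so the attributes cannot be recovered.
print(possible_splits("1/2/1/2", "/", 2))

# The same token printed with the proposed "<ATTR>" separator:
# exactly one reading.
print(possible_splits("1/2<ATTR>1/2", "<ATTR>", 2))
```
</pre>
With "/" the parser has to guess among several readings; with a separator that cannot occur in the escaped values there is only one.<br>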
<font color="#888888"><br>
/Peter<br>
</font><div><div></div><div class="h5"><br>
<br>
>> -----Original Message-----<br>
>> From: <a href="mailto:cwb-bounces@sslmit.unibo.it">cwb-bounces@sslmit.unibo.it</a><br>
>> [mailto:<a href="mailto:cwb-bounces@sslmit.unibo.it">cwb-bounces@sslmit.unibo.it</a>] On Behalf Of Peter Ljunglöf<br>
>> Sent: 14 August 2010 07:15<br>
>> To: <a href="mailto:cwb@sslmit.unibo.it">cwb@sslmit.unibo.it</a><br>
>> Subject: [CWB] CQP output with PrintMode SGML<br>
>><br>
>> Hi developers,<br>
>><br>
>> I have started to write a Python wrapper to CQP, and it seems<br>
>> to me that the best KWIC output format is SGML. Then I can<br>
>> extract the KWIC information into a Python object.<br>
>><br>
>> However, there are two (or three) problems with SGML output<br>
>> which make this difficult:<br>
>><br>
>> 1. The attribute separator is "/", which is a problem when<br>
>> the word or an attribute contains "/", e.g.:<br>
>><br>
>> CORPUS> show +lemma;<br>
>> CORPUS> "1/2";<br>
>> <CONCORDANCE><br>
>> <attribute type=positional name="word" anr=0> <attribute<br>
>> type=positional name="lemma" anr=1><br>
>> <LINE>(...)<CONTENT>(...)<br>
>> <MATCH><TOKEN>1/2/1/2</TOKEN></MATCH> (...)</CONTENT></LINE><br>
>> (...)<br>
>> </CONCORDANCE><br>
>><br>
>> As you can see, it's impossible to extract the attributes<br>
>> from the SGML. My suggestion is to use "<ATTR>" as the<br>
>> attribute separator instead, which will work since "<" and<br>
>> ">" are SGML escaped.<br>
>><br>
>> 2. When using an aligned corpus, the SGML in the aligned text<br>
>> is escaped:<br>
>><br>
>> CORPUS_SWE> show +corpus_nld -lemma;<br>
>> CORPUS_SWE> "veranda";<br>
>> <CONCORDANCE><br>
>> <attribute type=positional name="word" anr=0><br>
>> <LINE>(...)<CONTENT> (...)<br>
>> <MATCH><TOKEN>veranda</TOKEN></MATCH> (...)</CONTENT></LINE><br>
>> <align name="saltnld_nld">&lt;CONTENT&gt; (...)<br>
>> &lt;TOKEN&gt;veranda&lt;/TOKEN&gt; (...)<br>
>> &lt;TOKEN&gt;.&lt;/TOKEN&gt; &lt;/CONTENT&gt;<br>
>> (...)<br>
>> </CONCORDANCE><br>
>><br>
>> My suggestion is of course that the aligned text should not<br>
>> be escaped. Also, an "</align>" should be printed at the end.<br>
>><br>
>> 3. A smaller problem (and not a bug at all) is that the rows<br>
>> in the group output are contextual:<br>
>><br>
>> CORPUS> X = "de" [];<br>
>> CORPUS> group X matchend lemma by match pos cut 50;<br>
>> <TABLE><br>
>> <TR><TD>DT<TD>__UNDEF__<TD>152</TR><br>
>> <TR><TD>PN<TD>vara<TD>146</TR><br>
>> <TR><TD>&nbsp;<TD>ha<TD>117</TR><br>
>> <TR><TD>&nbsp;<TD>skola<TD>100</TR><br>
>> <TR><TD>&nbsp;<TD>inte<TD>89</TR><br>
>> <TR><TD>&nbsp;<TD>komma<TD>80</TR><br>
>> <TR><TD>DT<TD>mången<TD>71</TR><br>
>> <TR><TD>&nbsp;<TD>där<TD>61</TR><br>
>> <TR><TD>&nbsp;<TD>andra,annan,två<TD>52</TR><br>
>> </TABLE><br>
>><br>
>> The 3rd row 1st column contains "&nbsp;", which is a way of<br>
>> saying "the same as above". This is okay for ascii output and<br>
>> HTML output, but SGML is designed for computer readability,<br>
>> so personally I think that it shouldn't refer to earlier<br>
>> rows. Similar to "PrettyPrint off", which only works for<br>
>> "PrintMode ascii"...<br>
>><br>
>> My suggestion is that the group printer only prints &nbsp; if<br>
>> PrettyPrint is on.<br>
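<br>
Until the printer changes, a client can undo the "same as above" convention itself. The following is a minimal sketch, assuming the group output has exactly the three-column row shape shown above; the function name parse_group_table is hypothetical.<br>
<pre>
```python
import re

# One row of the group output: source, target, count.
ROW = re.compile(r"<TR><TD>(.*?)<TD>(.*?)<TD>(\d+)</TR>")


def parse_group_table(sgml: str):
    """Parse group rows into (source, target, count) tuples.

    A "&nbsp;" source cell is replaced by the source of the previous row.
    """
    rows = []
    last_source = None
    for source, target, count in ROW.findall(sgml):
        if source == "&nbsp;":
            source = last_source
        last_source = source
        rows.append((source, target, int(count)))
    return rows


sample = """<TABLE>
<TR><TD>DT<TD>__UNDEF__<TD>152</TR>
<TR><TD>PN<TD>vara<TD>146</TR>
<TR><TD>&nbsp;<TD>ha<TD>117</TR>
</TABLE>"""
print(parse_group_table(sample))
```
</pre>
This works, but every client has to repeat it, which is the argument for printing the source value on every row when PrettyPrint is off.<br>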
>><br>
>> 4. I did some digging in the source code, and it was pretty<br>
>> easy to do the necessary changes. (Kudos to the programmers<br>
>> for making the code readable). Only 4 lines are affected,<br>
>> here's a diff:<br>
>><br>
>> Index: sgml-print.c<br>
>> ===================================================================<br>
>> --- sgml-print.c (revision 182)<br>
>> +++ sgml-print.c (working copy)<br>
>> @@ -77,7 +77,7 @@<br>
>><br>
>> "<TOKEN>", /* BeforeToken */<br>
>> " ", /* TokenSeparator */<br>
>> - "/", /* AttributeSeparator */<br>
>> + "<ATTR>", /* AttributeSeparator */<br>
>> "</TOKEN>", /* AfterToken */<br>
>><br>
>> "<CONTENT>", /* BeforeField */<br>
>> @@ -213,7 +213,8 @@<br>
>> sgml_puts(stream, "<align name=\"", 0);<br>
>> sgml_puts(stream, attribute_name, 0);<br>
>> sgml_puts(stream, "\">", 0);<br>
>> - sgml_puts(stream, line, SUBST_ALL);<br>
>> + sgml_puts(stream, line, 0);<br>
>> + sgml_puts(stream, "</align>", 0);<br>
>><br>
>> fputc('\n', stream);<br>
>> }<br>
>> @@ -431,7 +432,7 @@<br>
>><br>
>> source_id = group->count_cells[cell].s;<br>
>><br>
>> - if (source_id != last_source_id) {<br>
>> + if (!pretty_print || (source_id != last_source_id)) {<br>
>> last_source_id = source_id;<br>
>> sgml_puts(fd, Group_id2str(group, source_id, 0), SUBST_ALL);<br>
>> nr_targets = 0;<br>
>><br>
>><br>
>> best,<br>
>> /Peter Ljunglöf<br>
>><br>
>> ______________________________________________________________<br>
>> __________________<br>
>> peter ljunglöf, språkbanken, göteborgs universitet<br>
>><br>
>><br>
>> _______________________________________________<br>
>> CWB mailing list<br>
>> <a href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a><br>
>> <a href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb" target="_blank">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a><br>
>><br>
<br>
</div></div>________________________________________________________________________________<br>
<div><div></div><div class="h5">peter ljunglöf, språkbanken, göteborgs universitet<br>
<br>
<br>
</div></div></blockquote></div><br></div>