<div dir="ltr"><div>Hi Luigi,</div><div><br></div><div>IMHO, anything as specific as the CoNLL-U conventions for sentence metadata should be clearly distinguished from a &quot;general&quot; CoNLL importer/converter. If you just need sentence ids, a simple workaround would be to adopt the CoNLL-2015 approach and add the sentence id to every word as a separate column of the CoNLL(-U) file (cf. the trial data under <a href="https://www.cs.brandeis.edu/~clp/conll15st/dataset.html">https://www.cs.brandeis.edu/~clp/conll15st/dataset.html</a>, second column). <br></div><div><br></div><div>Other than that, a possible strategy for CoNLL formats in general (one that is also compatible with formats that encode intersentential relations as sentence offsets) would be to have *implicit* numerical sentence ids, i.e., the number of preceding sentences (as in CoNLL-2015) or the number of preceding sentences + 1 (analogous to word IDs). That could actually be a feature of a generic CoNLL importer, as it is not specific to any particular CoNLL dialect.</div><div><br></div><div>Best,</div><div>Christian<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 13 Apr 2021 at 15:11, Luigi Talamo &lt;<a href="mailto:talamo.luigi@gmail.com">talamo.luigi@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear all,<br>

I have recently used the new CoNLL feature and it worked like a charm.<br>

Using the latest version of CWB through a Docker image, I was able to<br>

encode a conllu file with the following command:<br>

<br>

cwb-encode -f greek-conllu-file -d /var/corpora/el_ciep/ -c utf8 -R<br>

/usr/local/share/cwb/registry/el_ciep -xsB -N id -L s -P lemma -P upos<br>

-P xpos -P feats -P head -P deprel -P deps -P misc<br>

<br>

(It is the Greek treebank of the multilingual and parallel corpus we<br>

are currently building here at Saarland University)<br>

<br>

I already know from Stefan&#39;s answers that # comment lines are ignored, but it<br>

would be nice to have at least the sentence id encoded - btw, how is<br>

the sentence boundary recognized if there are no XML tags in the<br>

conllu file?<br>

<br>

cheers,<br>

Luigi<br>

<br>

On Wed, Mar 3, 2021 at 10:19 PM Stefan Evert &lt;<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>&gt; wrote:<br>

&gt;<br>

&gt; Dear Christian and Maarten,<br>

&gt;<br>

&gt; thanks for your clarification questions, which made me realise that my announcement had obviously been misleading.  By CoNLL support I meant that CWB is able to read and write the general CoNLL-style format – i.e. TAB-separated token-level annotation with numeric IDs in the first column and sentences separated by blank lines – not that it directly supports any particular CoNLL flavours.<br>

&gt;<br>

&gt; CWB has always focused on maximal flexibility and it would go against this principle to fix the interpretation of specific columns.  It should be easy enough to write a small shell script or bash functions with suitable presets for different CoNLL formats.<br>

&gt;<br>

&gt; Unfortunately I&#39;ve never been able to find formal documentation for a general CoNLL format (nor for e.g. CoNLL-U), so it&#39;s quite possible that I&#39;ve overlooked some features; if so, I would hope to add them to cwb-encode.<br>

&gt;<br>

&gt;<br>

&gt; Regarding your specific questions:<br>

&gt;<br>

&gt; &gt; - Does that include the support of CoNLL-U metadata (in &quot;classical&quot; CoNLL, this is just skipped as a free-text comment, see <a href="https://universaldependencies.org/format.html#sentence-boundaries-and-comments" rel="noreferrer" target="_blank">https://universaldependencies.org/format.html#sentence-boundaries-and-comments</a> and <a href="https://universaldependencies.org/ext-format.html" rel="noreferrer" target="_blank">https://universaldependencies.org/ext-format.html</a>)<br>

&gt;<br>

&gt; These are just comment lines, and cwb-encode will ignore them – cf. the top section of <a href="https://universaldependencies.org/format.html" rel="noreferrer" target="_blank">https://universaldependencies.org/format.html</a>, which clearly says that there are only token lines, blank lines and comments.<br>

&gt;<br>

&gt; Further down, the remark &quot;the contents of the comments and metadata is basically unrestricted&quot; clarifies that it is impossible to index these comments in a meaningful way. :-)<br>

&gt;<br>

&gt; &gt; - Is this metadata/comment information preserved in (i.e., writable from) CWB?<br>

&gt;<br>

&gt; No, in this case pre-processing will be required to turn these lines into appropriate XML tags (which CoNLL should have done in the first place!).<br>

&gt;<br>

&gt; &gt; - Does that support the CoNLL-U encoding of multi-tokens (after lines with regular numerical IDs, say 1 and 2, you can add a multi-token line with ID 1-2 that describes the multi-token, see <a href="https://universaldependencies.org/format.html#words-tokens-and-empty-nodes" rel="noreferrer" target="_blank">https://universaldependencies.org/format.html#words-tokens-and-empty-nodes</a>)<br>

&gt;<br>

&gt; That doesn&#39;t fit into the CWB data model.  Actually, such input files will be rejected by cwb-encode because it requires the first column to be a number.<br>
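One possible pre-processing step, sketched under the assumption that multi-token range lines (IDs such as 1-2) and empty nodes (IDs such as 1.1) may simply be dropped before encoding. The function name drop_non_integer_ids is hypothetical, and whether discarding these lines is acceptable depends on the corpus:<br>
<pre>
```python
def drop_non_integer_ids(lines):
    """Keep comments, blank lines and token lines whose ID is a plain integer."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            # Blank lines (sentence breaks) and comments pass through unchanged
            kept.append(line)
            continue
        token_id = line.split("\t", 1)[0]
        # str.isdigit() is False for "1-2" (range) and "1.1" (empty node)
        if token_id.isdigit():
            kept.append(line)
    return kept


sample = [
    "1-2\tdel\t_",
    "1\tde\tde",
    "2\tel\tel",
    "",
    "1\tSue\tSue",
    "1.1\tlikes\tlike",
]
filtered = drop_non_integer_ids(sample)
print("\n".join(filtered))
```
</pre>
The surface form of the multi-token (here &quot;del&quot;) is lost this way; keeping it would need a separate convention, e.g. an extra column.<br>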

&gt;<br>

&gt; &gt; - I assume that CoNLL formats with SRL annotations aren&#39;t supported (because they come with a variable number of columns, potentially different for every sentence). This does include CoNLL-2004 and CoNLL-2005 formats (among others), as well as the current PropBank &quot;skel&quot; format (which differs from the CoNLL SRL formats by replacing words with placeholders).<br>

&gt;<br>

&gt; The columns have to be the same for the entire corpus, of course.  I don&#39;t think changing around columns arbitrarily would give a reliable input format.<br>

&gt;<br>

&gt; Missing fields at the end of a line are simply indexed as __UNDEF__ by cwb-encode (without warnings).<br>

&gt;<br>

&gt; &gt; - Do you support the IOB(ES) formats for writing chunks (or are they just interpreted as strings)? These have been part of various CoNLL formats since 1999 and are still commonly used for chunking and named entity annotation.<br>

&gt;<br>

&gt; They are read as a positional attribute in IOB notation, of course.  I would convert them to chunks (if desired) after indexing, with something like<br>

&gt;<br>

&gt;         cqpcl -D CORPUS &#39;A = (?longest) [iob = &quot;B&quot;] [iob = &quot;I&quot;]*; tabulate A match, matchend;&#39; | cwb-s-encode -d data_dir -S chunk<br>
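For comparison, the same IOB-to-region conversion can be sketched outside CQP, assuming one plain B/I/O tag per token in corpus order (no type suffixes such as B-NP). The function name iob_to_spans is hypothetical; its output is the kind of start/end corpus-position pairs that a tabulate of match and matchend prints:<br>
<pre>
```python
def iob_to_spans(tags):
    """Return (start, end) token positions (inclusive) for each chunk."""
    spans = []
    start = None
    for pos, tag in enumerate(tags):
        if tag == "B":
            # A new chunk begins; close any chunk that is still open
            if start is not None:
                spans.append((start, pos - 1))
            start = pos
        elif tag == "I":
            # Continue the open chunk; a stray I without a B is ignored here
            pass
        else:
            # O (or any other value) ends the open chunk
            if start is not None:
                spans.append((start, pos - 1))
                start = None
    if start is not None:
        # Chunk running up to the last token
        spans.append((start, len(tags) - 1))
    return spans


tags = ["O", "B", "I", "I", "O", "B", "B", "I"]
spans = iob_to_spans(tags)
for start, end in spans:
    print(f"{start}\t{end}")
```
</pre>
Single-token chunks (a B with no following I) are kept as regions of length one.<br>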

&gt;<br>

&gt; &gt; - Is there any support for PTB-style bracket formats (or are they just interpreted as strings)? They have been used for phrase-structure parsing and semantic role labelling in different CoNLL formats (and are, again, still part of the current PropBank &quot;skel&quot; format).<br>

&gt;<br>

&gt; They _are_ strings in the CoNLL format and are indexed as such.  In my view, CoNLL encodes neither chunks, nor phrase structure, nor dependency graphs – just text columns which can later be reinterpreted as such data structures.<br>

&gt;<br>

&gt; &gt; - Do you require TAB as column separator (as in most more recent CoNLL formats) or do you permit SPACE (as in the original CoNLL format) ? If the former, do you permit SPACE in tokens or annotations (traditionally, CoNLL formats don&#39;t, but with TAB-separated values, that is technically possible to occur)?<br>

&gt;<br>

&gt; CWB only accepts TAB-separated columns, so it&#39;s technically possible to have spaces in annotation values (but very much frowned upon).<br>

&gt;<br>

&gt; &gt; - Is there a strategy for escaping special characters, e.g., SPACE or TAB? Almost all CoNLL formats are TSV (i.e., CSV with TABs as separators), but I&#39;m not sure whether any of them uses the standard CSV conventions for this purpose -- partially, because they pre-date the CSV specification (<a href="https://tools.ietf.org/html/rfc4180" rel="noreferrer" target="_blank">https://tools.ietf.org/html/rfc4180</a>).<br>

&gt;<br>

&gt; Is TSV a well-defined format, i.e. a variant of CSV?<br>

&gt;<br>

&gt; But the purpose of TSV is that one doesn&#39;t have to mess around with quoted fields, so TABs and newlines can&#39;t be embedded in fields.  CWB also doesn&#39;t allow TABs or newlines in annotation values.<br>

&gt;<br>

&gt; &gt; Simply because of inherent limitations of the CWB3 data model, the answer to some of these questions is fairly obvious (as is how to overcome them with CWB4), so apologies for asking explicitly. But as &quot;support for CoNLL format&quot; can mean a lot of different things to potential users (depending on what data they&#39;re most familiar with), I would ask for this to be documented as part of appendix B and referred to in the manual where &quot;CoNLL format&quot; is introduced as a term.<br>

&gt;<br>

&gt; And CWB 4 will require much better (i.e. more explicit) input formats than CoNLL.<br>

&gt;<br>

&gt; But thanks for the recommendations, I&#39;ll try to remember until I have time to work on the manual again.  I think I explained my understanding of &quot;CoNLL-style format&quot; in the manpages, but I completely agree that &quot;full CoNLL support&quot; in the encoding tutorial is misleading.<br>

&gt;<br>

&gt; &gt; - I understand it takes empty lines as sentence breaks, but does it also do doc and s attributes? And what does it use for the p-attributes for the columns? (TEITOK uses what the standard describes: form, upos, xpos, feats, deprel, deps, head, and misc)<br>

&gt;<br>

&gt; As explained above, the attribute names for the columns have to be declared by the user running cwb-encode.  Since there is no structural annotation in CoNLL – only comment lines with poorly specified metadata format – the comments are ignored.<br>

&gt;<br>

&gt; &gt; - TEITOK also uses &lt;s&gt; (since it comes from TEI) but in UD they use sent - what was the motivation behind &lt;s&gt;? (I am trying to find out from the UD community whether &lt;s&gt; would be acceptable)<br>

&gt;<br>

&gt; cwb-encode actually still reads the .vrt input format, just with a few modifications to make it more CoNLL-friendly.  So you can encode XML tags in the normal way.<br>

&gt;<br>

&gt; &gt; - Is there also a CoNLL-U export, and if so, does that require anything special in the compiled corpus?<br>

&gt;<br>

&gt; CWB provides a full round-trip in the sense that if you encode all the columns as p-attributes and blank lines as sentence breaks, you can re-construct the input file with cwb-decode, except for the comment lines.<br>

&gt;<br>

&gt; Best &amp; thanks for your responses,<br>

&gt; Stefan<br>

&gt;<br>

&gt; _______________________________________________<br>

&gt; CWB mailing list<br>

&gt; <a href="mailto:CWB@sslmit.unibo.it" target="_blank">CWB@sslmit.unibo.it</a><br>

&gt; <a href="http://liste.sslmit.unibo.it/mailman/listinfo/cwb" rel="noreferrer" target="_blank">http://liste.sslmit.unibo.it/mailman/listinfo/cwb</a><br>


</blockquote></div>
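<div>The implicit numerical sentence ids suggested at the top of this message can be sketched as follows, assuming a CoNLL-style file with TAB-separated token lines, blank lines between sentences and #-comment lines. The function name add_sentence_ids and the choice to append the id as a last column (rather than insert it as the second column, as in the CoNLL-2015 trial data) are illustrative assumptions, not part of CWB or of any CoNLL specification:</div>
<pre>
```python
def add_sentence_ids(lines, first_id=0):
    """Append an implicit numerical sentence id to every token line.

    first_id=0 counts the preceding sentences (as in CoNLL-2015);
    first_id=1 gives 1-based ids analogous to word IDs.
    """
    out = []
    sent_id = first_id
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: sentence boundary (repeated blank lines count once)
            if in_sentence:
                sent_id += 1
                in_sentence = False
            out.append("")
        elif line.startswith("#"):
            # Comment line: passed through unchanged
            out.append(line)
        else:
            # Token line: add the current sentence id as an extra column
            out.append(line + "\t" + str(sent_id))
            in_sentence = True
    return out


sample = [
    "# sent_id = a",
    "1\tHello\thello",
    "2\tworld\tworld",
    "",
    "1\tBye\tbye",
]
result = add_sentence_ids(sample)
print("\n".join(result))
```
</pre>
<div>The extra column would then have to be declared to cwb-encode like any other column, under an attribute name chosen by the user.</div>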