<div dir="ltr"><div class="gmail_quote"><div dir="ltr"><div>Hello CWB experts,<br></div><div><br></div><div>We would like to import CoNLL-U formatted corpora into Corpus Workbench v3.4.33, running under Ubuntu 20.04.4 LTS. The CoNLL-U file is an excerpt from a corpus processed with Stanford <a href="https://stanfordnlp.github.io/stanza/" target="_blank">Stanza</a>.</div><div><br></div><div>We succeeded in encoding and indexing a small sample corpus, test.conllu, but our CQP queries return no matches, even for words that occur in the file. See the details below and the attached test.conllu file.</div><div><br></div><div><font face="arial, sans-serif">We also noticed that with cwb-encode, <span style="color:rgb(23,43,77)">adding the -N id option triggers the following error (as does replacing it with just -n):</span></font></div><p style="margin:10px 0px 0px 40px;padding:0px;color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,&quot;Segoe UI&quot;,Roboto,Oxygen,Ubuntu,&quot;Fira Sans&quot;,&quot;Droid Sans&quot;,&quot;Helvetica Neue&quot;,sans-serif;font-size:14px"><strong><code style="font-family:SFMono-Medium,&quot;SF Mono&quot;,&quot;Segoe UI Mono&quot;,&quot;Roboto Mono&quot;,&quot;Ubuntu Mono&quot;,Menlo,Courier,monospace">Invalid input line [1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   _       start_char=0|end_char=1], encoding aborted<br>[location of error: file test.conllu, line #4]</code></strong></p><div><br></div><div>Do you have any thoughts on how we could resolve these problems? 
</div><div><br></div><div>Thanks!</div><div><br></div><div>--<br><div dir="ltr"><div dir="ltr">Bruce McKee<div>Research Systems Consultant</div><div>System Administrator for the Phonetics &amp; Computational Linguistics Lab</div><div>Department of Linguistics, Cornell University</div><div></div></div></div></div><div><br></div><div>--------------------------------------------------------------------------------------------------------------------------------------------------------</div><div><br></div><div><b>====================================</b></div><div><b>Our Encoding and indexing commands</b></div><div><b>====================================</b></div><div><br></div><div><font face="monospace">DATA=/home/smith/cwb/data/test<br>REGISTRY=/home/smith/cwb/registry<br>INDEX=/home/smith/cwb/registry/test<br></font></div><div><font face="monospace"><br></font></div><div><font face="monospace"><span style="color:rgb(23,43,77)">mkdir $DATA</span><br></font></div><div><span style="color:rgb(23,43,77)"><font face="monospace"><br></font></span></div><div><font face="monospace">cwb-encode -f test.conllu -d $DATA -R $INDEX -c ascii -L s -P lemma -P upos -P xpos -P feats -P head -P deprel -P deps -P misc<br></font></div><div><span style="color:rgb(23,43,77)"><font face="monospace"><br></font></span></div><div><span style="color:rgb(23,43,77)"><font face="monospace">cwb-make -r $REGISTRY -V TEST</font></span></div><div><div><b><br></b></div><div><b>====================================</b></div><div><b>Corpus Description command</b></div><div><b>====================================</b></div></div><div><span style="color:rgb(23,43,77)"><font face="monospace"><br></font></span></div><div><font face="monospace">cwb-describe-corpus -r $REGISTRY TEST<br><br>============================================================<br>Corpus: TEST<br>============================================================<br><br>description:<br>registry file:  /home/smith/cwb/registry/test<br>home 
directory: /home/smith/cwb/data/test/<br>info file:      /home/smith/cwb/data/test/.info<br>encoding:       ascii<br>size (tokens):  69<br><br>  9 positional attributes:<br>      word            lemma           upos            xpos<br>      feats           head            deprel          deps<br>      misc<br><br>  1 structural attributes:<br>      s<br><br>  0 alignment  attributes:</font><br></div><div><font face="monospace"><br></font></div><div><div><b><br>===============================================================</b></div><div><b>cqp word searches (default cqp startup commands are in the .cqprc file)</b></div><div><b>===============================================================</b></div></div><div><font face="monospace"><br></font></div><div><div><font face="monospace">$ cqp -e<br>System corpora:<br> E: EXAMPLE<br> T: TEST<br>[no corpus]&gt; TEST;<br>TEST&gt; info;<br>Size:    69<br>Charset: ascii<br>Properties:<br>        language = &#39;??&#39;<br>        charset = &#39;ascii&#39;<br><br>No further information available about TEST<br>TEST&gt; show cd;<br>===Context Descriptor=======================================<br><br>left context:     25 characters<br>right context:    25 characters<br>corpus position:  shown<br>target anchors:   not shown<br><br>Positional Attributes:  * word<br>                          lemma<br>                          upos<br>                          xpos<br>                          feats<br>                          head<br>                          deprel<br>                          deps<br>                          misc<br><br>Structural Attributes:    s<br><br>Aligned Corpora:          &lt;none&gt;<br><br>============================================================<br>TEST&gt; &quot;same&quot;<br>0 matches.<br>TEST&gt; &quot;getting&quot;<br>0 matches.<br>TEST&gt; [ lemma=&quot;the&quot; ];<br>0 matches.<br>TEST&gt;</font></div></div><div><br></div><div><font 
face="monospace"><b>===========================================================</b></font></div><div><font face="monospace"><b>Test file test.conllu (also attached to this e-mail)</b></font></div><div><font face="monospace"><b>===========================================================</b></font></div><div><font face="monospace"><b><br></b></font></div><div><font face="monospace"># newdoc id = pcc_eng_test_1.0001_x00002<br># sent_id = pcc_eng_test_1.0001_x00002_1<br># text = I&#39;m getting about the same thing trying to update &quot;tf&quot; (team fortress 2) on Ubuntu 7.10 (just updated it yesterday).<br>1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   _       start_char=0|end_char=1<br>2       &#39;m      be      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        3       aux     _       start_char=1|end_char=3<br>3       getting get     VERB    VBG     Tense=Pres|VerbForm=Part        0       root    _       start_char=4|end_char=11<br>4       about   about   ADV     RB      _       7       advmod  _       start_char=12|end_char=17<br>5       the     the     DET     DT      Definite=Def|PronType=Art       7       det     _       start_char=18|end_char=21<br>6       same    same    ADJ     JJ      Degree=Pos      7       amod    _       start_char=22|end_char=26<br>7       thing   thing   NOUN    NN      Number=Sing     3       obj     _       start_char=27|end_char=32<br>8       trying  try     VERB    VBG     VerbForm=Ger    7       acl     _       start_char=33|end_char=39<br>9       to      to      PART    TO      _       10      mark    _       start_char=40|end_char=42<br>10      update  update  VERB    VB      VerbForm=Inf    8       xcomp   _       start_char=43|end_char=49<br>11      &quot;       &quot;       PUNCT   ``      _       12      punct   _       start_char=50|end_char=51<br>12      tf      tf      NOUN    NN      Number=Sing     10      obj     _       start_char=51|end_char=53<br>13   
   &quot;       &quot;       PUNCT   &#39;&#39;      _       12      punct   _       start_char=53|end_char=54<br>14      (       (       PUNCT   -LRB-   _       16      punct   _       start_char=55|end_char=56<br>15      team    team    NOUN    NN      Number=Sing     16      compound        _       start_char=56|end_char=60<br>16      fortress        fortress        NOUN    NN      Number=Sing     12      appos   _       start_char=61|end_char=69<br>17      2       2       NUM     LS      NumType=Card    16      nummod  _       start_char=70|end_char=71<br>18      )       )       PUNCT   -RRB-   _       16      punct   _       start_char=71|end_char=72<br>19      on      on      ADP     IN      _       20      case    _       start_char=73|end_char=75<br>20      Ubuntu  Ubuntu  PROPN   NNP     Number=Sing     10      obl     _       start_char=76|end_char=82<br>21      7.10    7.10    NUM     CD      NumType=Card    20      nummod  _       start_char=83|end_char=87<br>22      (       (       PUNCT   -LRB-   _       24      punct   _       start_char=88|end_char=89<br>23      just    just    ADV     RB      _       24      advmod  _       start_char=89|end_char=93<br>24      updated update  VERB    VBD     Tense=Past|VerbForm=Part        3       parataxis       _       start_char=94|end_char=101<br>25      it      it      PRON    PRP     Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  24      obj     _       start_char=102|end_char=104<br>26      yesterday       yesterday       NOUN    NN      Number=Sing     24      obl:tmod        _       start_char=105|end_char=114<br>27      )       )       PUNCT   -RRB-   _       24      punct   _       start_char=114|end_char=115<br>28      .       .       PUNCT   .       
_       3       punct   _       start_char=115|end_char=116<br> <br># sent_id = pcc_eng_test_1.0001_x00002_2<br># text = DSL connection near Seattle, WA.<br>1       DSL     dsl     NOUN    NN      Number=Sing     2       compound        _       start_char=117|end_char=120<br>2       connection      connection      NOUN    NN      Number=Sing     0       root    _       start_char=121|end_char=131<br>3       near    near    ADP     IN      _       4       case    _       start_char=132|end_char=136<br>4       Seattle Seattle PROPN   NNP     Number=Sing     2       nmod    _       start_char=137|end_char=144<br>5       ,       ,       PUNCT   ,       _       4       punct   _       start_char=144|end_char=145<br>6       WA      WA      PROPN   NNP     Number=Sing     4       appos   _       start_char=146|end_char=148<br>7       .       .       PUNCT   .       _       2       punct   _       start_char=148|end_char=149<br> <br> <br># sent_id = pcc_eng_test_1.0001_x00002_3<br># text = Come to think of it, might have been &quot;Connection Closed&quot;, I&#39;ll have to check when I&#39;m home in 10 hours.<br>1       Come    come    VERB    VB      Mood=Imp|VerbForm=Fin   0       root    _       start_char=150|end_char=154<br>2       to      to      PART    TO      _       3       mark    _       start_char=155|end_char=157<br>3       think   think   VERB    VB      VerbForm=Inf    1       xcomp   _       start_char=158|end_char=163<br>4       of      of      ADP     IN      _       5       case    _       start_char=164|end_char=166<br>5       it      it      PRON    PRP     Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  3       obl     _       start_char=167|end_char=169<br>6       ,       ,       PUNCT   ,       _       11      punct   _       start_char=169|end_char=170<br>7       might   might   AUX     MD      VerbForm=Fin    11      aux     _       start_char=171|end_char=176<br>8       have    have    AUX     VB      VerbForm=Inf    11      aux     _    
   start_char=177|end_char=181<br>9       been    be      AUX     VBN     Tense=Past|VerbForm=Part        11      cop     _       start_char=182|end_char=186<br>10      &quot;       &quot;       PUNCT   ``      _       11      punct   _       start_char=187|end_char=188<br>11      Connection      connection      NOUN    NN      Number=Sing     1       parataxis       _       start_char=188|end_char=198<br>12      Closed  close   VERB    VBN     Tense=Past|VerbForm=Part        11      acl     _       start_char=199|end_char=205<br>13      &quot;       &quot;       PUNCT   &#39;&#39;      _       11      punct   _       start_char=205|end_char=206<br>14      ,       ,       PUNCT   ,       _       1       punct   _       start_char=206|end_char=207<br>15      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      17      nsubj   _       start_char=208|end_char=209<br>16      &#39;ll     will    AUX     MD      VerbForm=Fin    17      aux     _       start_char=209|end_char=212<br>17      have    have    VERB    VB      VerbForm=Inf    1       parataxis       _       start_char=213|end_char=217<br>18      to      to      PART    TO      _       19      mark    _       start_char=218|end_char=220<br>19      check   check   VERB    VB      VerbForm=Inf    17      xcomp   _       start_char=221|end_char=226<br>20      when    when    SCONJ   WRB     PronType=Int    23      mark    _       start_char=227|end_char=231<br>21      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      23      nsubj   _       start_char=232|end_char=233<br>22      &#39;m      be      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        23      cop     _       start_char=233|end_char=235<br>23      home    home    ADV     RB      _       19      advcl   _       start_char=236|end_char=240<br>24      in      in      ADP     IN      _       26      case    _       start_char=241|end_char=243<br>25      10      10      NUM     CD      
NumType=Card    26      nummod  _       start_char=244|end_char=246<br>26      hours   hour    NOUN    NNS     Number=Plur     23      obl     _       start_char=247|end_char=252<br>27      .       .       PUNCT   .       _       1       punct   _       start_char=252|end_char=253</font><br></div></div></div></div>
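<div>For anyone trying to reproduce this, here is a minimal sketch of a sanity check that could be run on test.conllu before encoding. It rests on one assumption taken from the CoNLL-U format itself: every token line should carry exactly 10 tab-separated fields, comment lines start with "#", and blank lines separate sentences. The function name find_malformed_token_lines is hypothetical and is not part of CWB or Stanza. It says nothing about how cwb-encode interprets the columns; it only reports lines whose column layout differs from what the format prescribes, for example if tabs were turned into spaces somewhere along the way.</div>

```python
def find_malformed_token_lines(text, expected_fields=10):
    """Return (line_number, field_count) pairs for token lines that do not
    have exactly `expected_fields` tab-separated fields.

    Assumption: the input follows CoNLL-U conventions, so comment lines
    start with '#' and empty or whitespace-only lines separate sentences.
    """
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        field_count = len(line.split("\t"))
        if field_count != expected_fields:
            problems.append((line_number, field_count))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_conllu.py test.conllu
    with open(sys.argv[1], encoding="utf-8") as handle:
        for line_number, field_count in find_malformed_token_lines(handle.read()):
            print(f"line {line_number}: {field_count} field(s) instead of 10")
```

<div>If the script prints nothing, the column layout matches the format, and the cause of the failed queries would have to be looked for elsewhere, for instance in how the columns are mapped to positional attributes.</div>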