[CWB] Need help importing CONLL-U files into CWB
Hardie, Andrew
a.hardie at lancaster.ac.uk
Wed Jan 11 11:56:42 CET 2023
Hi Bruce,
(advance general note: I believe Bruce's request is the last of the messages to this list that were on my email backlog, but if anyone else has asked a question that my catchup operation has overlooked, do feel free to ask again.)
There are a few different things going on here.
First, the input file that you attached to your query is not in the correct format for either CWB or CoNLL-U, because the columns are not TAB-delimited. Instead they are delimited with spaces. Probably this is due to a setting in your text editor to save tabs as spaces. Nothing will work with this file, because the whole line will be encoded as a single column (the "word" column) and all the other columns treated as empty strings.
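If the original TAB-delimited output is no longer available, the delimiters can be restored with a short script. The following is a minimal sketch, not part of CWB: the function names spaces_to_tabs and convert_text are made up for illustration, and it assumes that every token line has exactly 10 columns and that no column value contains a space (true of the attached sample, but not guaranteed for CoNLL-U in general). Comment lines starting with "#" and blank lines are passed through unchanged.

```python
def spaces_to_tabs(line: str, n_cols: int = 10) -> str:
    """Re-delimit one space-separated CoNLL-U line with TABs.

    Assumption: no column value contains whitespace, so splitting on
    runs of whitespace recovers the original columns.
    """
    stripped = line.rstrip("\n")
    # Blank lines (sentence separators) and "#" comment lines are left as-is.
    if not stripped.strip() or stripped.startswith("#"):
        return stripped
    fields = stripped.split()
    if len(fields) != n_cols:
        # Refuse to guess when the column count is not what we expect.
        raise ValueError(
            f"expected {n_cols} columns, got {len(fields)}: {stripped!r}"
        )
    return "\t".join(fields)


def convert_text(text: str) -> str:
    """Apply spaces_to_tabs to every line of a whole file's contents."""
    return "\n".join(spaces_to_tabs(l) for l in text.splitlines()) + "\n"
```

To use it, read test.conllu into a string, pass it to convert_text, and write the result to a new file rather than overwriting the original. Raising an error on an unexpected column count is deliberate: a line with a space inside a column value should be inspected by hand rather than silently mis-split.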
Second, indexing with this command
>> cwb-encode -f test.conllu -d $DATA -R $INDEX -c ascii -L s -P lemma -P upos -P xpos -P feats -P head -P deprel -P deps -P misc
won't work because no instruction has been included to treat the first column as an ID number. So the first column will be indexed as "word" which is not correct for your purposes.
In short, if you have an ID number column, you must use either -n or -N.
The reason encoding failed for you is that using -n or -N is effectively a "promise" to cwb-encode that the 1st column will contain only digits 0-9 - but due to the spaces having replaced tabs (as noted above), the whole line is read as a single column that doesn't meet that requirement. Thus the error message on the first token line.
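The requirement just described can be checked before running cwb-encode. The sketch below is an illustration only: check_token_line is a hypothetical helper, and the two conditions it tests (10 TAB-separated columns, first column made up solely of the digits 0-9) are taken from the description above rather than from the cwb-encode source, so the real check may differ in detail.

```python
from typing import List


def check_token_line(line: str, n_cols: int = 10) -> List[str]:
    """Return a list of problems found in one token line (empty if none).

    Intended for token lines only; skip blank lines and "#" comments
    before calling this.
    """
    problems = []
    cols = line.rstrip("\n").split("\t")
    if len(cols) != n_cols:
        problems.append(
            f"{len(cols)} TAB-separated column(s), expected {n_cols}"
        )
    first = cols[0]
    # Accept only the ASCII digits 0-9 in the ID column.
    if not (first.isascii() and first.isdigit()):
        problems.append(f"first column {first!r} is not all digits 0-9")
    return problems
```

Run on a line from the attached file, this reports both problems at once, because the whole space-delimited line ends up as the first column. Note that CoNLL-U also permits IDs such as 1-2 (multiword tokens) and 1.1 (empty nodes); this check flags those, since they are not purely digits.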
I hope this helps, although I realise that (to put it mildly) a reply 6 months later probably is not what you had hoped for.
best
Andrew.
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Bruce McKee
Sent: 11 July 2022 14:19
To: cwb at sslmit.unibo.it
Subject: [CWB] Need help importing CONLL-U files into CWB
Hello CWB experts;
We would like to bring CONLL-U formatted corpora into Corpus Workbench v3.4.33, running under Ubuntu 20.04.4 LTS. The CONLL-U file is an excerpt from a Stanford STANZA <https://stanfordnlp.github.io/stanza/>-processed corpus.
We succeeded in encoding & indexing a small sample corpus test.conllu, but our cqp searches are not finding words. See the details below and the attached test.conllu file.
We also noticed that with cwb-encode, the -N id option triggers the following error (as does replacing it with just -n):
Invalid input line [1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 3 nsubj _ start_char=0|end_char=1], encoding aborted
[location of error: file test.conllu, line #4]
Thoughts on how we could resolve these problems?
Thanks!
--
Bruce McKee
Research Systems Consultant
System Administrator for the Phonetics & Computational Linguistics Lab
Department of Linguistics, Cornell University
--------------------------------------------------------------------------------------------------------------------------------------------------------
====================================
Our Encoding and indexing commands
====================================
DATA=/home/smith/cwb/data/test
REGISTRY=/home/smith/cwb/registry
INDEX=/home/smith/cwb/registry/test
mkdir $DATA
cwb-encode -f test.conllu -d $DATA -R $INDEX -c ascii -L s -P lemma -P upos -P xpos -P feats -P head -P deprel -P deps -P misc
cwb-make -r $REGISTRY -V TEST
====================================
Corpus Description command
====================================
cwb-describe-corpus -r $REGISTRY TEST
============================================================
Corpus: TEST
============================================================
description:
registry file: /home/smith/cwb/registry/test
home directory: /home/smith/cwb/data/test/
info file: /home/smith/cwb/data/test/.info
encoding: ascii
size (tokens): 69
9 positional attributes:
word lemma upos xpos
feats head deprel deps
misc
1 structural attributes:
s
0 alignment attributes:
===============================================================
cqp word searches (default cqp startup commands are in the .cqprc file)
===============================================================
$ cqp -e
System corpora:
E: EXAMPLE
T: TEST
[no corpus]> TEST;
TEST> info;
Size: 69
Charset: ascii
Properties:
language = '??'
charset = 'ascii'
No further information available about TEST
TEST> show cd;
===Context Descriptor=======================================
left context: 25 characters
right context: 25 characters
corpus position: shown
target anchors: not shown
Positional Attributes: * word
lemma
upos
xpos
feats
head
deprel
deps
misc
Structural Attributes: s
Aligned Corpora: <none>
============================================================
TEST> "same"
0 matches.
TEST> "getting"
0 matches.
TEST> [ lemma="the" ];
0 matches.
TEST>
===========================================================
Test file test.conllu (also attached to this e-mail)
===========================================================
# newdoc id = pcc_eng_test_1.0001_x00002
# sent_id = pcc_eng_test_1.0001_x00002_1
# text = I'm getting about the same thing trying to update "tf" (team fortress 2) on Ubuntu 7.10 (just updated it yesterday).
1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 3 nsubj _ start_char=0|end_char=1
2 'm be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 3 aux _ start_char=1|end_char=3
3 getting get VERB VBG Tense=Pres|VerbForm=Part 0 root _ start_char=4|end_char=11
4 about about ADV RB _ 7 advmod _ start_char=12|end_char=17
5 the the DET DT Definite=Def|PronType=Art 7 det _ start_char=18|end_char=21
6 same same ADJ JJ Degree=Pos 7 amod _ start_char=22|end_char=26
7 thing thing NOUN NN Number=Sing 3 obj _ start_char=27|end_char=32
8 trying try VERB VBG VerbForm=Ger 7 acl _ start_char=33|end_char=39
9 to to PART TO _ 10 mark _ start_char=40|end_char=42
10 update update VERB VB VerbForm=Inf 8 xcomp _ start_char=43|end_char=49
11 " " PUNCT `` _ 12 punct _ start_char=50|end_char=51
12 tf tf NOUN NN Number=Sing 10 obj _ start_char=51|end_char=53
13 " " PUNCT '' _ 12 punct _ start_char=53|end_char=54
14 ( ( PUNCT -LRB- _ 16 punct _ start_char=55|end_char=56
15 team team NOUN NN Number=Sing 16 compound _ start_char=56|end_char=60
16 fortress fortress NOUN NN Number=Sing 12 appos _ start_char=61|end_char=69
17 2 2 NUM LS NumType=Card 16 nummod _ start_char=70|end_char=71
18 ) ) PUNCT -RRB- _ 16 punct _ start_char=71|end_char=72
19 on on ADP IN _ 20 case _ start_char=73|end_char=75
20 Ubuntu Ubuntu PROPN NNP Number=Sing 10 obl _ start_char=76|end_char=82
21 7.10 7.10 NUM CD NumType=Card 20 nummod _ start_char=83|end_char=87
22 ( ( PUNCT -LRB- _ 24 punct _ start_char=88|end_char=89
23 just just ADV RB _ 24 advmod _ start_char=89|end_char=93
24 updated update VERB VBD Tense=Past|VerbForm=Part 3 parataxis _ start_char=94|end_char=101
25 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 24 obj _ start_char=102|end_char=104
26 yesterday yesterday NOUN NN Number=Sing 24 obl:tmod _ start_char=105|end_char=114
27 ) ) PUNCT -RRB- _ 24 punct _ start_char=114|end_char=115
28 . . PUNCT . _ 3 punct _ start_char=115|end_char=116
# sent_id = pcc_eng_test_1.0001_x00002_2
# text = DSL connection near Seattle, WA.
1 DSL dsl NOUN NN Number=Sing 2 compound _ start_char=117|end_char=120
2 connection connection NOUN NN Number=Sing 0 root _ start_char=121|end_char=131
3 near near ADP IN _ 4 case _ start_char=132|end_char=136
4 Seattle Seattle PROPN NNP Number=Sing 2 nmod _ start_char=137|end_char=144
5 , , PUNCT , _ 4 punct _ start_char=144|end_char=145
6 WA WA PROPN NNP Number=Sing 4 appos _ start_char=146|end_char=148
7 . . PUNCT . _ 2 punct _ start_char=148|end_char=149
# sent_id = pcc_eng_test_1.0001_x00002_3
# text = Come to think of it, might have been "Connection Closed", I'll have to check when I'm home in 10 hours.
1 Come come VERB VB Mood=Imp|VerbForm=Fin 0 root _ start_char=150|end_char=154
2 to to PART TO _ 3 mark _ start_char=155|end_char=157
3 think think VERB VB VerbForm=Inf 1 xcomp _ start_char=158|end_char=163
4 of of ADP IN _ 5 case _ start_char=164|end_char=166
5 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 3 obl _ start_char=167|end_char=169
6 , , PUNCT , _ 11 punct _ start_char=169|end_char=170
7 might might AUX MD VerbForm=Fin 11 aux _ start_char=171|end_char=176
8 have have AUX VB VerbForm=Inf 11 aux _ start_char=177|end_char=181
9 been be AUX VBN Tense=Past|VerbForm=Part 11 cop _ start_char=182|end_char=186
10 " " PUNCT `` _ 11 punct _ start_char=187|end_char=188
11 Connection connection NOUN NN Number=Sing 1 parataxis _ start_char=188|end_char=198
12 Closed close VERB VBN Tense=Past|VerbForm=Part 11 acl _ start_char=199|end_char=205
13 " " PUNCT '' _ 11 punct _ start_char=205|end_char=206
14 , , PUNCT , _ 1 punct _ start_char=206|end_char=207
15 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 17 nsubj _ start_char=208|end_char=209
16 'll will AUX MD VerbForm=Fin 17 aux _ start_char=209|end_char=212
17 have have VERB VB VerbForm=Inf 1 parataxis _ start_char=213|end_char=217
18 to to PART TO _ 19 mark _ start_char=218|end_char=220
19 check check VERB VB VerbForm=Inf 17 xcomp _ start_char=221|end_char=226
20 when when SCONJ WRB PronType=Int 23 mark _ start_char=227|end_char=231
21 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 23 nsubj _ start_char=232|end_char=233
22 'm be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 23 cop _ start_char=233|end_char=235
23 home home ADV RB _ 19 advcl _ start_char=236|end_char=240
24 in in ADP IN _ 26 case _ start_char=241|end_char=243
25 10 10 NUM CD NumType=Card 26 nummod _ start_char=244|end_char=246
26 hours hour NOUN NNS Number=Plur 23 obl _ start_char=247|end_char=252
27 . . PUNCT . _ 1 punct _ start_char=252|end_char=253