[CWB] Creating and importing Cyrillic corpus in CQPWeb

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Feb 27 17:31:05 CET 2017


Hi Nikolche,

Sorry for the delay getting onto this.

I have attempted to reproduce the behaviour using a scratch duplicate of the Esmeralda VM image. Unfortunately, I've not been able to: everything seems to work fine, the http daemon is able to create the directories it needs when I test that separately, and the data indexes properly.

So, three things. First, could you re-send me (off list) the error log that you sent on Feb 7th? I seem to have lost the original attachment, sorry.

Second, could you also run the following commands and send along the output:

ls -al /var/cqpweb/index
ls -al /var/cqpweb/registry/


Third, a note to the crowd: has anyone else encountered this error, either with the VM image or otherwise?

Thanks 

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikolche Mickoski
Sent: 22 February 2017 18:53
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Andrew,

Did you manage to check this?

Thank you,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Hardie, Andrew
Sent: Tuesday, February 07, 2017 10:50 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Then it's probably my fault! I will investigate.

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 07 February 2017 21:46
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

I'm using CQPWeb in a box (Esmeralda). I didn't change the settings.

Best,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Hardie, Andrew
Sent: Tuesday, February 07, 2017 10:45 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Does the user that your http daemon runs as have the right file permissions
to create directories in /var/cqpweb/index?
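
One way to test this by hand is to try creating and then removing a scratch
directory. What follows is only a sketch: the function name can_create_in and
the probe directory name are made up for illustration, and it only tells you
something useful if it is run as the user your web server runs as (which
varies between systems and is not assumed here).

```shell
# Hypothetical helper: succeeds if the current user can create a
# subdirectory inside the given parent directory.
can_create_in() {
    parent=$1
    probe="$parent/.cqpweb_probe_$$"   # scratch name; $$ = current shell PID
    if mkdir "$probe" 2>/dev/null; then
        rmdir "$probe"                 # clean up the scratch directory
        return 0
    fi
    return 1
}

# Example use (run as the web server's user):
#   can_create_in /var/cqpweb/index && echo "can create" || echo "cannot create"
```

A failure here can mean either that the parent directory is missing or that
the user lacks write permission on it; "ls -al" on the parent will show which.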

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 07 February 2017 21:39
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi again,

I tried to index a .vrt file via the CQPweb interface and got the following
error message (I'm attaching the error log file):

CQPweb encountered an error and could not continue. cwb-encode reported an
error! Corpus indexing aborted.

cwb-encode -xsB -c utf8 -d /var/cqpweb/index/zzz -f /var/cqpweb/upload/test3 -R "/var/cqpweb/registry/zzz" -S text:0+id 2>&1

Error: data directory '/var/cqpweb/index/zzz' does not exist. Please create
this directory first.

I tried with several files, but I'm getting the same error.

Best,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Hardie, Andrew
Sent: Wednesday, January 25, 2017 5:54 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Nikolche

If I read the tutorial by JMMM correctly, the texts2corpus.py script he
supplies is purely for the purpose of merging multiple .vrt files into one.

So (a) you don't need to do this if you only have one file, you can just go
ahead and index; (b) even if you have more than one file, this can be
accomplished just as easily with Unix "cat".

(e.g. "cat folder-with-files/*.vrt > merged-input.vrt")
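
One thing to watch with plain "cat": if an input file does not end with a
newline, its last line gets glued onto the first line of the next file. A
minimal sketch that guards against this is below; merge_vrt is a hypothetical
helper name for illustration, not part of CWB or of the tutorial.

```shell
# Hypothetical helper: merge_vrt OUTPUT INPUT...
# Concatenates the inputs into OUTPUT, adding a newline after any input
# file whose last byte is not already a newline.
merge_vrt() {
    out=$1
    shift
    : > "$out"                          # create or truncate the output file
    for f in "$@"; do
        cat "$f" >> "$out"
        # $(...) strips trailing newlines, so this is non-empty only when
        # the file's last byte is something other than a newline.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n' >> "$out"
        fi
    done
}

# Example use:
#   merge_vrt merged-input.vrt folder-with-files/*.vrt
```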

As for this error: "No execution mode was defined for this document type:
text/plain."

I really cannot comment on this one without more info. Can you tell me
EXACTLY what you did to get this error message? (full list of steps
including what you entered on the command line etc.)

And a final note: indexing via the CWB command-line programs vs. indexing via
the CQPweb interface is either/or; you don't need to do both.

Thanks

best

Andrew.

PS sorry for the slight delay in replying.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 22 January 2017 20:41
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Andrew,

Thank you for the explanation, but unfortunately, I wasn't able to create
the corpus :(

I created a single-column file in Unix format and put it in the test
folder, but nothing happens when I click texts2corpus.py.

I also followed the Corpus Encoding Tutorial, but I got the following error:
No execution mode was defined for this document type: text/plain.

It looks like I need help converting a plain text file into a single-column
file with the required sentence tags, which can be used with CWB.

Thank you,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of cwb-request at sslmit.unibo.it
Sent: Wednesday, January 18, 2017 5:56 PM
To: cwb at sslmit.unibo.it
Subject: CWB Digest, Vol 120, Issue 11


Today's Topics:

   1. Re: Creating and importing Cyrillic corpus in CQPWeb
      (Hardie, Andrew)


----------------------------------------------------------------------

Message: 1
Date: Wed, 18 Jan 2017 16:55:46 +0000
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: Open source development of the Corpus WorkBench
	<cwb at sslmit.unibo.it>
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb
Message-ID:
	<28078EC3FBF1B940A3EF3D0D19BE351D7FC07688 at EX-1-MB2.lancs.local>
Content-Type: text/plain; charset="utf-8"

Hi Nikolche,

OK, first, some notes on the steps you've taken so far.

  * First, the Martinez tutorial: I didn't actually know this existed! If I
    remember, I will write to the author at some point to ask if I can
    borrow bits of the text for the official CQPweb manual. It's an
    excellent introductory guide.

  * Second, I have a reasonably complete set of TreeTagger parameter sets,
    but none for Macedonian as it is not available via Schmid's site, so I
    cannot attempt to diagnose the problems you have been having with it,
    sorry! Also, I'm not familiar myself with MorphAdorner.

  * Third, while the Multext East lexicon will give you the language
    resources to build into a POS tagger, I am not sure if it comes with
    software to generate POS tagged / lemmatised output, or what output
    format it generates if it does?

With that out of the way: the critical point for CQPweb indexing is that the
data must be in the correct input format.

The general CWB input format is described on pg 2 of the encoding tutorial:
http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf

The additional requirements for CQPweb are described in my paper on the
matter in IJCL: see
here<http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004>
(canonical link) or
here<http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf> (open link),
especially the example on pg 390.

Basically you need to get your text into the correct columnar format with
one word per line, with the raw token from the text in col 1, and other
annotations (tag, lemma etc.) delimited by tabs. Then you need to make sure
that texts have the correct <text id="ID_CODE"> tags before them and </text>
at the end (with the XML tags on separate lines).

All other XML is optional.
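
As a rough sanity check before indexing, something like the following could
flag the most common structural problems. It is only a sketch under stated
assumptions: check_vrt is a hypothetical name, it checks nothing beyond the
<text id="..."> ... </text> wrapping described above, and it treats every
line beginning with "<" as XML, so a token that starts with "<" would be
skipped rather than checked.

```shell
# Hypothetical helper: check_vrt FILE
# Reports lines that break the <text id="..."> ... </text> wrapping and
# exits non-zero if any problem was found.
check_vrt() {
    awk '
        BEGIN { bad = 0; in_text = 0 }
        /^<text id="[^"]+">$/ {
            if (in_text) { print "line " NR ": nested <text>"; bad = 1 }
            in_text = 1; next
        }
        /^<\/text>$/ {
            if (!in_text) { print "line " NR ": </text> without <text>"; bad = 1 }
            in_text = 0; next
        }
        /^</ { next }                    # other XML is optional; not checked
        !in_text && NF {                 # a token line outside any <text>
            print "line " NR ": token outside <text>"; bad = 1
        }
        END {
            if (in_text) { print "unclosed <text> at end of file"; bad = 1 }
            exit bad
        }
    ' "$1"
}

# Example use:
#   check_vrt merged-input.vrt && echo "structure looks OK"
```

Because the patterns are anchored at the end of the line, a file saved with
Windows line endings will also be reported as having tokens outside <text>.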

Some taggers will produce the correct columnar format (TreeTagger does) but
they may not guarantee the correct <text> tags. Other taggers will require
you to manipulate their output into columnar format.

For your first experiment in indexing, can I recommend that you try indexing
a file just with a single "word" column, and make sure that works properly
before going on to more complex formats with tags, lemmas, etc? To create
such a file it is merely necessary to get every word onto a separate line,
with no whitespace except the line delimiters (in Unix format!). You can do
this effectively with regular expression global search and replace.
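
If a search-and-replace in an editor is awkward, the same thing can be
sketched on the command line. This is an illustration only: text_to_vrt is a
hypothetical name, the text ID is whatever you choose, and splitting is on
whitespace alone, so punctuation stays attached to the neighbouring word and
characters such as "&" or "<" in the text are not escaped.

```shell
# Hypothetical helper: text_to_vrt TEXT_ID < plain.txt > words.vrt
# Writes one whitespace-separated token per line, wrapped in a single
# <text id="..."> ... </text> pair, with Unix line endings.
text_to_vrt() {
    printf '<text id="%s">\n' "$1"
    # Strip a trailing carriage return (Windows line endings), then print
    # each whitespace-separated field on its own line.
    awk '{ sub(/\r$/, ""); for (i = 1; i <= NF; i++) print $i }'
    printf '</text>\n'
}

# Example use:
#   text_to_vrt doc1 < plain.txt > words.vrt
```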

Once you have a proper input file, you should be able to follow the
instructions in the "simple" method of indexing (as specified in the
tutorial you
referenced<http://chozelinek.github.io/sacoco/cqpwebsetup.html>) and get
your corpus up and running. For a words-only corpus with no XML other than
<text id="...">...</text>, you can leave the S-attribute and P-attribute
specification forms empty.

Hope this helps, but feel free to ask the list again if you have further
questions, and either I or another reader will answer!

best

Andrew.




From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 16 January 2017 17:34
To: cwb at sslmit.unibo.it
Subject: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hello,

I'm trying to create a corpus in CQPWeb for the Macedonian language and I
would like to ask for your help.

I've installed CQPWeb in a box (Esmeralda). I tried to follow the CQPweb
Admin Manual, the CWB Encoding Tutorial and the Martínez tutorial
(http://chozelinek.github.io/sacoco/cqpwebsetup.html) but in vain. I tried
to annotate the corpus with TreeTagger but I failed. I was able to split
small texts into sentences with MorphAdorner but I still don't know how I
can use them with CQPWeb.

I obtained the MULTEXT-East non-commercial lexicon for Macedonian
(https://www.clarin.si/repository/xmlui/handle/11356/1042) containing over 1
million tagged lemmas. I've extracted the Macedonian Wikipedia dump file from
dumps.wikimedia.org with Wikipedia Extractor. I did all the preparatory
work, but I wasn't able to create the corpus in CQPWeb.

After I tried everything I could get my hands on, I decided to write to you
and ask for your help. I really hope that you can spare some time to help me
with this.

Thank you very much,
Nikolche

Nikolche Mickoski
Translator/Interpreter
GSM +389 70 357 406
nmickoski at gmail.com<mailto:nmickoski at gmail.com>


------------------------------

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb


End of CWB Digest, Vol 120, Issue 11
************************************

-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2016.0.7996 / Virus Database: 4749/13739 - Release Date: 01/09/17
Internal Virus Database is out of date.
