[CWB] Creating and importing Cyrillic corpus in CQPWeb

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Feb 28 17:06:16 CET 2017


Yes, it does indeed look like that directory, /var/cqpweb/index, is missing, along with all the data for the previously-existing corpora (which you will probably now find can no longer be queried). And this would certainly explain the problem in installation (the system can't put data into a folder that doesn't exist).

That directory is definitely there on the original image, however.

Can I ask, had you previously created and then deleted a corpus on this copy of the VM? There was a bug reported a while back (see thread that ends here: http://liste.sslmit.unibo.it/pipermail/cwb/2016-December/002575.html ) where the entire "index" folder is scrubbed on deletion of a corpus. So, it looks like you might be having another instance of that. 

Unfortunately, in that linked discussion, I was not able to work out why that bug was kicking in: I could see WHERE it was going wrong (it looks to be a malformed argument in a call to recursive_delete_directory() within the install_new_corpus() function in the file admin-install.inc.php) but not HOW. I've checked the code in v 3.2.11 (the version in the VM) as well as the current v 3.2.26 and it's the same story in both. I've also done various things attempting to reproduce the bug, and I've not been able to: everything I do causes an abort rather than deleting the CWB index folder. I'm stumped!
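
To illustrate the failure mode described above, here is a minimal sketch (in Python, not CQPweb's PHP) of the kind of guard that makes a recursive delete abort when it is handed a malformed path, for example one that collapses to the index folder itself. The names safe_recursive_delete and index_root are hypothetical and are not taken from the CQPweb source; this does not describe how CQPweb actually behaves.

```python
import shutil
from pathlib import Path


def safe_recursive_delete(target, index_root):
    """Delete `target` only if it is a directory strictly inside `index_root`.

    Hypothetical helper for illustration; not part of CQPweb.
    """
    root = Path(index_root).resolve()
    path = Path(target).resolve()
    # Refuse to delete the root itself, or anything that is not below it.
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to delete {path}: not strictly inside {root}")
    if not path.is_dir():
        raise ValueError(f"refusing to delete {path}: not a directory")
    shutil.rmtree(path)
```

With a guard of this shape, a call whose corpus-name component is empty (so that the target is just the index root) raises an error instead of scrubbing every corpus.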

Anyway, to restore your "index" to allow corpora to be installed, run these commands:

sudo mkdir /var/cqpweb/index
sudo chown www-data:www-data /var/cqpweb/index
sudo chmod 0775 /var/cqpweb/index

That should allow you to index a corpus again. And you can prevent it from being deleted again by avoiding re-running corpus installation of a corpus that you've previously indexed (if you re-index something, use a different corpus name).
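
As a rough way to confirm that the folder is back in the state the three commands above aim for, a small check along these lines could be run on the VM. It is only a sketch: the function name check_index_dir is made up for illustration, and the expected owner (www-data) and mode (0775) are simply the values from the commands above.

```python
import os
import pwd
import stat


def check_index_dir(path, expected_mode=0o775, expected_owner=None):
    """Return a list of problems found with the index directory (empty if none)."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    problems = []
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if mode != expected_mode:
        problems.append(f"mode is {mode:04o}, expected {expected_mode:04o}")
    if expected_owner is not None:
        owner = pwd.getpwuid(st.st_uid).pw_name
        if owner != expected_owner:
            problems.append(f"owner is {owner}, expected {expected_owner}")
    return problems


if __name__ == "__main__":
    # Path and owner are the ones used in the commands above.
    for problem in check_index_dir("/var/cqpweb/index", expected_owner="www-data"):
        print(problem)
```

An empty result means the directory exists with the expected mode (and owner, if one was given); it says nothing about whether the data for previously installed corpora is still there.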

And if anyone else has observed this (PRETTY DARN CRITICAL) bug and has any more detailed error reports, especially evidence from PHP warnings printed to an http daemon log (e.g. /var/log/apache/error.log) at the moment the index-folder deletion occurred, that would be very much appreciated.

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikolche Mickoski
Sent: 27 February 2017 18:58
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Andrew,

The output of the commands is the following:

user at CQPwebInABox:~$ ls - al /var/cqpweb/index
ls: cannot access -: No such file or directory
ls: cannot access al: No such file or directory
ls: cannot access /var/cqpweb/index: No such file or directory
user at CQPwebInABox:~$ ls - al /var/cqpweb/registry
ls: cannot access -: No such file or directory
ls: cannot access al: No such file or directory
/var/cqpweb/registry:
bncsampler  bncsampler__freq  lcmc  lcmc__freq
user at CQPwebInABox:~$ 

It looks like the "index" folder is missing. 

I will try to use the Esmeralda VM image on another computer and will tell
you if the problem occurs again.

Best,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Hardie, Andrew
Sent: Monday, February 27, 2017 5:31 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Nikolche,

Sorry for the delay getting onto this.

I have attempted to reproduce the behaviour using a scratch duplicate of the
Esmeralda VM image. Unfortunately, I've not been able to - everything seems
to work fine, the http daemon is able to create the directories it needs
when I test that separately, and the data indexes properly.

So, 3 things. First, could you re-send me that error log that you sent on
Feb 7th? (off list). I seem to have lost the original attachment, sorry. 

Second. Could you also run the following commands and send along the output:

ls -al /var/cqpweb/index
ls -al /var/cqpweb/registry/


Third, a note to the crowd - has anyone else encountered this error - either
with the VM image, or otherwise?

Thanks 

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 22 February 2017 18:53
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Andrew,

Did you manage to check this?

Thank you,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Hardie, Andrew
Sent: Tuesday, February 07, 2017 10:50 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Then it's probably my fault! I will investigate.

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 07 February 2017 21:46
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

I'm using CQPWeb in a box (Esmeralda). I didn't change the settings.

Best,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Hardie, Andrew
Sent: Tuesday, February 07, 2017 10:45 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Does the username of your http daemon have the right file permissions to
create directories in /var/cqpweb/index? 

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 07 February 2017 21:39
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi again,

I tried to index a .vrt file via the CQPweb interface and I got the following
error message (I'm attaching the error log file):

CQPweb encountered an error and could not continue. cwb-encode reported an
error! Corpus indexing aborted.
cwb-encode -xsB -c utf8 -d /var/cqpweb/index/zzz -f /var/cqpweb/upload/test3 -R "/var/cqpweb/registry/zzz" -S text:0+id 2>&1
Error: data directory '/var/cqpweb/index/zzz' does not exist. Please create
this directory first.

I tried with several files, but I'm getting the same error.

Best,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Hardie, Andrew
Sent: Wednesday, January 25, 2017 5:54 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Nikolche

If I read the tutorial by JMMM correctly, the texts2corpus.py script he
supplies is purely for the purpose of merging multiple .vrt files into one.

So (a) you don't need to do this if you only have one file, you can just go
ahead and index; (b) even if you have more than one file, this can be
accomplished just as easily with Unix "cat".

(e.g. "cat folder-with-files/*.vrt > merged-input.vrt")
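
For anyone working where "cat" is not available, the same merge can be sketched in a few lines of Python. This is an illustration only; merge_vrt is a made-up name and is not part of the tutorial or of CWB. Files are copied as raw bytes, so the character encoding of the input is left untouched, and, as with "cat", nothing is inserted between files.

```python
from pathlib import Path


def merge_vrt(folder, out_path):
    """Concatenate every .vrt file in `folder` into `out_path`, in name order."""
    sources = sorted(Path(folder).glob("*.vrt"))
    out = Path(out_path).resolve()
    merged_names = []
    with open(out, "wb") as merged:
        for src in sources:
            if src.resolve() == out:
                continue  # never read the output file back into itself
            merged.write(src.read_bytes())
            merged_names.append(src.name)
    return merged_names
```

For example, merge_vrt("folder-with-files", "merged-input.vrt") mirrors the command above and returns the names of the files it merged.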

As for this error: "No execution mode was defined for this document type:
text/plain."

I really cannot comment on this one without more info. Can you tell me
EXACTLY what you did to get this error message? (full list of steps
including what you entered on the command line etc.)

And a final note: indexing via the CWB commandline programs vs. indexing via
the CQPweb interface is either/or: you don't need to do both.

Thanks

best

Andrew.

PS sorry for the slight delay in replying.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 22 January 2017 20:41
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Andrew,

Thank you for the explanation, but unfortunately, I wasn't able to create
the corpus :(

I created a single column file in Unix format and inserted it in the test
folder, but nothing happens when I click texts2corpus.py. 

I also followed the Corpus Encoding Tutorial, but I got the following error:
No execution mode was defined for this document type: text/plain.

It looks like I need help converting a plain text file into a single-column
file with the required sentence tags, which can be used with CWB.

Thank you,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of cwb-request at sslmit.unibo.it
Sent: Wednesday, January 18, 2017 5:56 PM
To: cwb at sslmit.unibo.it
Subject: CWB Digest, Vol 120, Issue 11


Message: 1
Date: Wed, 18 Jan 2017 16:55:46 +0000
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: Open source development of the Corpus WorkBench
	<cwb at sslmit.unibo.it>
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb
Message-ID:
	<28078EC3FBF1B940A3EF3D0D19BE351D7FC07688 at EX-1-MB2.lancs.local>
Content-Type: text/plain; charset="utf-8"

Hi Nikolche,

OK, first, some notes on the steps you've taken so far.


* First the Martinez tutorial - I didn't actually know this existed!
  If I remember, I will write to the author at some point to ask if I can
  borrow bits of the text for the official CQPweb manual. It's an excellent
  introductory guide.


* Second, I have a reasonably complete set of TreeTagger parameter
  sets, but none for Macedonian as it is not available via Schmid's site, so I
  cannot attempt to diagnose the problems you have been having with it, sorry!
  Also, I'm not familiar myself with MorphAdorner.


* Third, while the Multext East lexicon will give you the language
  resources to build into a POS tagger, I am not sure if it comes with
  software to generate POS tagged / lemmatised output, or what output format
  it generates if it does?

With that out of the way: the critical point for CQPweb indexing is that the
data must be in the correct input format.

The general CWB input format is described on pg 2 of the encoding tutorial:
http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf

The additional requirements for CQPweb are described in my paper on the
matter in IJCL - see
here<http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004>
(canonical link) or
here<http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf> (open link) -
especially the example on pg 390.

Basically you need to get your text into the correct columnar format with
one word per line, with the raw token from the text in col 1, and other
annotations (tag, lemma etc.) delimited by tabs. Then you need to make sure
that texts have the correct <text id="ID_CODE"> tags before them and </text>
at the end. (With the XML tags on separate lines).
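
As a rough aid for checking the <text> wrapping just described before uploading a file, something like the following could be used. It is a sketch under stated assumptions, not a CWB or CQPweb tool: check_text_tags is a made-up name, and it only looks at the <text id="..."> and </text> lines, ignoring any other XML.

```python
import re

# A <text> start tag on a line of its own, with a non-empty id attribute.
TEXT_OPEN = re.compile(r'^<text id="([^"]+)">$')


def check_text_tags(lines):
    """Return a list of problems with <text id="..."> ... </text> wrapping."""
    problems = []
    open_id = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        match = TEXT_OPEN.match(line)
        if match:
            if open_id is not None:
                problems.append(f"line {number}: <text> opened inside text {open_id}")
            open_id = match.group(1)
        elif line == "</text>":
            if open_id is None:
                problems.append(f"line {number}: </text> without an open <text>")
            open_id = None
        elif open_id is None and line and not line.startswith("<"):
            # A token line that is not enclosed by any <text> element.
            problems.append(f"line {number}: token outside any <text> element")
    if open_id is not None:
        problems.append(f"end of input: text {open_id} is never closed")
    return problems
```

It can be run over an open file object, e.g. check_text_tags(open("merged-input.vrt", encoding="utf-8")); an empty list means every token sits between a <text id="..."> line and a </text> line.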

All other XML is optional.

Some taggers will produce the correct columnar format (TreeTagger does) but
they may not guarantee the correct <text> tags. Other taggers will require
you to manipulate their output into columnar format.

For your first experiment in indexing, can I recommend that you try indexing
a file just with a single "word" column, and make sure that works properly
before going on to more complex formats with tags, lemmas, etc? To create
such a file it is merely necessary to get every word onto a separate line,
with no whitespace except the line delimiters (in Unix format!). You can do
this effectively with regular expression global search and replace.
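
As an alternative to search and replace, the same words-only file can be sketched in Python. This is an illustration under simplifying assumptions: text_to_vrt and write_vrt are made-up names, tokens are split on whitespace only (so punctuation stays attached to words), and characters such as "<" or "&" in the text are not escaped.

```python
def text_to_vrt(text, text_id):
    """Wrap the whitespace-separated tokens of `text` in a single <text> element."""
    lines = [f'<text id="{text_id}">']
    lines.extend(text.split())  # split() drops all whitespace, including blank lines
    lines.append("</text>")
    return "\n".join(lines) + "\n"


def write_vrt(text, text_id, out_path):
    # newline="\n" keeps Unix line endings even on Windows; UTF-8 handles Cyrillic.
    with open(out_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text_to_vrt(text, text_id))
```

For instance, text_to_vrt("Здраво свету", "t1") gives a four-line file: the <text id="t1"> line, one line per word, and the closing </text>.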

Once you have a proper input file, you should be able to follow the
instructions in the "simple" method of indexing (as specified in the
tutorial you
referenced<http://chozelinek.github.io/sacoco/cqpwebsetup.html>) and get
your corpus up and running. For a words-only corpus with no XML other than
<text id="...">...</text>, you can leave the S-attribute and P-attribute
specification forms empty.

Hope this helps, but feel free to ask the list again if you have further
questions, and either I or another reader will answer!

best

Andrew.




From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Nikolche Mickoski
Sent: 16 January 2017 17:34
To: cwb at sslmit.unibo.it
Subject: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hello,

I'm trying to create a corpus in the CQPWeb for Macedonian language and I
would like to ask for your help.

I've installed CQPWeb in a box (Esmeralda). I tried to follow CQPweb Admin
Manual, CWB Encoding Tutorial and Martínez tutorial
(http://chozelinek.github.io/sacoco/cqpwebsetup.html) but in vain. I tried
to annotate the corpus with TreeTagger but I failed. I was able to parse
into sentences small texts with MorphAdorner but I still don't know how I
can use them with CQPWeb.

I obtained MULTEXT-East non-commercial lexicon for Macedonian
(https://www.clarin.si/repository/xmlui/handle/11356/1042) containing over 1
million tagged lemmas. I've extracted Macedonian dump file of Wikipedia from
dumps.wikimedia.org with Wikipedia Extractor. I did all the preparatory
work, but I wasn't able to create the corpus in CQPWeb.

After I tried everything I could get my hands on, I decided to write to you
and ask for your help. I really hope that you can spare some time to help me
with this.

Thank you very much,
Nikolche

Nikolche Mickoski
Translator/Interpreter
GSM +389 70 357 406
nmickoski at gmail.com

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

