[CWB] Creating and importing Cyrillic corpus in CQPWeb

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Feb 7 22:50:19 CET 2017


Then it's probably my fault! I will investigate.

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikolche Mickoski
Sent: 07 February 2017 21:46
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

I'm using CQPWeb in a box (Esmeralda). I didn't change the settings.

Best,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hardie, Andrew
Sent: Tuesday, February 07, 2017 10:45 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Does the username of your http daemon have the right file permissions to create directories in /var/cqpweb/index? 
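
A quick way to answer that question is to try creating (and immediately removing) a scratch directory under the index folder. The following is only a rough sketch: the function name `can_create_subdirs` is made up for illustration, and it only tells you about the account it is run under, so it would need to be run as the web server's user (whatever that is on your system).

```python
import os
import tempfile


def can_create_subdirs(parent):
    """Return True if the current user can create a directory inside `parent`."""
    try:
        # Actually attempt the operation instead of guessing from permission bits.
        probe = tempfile.mkdtemp(prefix="cqpweb-probe-", dir=parent)
    except OSError:
        # Covers a missing parent directory as well as insufficient permissions.
        return False
    os.rmdir(probe)  # clean up the scratch directory again
    return True


if __name__ == "__main__":
    # Path taken from the question above.
    print(can_create_subdirs("/var/cqpweb/index"))
```

If this prints False when run as the http daemon's user, then either the directory is missing or that user is not allowed to write to it.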

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikolche Mickoski
Sent: 07 February 2017 21:39
To: 'Open source development of the Corpus WorkBench'
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi again,

I tried to index a .vrt file via the CQPweb interface and got the following error message (I'm attaching the error log file):

CQPweb encountered an error and could not continue.
cwb-encode reported an error! Corpus indexing aborted.
cwb-encode -xsB -c utf8 -d /var/cqpweb/index/zzz -f /var/cqpweb/upload/test3 -R "/var/cqpweb/registry/zzz" -S text:0+id 2>&1
Error: data directory '/var/cqpweb/index/zzz' does not exist. Please create this directory first.

I tried with several files, but I'm getting the same error.

Best,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hardie, Andrew
Sent: Wednesday, January 25, 2017 5:54 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Nikolche

If I read the tutorial by JMMM correctly, the texts2corpus.py script he supplies is purely for the purpose of merging multiple .vrt files into one.

So (a) you don't need to do this if you only have one file: you can just go ahead and index; (b) even if you have more than one file, this can be accomplished just as easily with Unix "cat".

(e.g. "cat folder-with-files/*.vrt > merged-input.vrt")
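
For anyone without a Unix shell to hand, roughly the same merge can be sketched in Python. This is an illustration only, not the texts2corpus.py script from the tutorial: the function name `merge_vrt` is hypothetical, and like "cat" it simply joins the files byte for byte in sorted filename order.

```python
import glob
import os
import shutil


def merge_vrt(pattern, out_path):
    """Concatenate all files matching `pattern` into `out_path`; return the inputs used."""
    out_abs = os.path.abspath(out_path)
    # Skip the output file itself in case it already exists and matches the pattern.
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    with open(out_path, "wb") as out:
        for path in paths:
            # Binary copy, so the UTF-8 bytes are passed through untouched.
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return paths
```

A call such as merge_vrt("folder-with-files/*.vrt", "merged-input.vrt") would then correspond to the "cat" line above.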

As for this error: "No execution mode was defined for this document type: text/plain."

I really cannot comment on this one without more info. Can you tell me EXACTLY what you did to get this error message? (full list of steps including what you entered on the command line etc.)

And a final note: indexing via the CWB command-line programs vs. indexing via the CQPweb interface is either/or: you don't need to do both.

Thanks

best

Andrew.

PS sorry for the slight delay in replying.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikolche Mickoski
Sent: 22 January 2017 20:41
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hi Andrew,

Thank you for the explanation, but unfortunately, I wasn't able to create the corpus :(

I created a single column file in Unix format and inserted it in the test folder, but nothing happens when I click texts2corpus.py. 

I also followed the Corpus Encoding Tutorial, but I got the following error:
No execution mode was defined for this document type: text/plain.

It looks like I need help converting a plain text file into a single-column file with the required sentence tags, so that it can be used with CWB.

Thank you,
Nikolche

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of cwb-request at sslmit.unibo.it
Sent: Wednesday, January 18, 2017 5:56 PM
To: cwb at sslmit.unibo.it
Subject: CWB Digest, Vol 120, Issue 11

Message: 1
Date: Wed, 18 Jan 2017 16:55:46 +0000
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: Open source development of the Corpus WorkBench
	<cwb at sslmit.unibo.it>
Subject: Re: [CWB] Creating and importing Cyrillic corpus in CQPWeb
Message-ID:
	<28078EC3FBF1B940A3EF3D0D19BE351D7FC07688 at EX-1-MB2.lancs.local>
Content-Type: text/plain; charset="utf-8"

Hi Nikolche,

OK, first, some notes on the steps you've taken so far.

* First the Martinez tutorial - I didn't actually know this existed! If I remember, I will write to the author at some point to ask if I can borrow bits of the text for the official CQPweb manual. It's an excellent introductory guide.

* Second, I have a reasonably complete set of TreeTagger parameter sets, but none for Macedonian as it is not available via Schmid's site, so I cannot attempt to diagnose the problems you have been having with it, sorry! Also, I'm not familiar myself with MorphAdorner.

* Third, while the Multext East lexicon will give you the language resources to build into a POS tagger, I am not sure if it comes with software to generate POS tagged / lemmatised output, or what output format it generates if it does?

With that out of the way: the critical point for CQPweb indexing is that the data must be in the correct input format.

The general CWB input format is described on pg 2 of the encoding tutorial:
http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf

The additional requirements for CQPweb are described in my paper on the matter in IJCL - see here <http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004> (canonical link) or here <http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf> (open link) - especially the example on pg 390.

Basically, you need to get your text into the correct columnar format: one word per line, with the raw token from the text in col 1 and other annotations (tag, lemma etc.) in further columns delimited by tabs. Then you need to make sure that each text has the correct <text id="ID_CODE"> tag before it and </text> at the end (with the XML tags on separate lines).

All other XML is optional.
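
To make those requirements concrete, here is a small sanity check for an input file, written as a sketch. It only tests the points mentioned in this message (tokens sit inside <text id="..."> ... </text>, the tags are on their own lines, no Windows line endings); the function name and the exact pattern used for the id are assumptions for illustration, not CQPweb's own validation rules.

```python
import re

# A <text> start tag on a line of its own; the id pattern is a loose assumption.
TEXT_OPEN = re.compile(r'^<text id="[^"]+">$')


def check_vrt_lines(lines):
    """Return a list of (line_number, problem) pairs; an empty list means no problem found."""
    problems = []
    inside_text = False
    for n, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.endswith("\r"):
            problems.append((n, "Windows (CRLF) line ending"))
            line = line.rstrip("\r")
        if not line.strip():
            continue  # blank lines are not judged by this sketch
        if TEXT_OPEN.match(line):
            if inside_text:
                problems.append((n, "<text> opened twice"))
            inside_text = True
        elif line == "</text>":
            if not inside_text:
                problems.append((n, "</text> without <text>"))
            inside_text = False
        elif line.startswith("<") and line.endswith(">"):
            continue  # any other XML is optional and not checked here
        elif not inside_text:
            problems.append((n, "token outside <text>"))
    if inside_text:
        problems.append((len(lines), "<text> never closed"))
    return problems
```

Passing it the lines of a file (for example via open(path, encoding="utf-8", newline="") so that line endings are not silently converted) gives a list of line numbers to look at before attempting to index.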

Some taggers will produce the correct columnar format (TreeTagger does) but they may not guarantee the correct <text> tags. Other taggers will require you to manipulate their output into columnar format.

For your first experiment in indexing, can I recommend that you try indexing a file just with a single "word" column, and make sure that works properly before going on to more complex formats with tags, lemmas, etc? To create such a file it is merely necessary to get every word onto a separate line, with no whitespace except the line delimiters (in Unix format!). You can do this effectively with regular expression global search and replace.
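
As an illustration of that regular-expression approach, a words-only file could be produced along the following lines. This is a sketch under stated assumptions: the tokenisation rule (runs of letters/digits, with each punctuation mark split off as its own token) is a naive choice of mine, the function names and the text id are hypothetical, and characters such as "<" or "&" inside the text are not given any special treatment.

```python
import re

# \w is Unicode-aware in Python 3, so Cyrillic words are matched as whole tokens.
TOKEN = re.compile(r"\w+|[^\w\s]")


def text_to_vrt(text, text_id):
    """Return `text` as one token per line, wrapped in <text id="..."> ... </text>."""
    tokens = TOKEN.findall(text)
    lines = ['<text id="%s">' % text_id] + tokens + ["</text>"]
    return "\n".join(lines) + "\n"


def convert_file(in_path, out_path, text_id):
    """Read a UTF-8 plain text file and write the one-word-per-line version."""
    with open(in_path, encoding="utf-8") as src:
        vrt = text_to_vrt(src.read(), text_id)
    # newline="\n" forces Unix line endings, even when run on Windows.
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        out.write(vrt)
```

For example, text_to_vrt("Ова е тест.", "t1") gives a <text id="t1"> line, then the four tokens "Ова", "е", "тест" and "." on separate lines, then </text>.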

Once you have a proper input file, you should be able to follow the instructions in the "simple" method of indexing (as specified in the tutorial you referenced <http://chozelinek.github.io/sacoco/cqpwebsetup.html>) and get your corpus up and running. For a words-only corpus with no XML other than <text id="...">...</text>, you can leave the S-attribute and P-attribute specification forms empty.

Hope this helps, but feel free to ask the list again if you have further questions, and either I or another reader will answer!

best

Andrew.




From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikolche Mickoski
Sent: 16 January 2017 17:34
To: cwb at sslmit.unibo.it
Subject: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hello,

I'm trying to create a corpus in the CQPWeb for Macedonian language and I would like to ask for your help.

I've installed CQPWeb in a box (Esmeralda). I tried to follow CQPweb Admin Manual, CWB Encoding Tutorial and Martínez tutorial (http://chozelinek.github.io/sacoco/cqpwebsetup.html) but in vain. I tried to annotate the corpus with TreeTagger but I failed. I was able to parse into sentences small texts with MorphAdorner but I still don't know how I can use them with CQPWeb.

I obtained MULTEXT-East non-commercial lexicon for Macedonian (https://www.clarin.si/repository/xmlui/handle/11356/1042) containing over 1 million tagged lemmas. I've extracted Macedonian dump file of Wikipedia from dumps.wikimedia.org with Wikipedia Extractor. I did all the preparatory work, but I wasn't able to create the corpus in CQPWeb.

After I tried everything I could get my hands on, I decided to write to you and ask for your help. I really hope that you can spare some time to help me with this.

Thank you very much,
Nikolche

Nikolche Mickoski
Translator/Interpreter
GSM +389 70 357 406
nmickoski at gmail.com

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

