[CWB] Creating and importing Cyrillic corpus in CQPWeb

Wed Jan 18 17:55:46 CET 2017

Hi Nikolche,

OK, first, some notes on the steps you’ve taken so far.

· First, the Martínez tutorial – I didn’t actually know this existed! If I remember, I will write to the author at some point to ask if I can borrow bits of the text for the official CQPweb manual. It’s an excellent introductory guide.

· Second, I have a reasonably complete set of TreeTagger parameter sets, but none for Macedonian, as one is not available via Schmid’s site, so I cannot attempt to diagnose the problems you have been having with it, sorry! Also, I’m not familiar with MorphAdorner myself.

· Third, while the MULTEXT-East lexicon will give you the language resources to build into a POS tagger, I am not sure whether it comes with software to generate POS-tagged / lemmatised output, or what output format that software produces if it does…

With that out of the way: the critical point for CQPweb indexing is that the data must be in the correct input format.

The general CWB input format is described on pg 2 of the encoding tutorial: http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf

The additional requirements for CQPweb are described in my paper on the matter in IJCL – see http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004 (canonical link) or http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf (open link) – especially the example on pg 390.

Basically, you need to get your text into the correct columnar format: one word per line, with the raw token from the text in column 1 and the other annotations (tag, lemma, etc.) in further columns delimited by tabs. Then you need to make sure that each text has the correct <text id="ID_CODE"> tag before it and </text> at the end, with the XML tags on separate lines.

All other XML is optional.
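
As a rough illustration of that layout, here is a minimal Python sketch that wraps one text in <text> tags and writes one tab-delimited token per line. The function name to_vertical, the placeholder tag "X" and the restriction of IDs to ASCII letters, digits and underscore are my own assumptions for the example, not something taken from the CWB or CQPweb documentation.

```python
import re


def to_vertical(text_id, rows):
    """Return one text in tab-delimited columnar format, wrapped in <text> tags.

    rows is an iterable of tuples such as (word, tag, lemma); the raw token
    must be the first element. This helper is hypothetical, for illustration.
    """
    # Assumption: IDs are limited to ASCII letters, digits and underscore.
    if not re.fullmatch(r"[A-Za-z0-9_]+", text_id):
        raise ValueError(f"unsuitable text id: {text_id!r}")

    lines = [f'<text id="{text_id}">']  # opening tag on a line of its own
    for row in rows:
        for field in row:
            # A tab or line break inside a field would corrupt the columns.
            if not field or re.search(r"[\t\r\n]", field):
                raise ValueError(f"empty or malformed field in {row!r}")
        lines.append("\t".join(row))
    lines.append("</text>")  # closing tag on a line of its own
    return "\n".join(lines) + "\n"  # Unix line endings throughout


if __name__ == "__main__":
    # "X" is a placeholder, not a real MULTEXT-East tag.
    demo = [("Здраво", "X", "здраво"), ("свету", "X", "свет")]
    print(to_vertical("demo_01", demo), end="")
```

When writing the result to disk, opening the file with encoding="utf-8" and newline="\n" should keep both the encoding and the Unix line endings intact.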

Some taggers will produce the correct columnar format (TreeTagger does) but they may not guarantee the correct <text> tags. Other taggers will require you to manipulate their output into columnar format.

For your first experiment in indexing, can I recommend that you try indexing a file with just a single “word” column, and make sure that works properly before going on to more complex formats with tags, lemmas, etc.? To create such a file you only need to get every word onto a separate line, with no whitespace other than the line breaks (which must be Unix-style, i.e. LF rather than CR+LF). You can do this effectively with a regular-expression global search and replace.
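
If a scripted route is easier than search and replace in an editor, the same idea can be sketched in a few lines of Python. This is only a crude illustration: the token pattern (a run of word characters, or a single punctuation character) and the function name words_only are my own assumptions, and a proper tokeniser for Macedonian would be more careful.

```python
import re

# Assumption: a token is a run of word characters, or one punctuation mark.
# For str patterns in Python 3, \w is Unicode-aware, so Cyrillic letters match.
TOKEN = re.compile(r"\w+|[^\w\s]")


def words_only(text_id, raw_text):
    """Return a words-only input text: one token per line inside <text> tags."""
    tokens = TOKEN.findall(raw_text)  # all whitespace is discarded here
    lines = [f'<text id="{text_id}">'] + tokens + ["</text>"]
    return "\n".join(lines) + "\n"  # Unix line endings only


if __name__ == "__main__":
    print(words_only("demo_01", "Здраво, свету!"), end="")
```

Note that a bare "<" in the running text would end up on a line by itself and could be mistaken for markup, so real data may need such characters handled beforehand; the sketch does not attempt this.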

Once you have a proper input file, you should be able to follow the instructions for the “simple” method of indexing (as specified in the tutorial you referenced, http://chozelinek.github.io/sacoco/cqpwebsetup.html) and get your corpus up and running. For a words-only corpus with no XML other than <text id="…">…</text>, you can leave the S-attribute and P-attribute specification forms empty.

Hope this helps, but feel free to ask the list again if you have further questions, and either I or another reader will answer!

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikolche Mickoski
Sent: 16 January 2017 17:34
To: cwb at sslmit.unibo.it
Subject: [CWB] Creating and importing Cyrillic corpus in CQPWeb

Hello,

I’m trying to create a corpus in CQPWeb for the Macedonian language and I would like to ask for your help.

I’ve installed CQPWeb in a box (Esmeralda). I tried to follow CQPweb Admin Manual, CWB Encoding Tutorial and Martínez tutorial (http://chozelinek.github.io/sacoco/cqpwebsetup.html) but in vain. I tried to annotate the corpus with TreeTagger but I failed. I was able to parse into sentences small texts with MorphAdorner but I still don’t know how I can use them with CQPWeb.

I obtained MULTEXT-East non-commercial lexicon for Macedonian (https://www.clarin.si/repository/xmlui/handle/11356/1042) containing over 1 million tagged lemmas. I’ve extracted Macedonian dump file of Wikipedia from dumps.wikimedia.org with Wikipedia Extractor. I did all the preparatory work, but I wasn’t able to create the corpus in CQPWeb.

After I tried everything I could get my hands on, I decided to write to you and ask for your help. I really hope that you can spare some time to help me with this.

Thank you very much,
Nikolche

Nikolche Mickoski
Translator/Interpreter
GSM +389 70 357 406
nmickoski at gmail.com
