[CWB] Problem with corpora on CQPweb

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Nov 20 16:20:35 CET 2015


Run the following command:

       locate CEQL.pm

You will probably find that it is under a different Perl path from the ones listed in your @INC: that is, under /usr/local/share/perl/5.1x.x rather than /usr/local/share/perl/5.20.2.

This problem arises when the version of Perl you are running now (5.20) is newer than the one you were running when you installed the CWB Perl modules (probably 5.16 or 5.18).

Solutions:


1.    Quick and dirty: copy the CWB Perl modules across from the 5.16/5.18 path to the equivalent locations under the 5.20 path.

2.    Proper: download CWB-perl again and reinstall from scratch.

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Katia Karanasiou
Sent: 20 November 2015 15:15
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Problem with corpora on CQPweb

Hello,
Thank you very much for your help.
I changed the permissions and now it creates the page for the corpus queries.
When I start a query on a specific corpus, it throws the following errors:

Base class package "CWB::CEQL" is empty.
(Perhaps you need to 'use' the module which defines that package first,
 or make that module available in @INC (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.20.2 /usr/local/share/perl/5.20.2 /usr/lib/x86_64-linux-gnu/perl5/5.20 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.20 /usr/share/perl/5.20 /usr/local/lib/site_perl .).
at ../lib/perl/cqpwebCEQL.pm line 27.
BEGIN failed--compilation aborted at ../lib/perl/cqpwebCEQL.pm line 27
Compilation failed in require at - line 2.
I've already installed Perl-CWB and I changed @INC to find the specific Perl module (using export PERL5LIB=/var/www/CQPweb-3.2.1/lib/perl/cqpwebCEQL.pm).
The CQPweb version is 3.2.1 and I installed Perl-CWB-2.2.102.

Any idea what the problem could be?
Thank you in advance.

Best regards,
Katia.



On Thu, Nov 19, 2015 at 3:39 PM, <cwb-request at sslmit.unibo.it> wrote:
Send CWB mailing list submissions to
        cwb at sslmit.unibo.it

To subscribe or unsubscribe via the World Wide Web, visit
        http://devel.sslmit.unibo.it/mailman/listinfo/cwb
or, via email, send a message with subject or body 'help' to
        cwb-request at sslmit.unibo.it

You can reach the person managing the list at
        cwb-owner at sslmit.unibo.it

When replying, please edit your Subject line so it is more specific
than "Re: Contents of CWB digest..."


Today's Topics:

   1. Problem with corpora on CQPweb (Katia Karanasiou)
   2. Re: Problem with corpora on CQPweb (Hardie, Andrew)
   3. Re: Problem with corpora on CQPweb (Hannah Kermes)
   4. Re: TEITOK (Maarten Janssen)


----------------------------------------------------------------------

Message: 1
Date: Thu, 19 Nov 2015 13:20:20 +0100
From: Katia Karanasiou <katia.kar6 at gmail.com>
To: cwb at sslmit.unibo.it
Subject: [CWB] Problem with corpora on CQPweb
Message-ID:
        <CAN8HmPAK+miztjKA3CfsGo=yEWGiDPiHWkL_zZivZvFSpQ6SNA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hello,

I used the "CQPweb Admin Control Panel" -> "Install new Corpus" option to
upload a new corpus to CQPweb. Although it encodes the input corpus and
creates the index files, the corpus does not appear on the CQPweb site.
Does anyone know what the problem could be?
Thank you.

Best regards,
Katia

------------------------------

Message: 2
Date: Thu, 19 Nov 2015 13:04:00 +0000
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: Open source development of the Corpus WorkBench
        <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Problem with corpora on CQPweb
Message-ID:
        <28078EC3FBF1B940A3EF3D0D19BE351D70C9A27F at EX-0-MB1.lancs.local>
Content-Type: text/plain; charset="utf-8"

Have you checked whether the username that the web server runs under has permissions to create folders and symlinks in the main folder of CQPweb?

best

Andrew.


------------------------------

Message: 3
Date: Thu, 19 Nov 2015 14:49:06 +0100
From: Hannah Kermes <h.kermes at mx.uni-saarland.de>
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Problem with corpora on CQPweb
Message-ID: <564DD352.5040906 at mx.uni-saarland.de>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"

I once forgot to set permissions or to make it visible.

Best
Hannah


------------------------------

Message: 4
Date: Thu, 19 Nov 2015 15:39:40 +0100
From: Maarten Janssen <maartenpt at gmail.com>
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] TEITOK
Message-ID: <EF9EC9F5-81F9-4650-868D-786E68E0CDE6 at gmail.com>
Content-Type: text/plain; charset=utf-8

Hi Stefan and Andrew,

thanks for the answers! Here are some responses:

> TEITOK looks like an excellent tool - can we put a link to the server on the CWB homepage?

Of course you can; I would be pleased if you did - the people that are using it seem quite pleased with it, so there is definitely a "market" for it.

> Also, having a mostly automated TEI converter program would be really useful.

TEITOK is not really a TEI converter, and depending on what you want to convert you have to follow a different path:

- The internal structure TEITOK uses is not really TEI, although it is TEI compliant; there are too many options in TEI to really work with it directly, and what is specifically not used is the P4+ style <w> elements where annotation is modeled as text-nodes under child nodes. Instead, it uses the "older" style of <w> where annotations are attributes (to make sure they are always strings), and calls them <tok> rather than <w> to avoid confusion (and since <w> typically excludes punctuation marks, while tokens do not). So to use TEITOK, you either have to start from a TEI file that is not tokenized (TEITOK has an XML tokenizer to create TEITOK-style tokenized TEI), or convert the TEI file to TEITOK style (in Ljubljana they wrote an XSLT that does exactly that), after which tt-cwb-encode will directly create a CQP corpus for you.

- tt-cwb-encode can be used to directly convert most TEI flavours to a CQP corpus (I should provide an example settings file with it to show how to convert a typical <w> style TEI file to CQP), but tt-cwb-encode does not tokenize, so for that you would need a file that IS already tokenized (and annotated), and you would have to specify exactly which information can be found where in your TEI file.

>>>> - the technical manual quite explicitly states that structures cannot embed or overlap; however, the logic of .rng files does not seem to invalidate that in any way.
>>
>> *Different* attributes can embed and overlap. But instances of one attribute can't embed with, or overlap with, other instances of the same attribute. And yes, it is not the structure of the binary files but rather the way they are used that prevents that.
>
> Well, the unpublished file format specification - which I assume you mean by the "logic of .rng files" - mandates that regions don't nest or overlap: the integer values in a .rng file must form an increasing sequence.  If you violate the file format, bad things will happen (i.e. undefined behaviour of CQP and the other CWB tools).

I have now fully implemented it and can confirm that this is indeed a hard requirement: if you create two overlapping ranges, one over tokens 4-6 with error_type="agreement" and one over tokens 5-7 with error_type="collocation" (generated, in the example I tried, from stand-off annotation files where ranges can overlap), then only token 7 will be a "collocation" error, while tokens 4-6 are only "agreement" errors. However, at least in simple tests, it does not seem to break CWB in any way: it just ignores any token inside a range <x> that was already inside another range <x>.
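
The constraint quoted above can be checked before any region is written. The following is only a minimal sketch, assuming that each region is an inclusive (start, end) pair of corpus positions and that "increasing sequence" means every region starts after the previous one ends; the names Region and regions_are_valid are hypothetical and are not part of CWB or TEITOK.

```cpp
#include <utility>
#include <vector>

// A region is an inclusive pair of corpus positions (start, end).
// Assumption: this mirrors the (start, end) pairs stored in a .rng file.
using Region = std::pair<int, int>;

// Hypothetical helper: returns true if the regions, taken in file order,
// satisfy start0 <= end0 < start1 <= end1 < ..., i.e. none of them nest
// or overlap.
bool regions_are_valid(const std::vector<Region>& regions) {
    int previous_end = -1;  // corpus positions start at 0
    for (const Region& r : regions) {
        if (r.first > r.second) return false;       // malformed region
        if (r.first <= previous_end) return false;  // overlaps or nests
        previous_end = r.second;
    }
    return true;
}
```

With the example above, the pair of ranges 4-6 and 5-7 would be rejected by this check, whereas 4-6 followed by 7-9 would pass.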

>> For that reason, TEITOK since this week uses a custom c++ application to directly build the files needed by cwb-makeall from the XML files.
>
> Does that mean you actually create the binary data files (in uncompressed form) from your application, without going through the appropriate CWB tools?  You shouldn't do that, and I can't think of any good reason for doing it.[*]  One of the obvious consequences is that any file format changes, such as those envisioned for CWB 4, will completely break your program, and it will be much harder to adapt than if you were using the CWB encoder tools.
>
> If you create .rng files with the appropriate cwb-s-encode utility, it will stop you from generating overlapping or nested regions.
>
> [*] Ok, there's one fairly good reason if you're dealing with very large corpora: it may be more efficient to write files directly than to open pipes to a large number of cwb-encode and cwb-s-encode backends.  But I'm really not sure that this makes up for the loss in maintainability and reliability.

Yes - tt-cwb-encode directly writes binary files; I initially wanted to use cwb-atoi (and later hence cwb-s-encode), but when I opened up the code for it, I saw that the conversion is so trivial that there was simply no need for the overhead (which would also involve making sure the application can be found, etc.). It is a simple function, which can easily be changed into a call to cwb-atoi on a major overhaul, or just implemented slightly differently (a direct copy would not really work, since tt-cwb-encode is C++ and not C):

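#include <arpa/inet.h>  // htonl
#include <cstdint>
#include <cstdio>

// Write a 32-bit integer in network (big-endian) byte order, CWB style
void write_network_number ( int towrite, FILE *stream ) {
        uint32_t i = htonl(static_cast<uint32_t>(towrite));
        fwrite(&i, sizeof i, 1, stream);
}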

The same holds for ranges, although that is of course slightly more complicated. However, most of the work is in finding out what range to write in the first place; the 10 lines for
void write_range ( int pos1, int pos2, string formkey )
do not really add to the complexity and can also be modified in the future when needed.
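
For illustration only, writing a single range could look roughly like the sketch below. It is not the actual write_range from tt-cwb-encode (whose body is not shown here): write_region and to_network_bytes are hypothetical names, and the sketch assumes that a region is stored as a (start, end) pair of 32-bit integers in network byte order. The bytes are assembled by hand so that the example does not depend on htonl().

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Encode a 32-bit value as four bytes, most significant byte first
// (network byte order), without relying on htonl().
std::array<unsigned char, 4> to_network_bytes(std::uint32_t value) {
    return {{
        static_cast<unsigned char>((value >> 24) & 0xFF),
        static_cast<unsigned char>((value >> 16) & 0xFF),
        static_cast<unsigned char>((value >> 8) & 0xFF),
        static_cast<unsigned char>(value & 0xFF),
    }};
}

// Hypothetical helper (not the author's write_range): append one region
// as a (start, end) pair of network-order integers. Returns false on a
// null stream, a malformed region, or a short write.
bool write_region(std::int32_t start, std::int32_t end, std::FILE* stream) {
    if (stream == nullptr || start < 0 || end < start) return false;
    const auto first = to_network_bytes(static_cast<std::uint32_t>(start));
    const auto second = to_network_bytes(static_cast<std::uint32_t>(end));
    return std::fwrite(first.data(), 1, first.size(), stream) == first.size() &&
           std::fwrite(second.data(), 1, second.size(), stream) == second.size();
}
```

Keeping the byte encoding in one small function means that a future change of file format would, under these assumptions, only touch that function.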

Also - I would hope that if CWB gets a major overhaul, the implementation of ranges could be rethought as well, which would probably mean that even cwb-s-encode would break. Here is a "suggestion":

Apart from allowing overlaps and/or nestings, the application of sattributes is hampered by the fact that they are so very different from pattributes, which means many of the nice functions on pattributes are not applicable to sattributes (I think even regex is not available for sattributes). In my opinion, the language would become much more expressive by blurring the distinction between p and s, and adopting a notation à la XPath where before the brackets you can indicate the range type (with nothing meaning a token), to allow for queries like

np[case="nominative|ergative"] [pos="V.*"]

and since these are ranges, they can of course be nested:

mwe[type="name" [pos="CC"]]

which seems to me not only more elegant than [pos="CC"] :: mwe_type="name" but should also be more expressive...

The difference from the current search style is not that big (and it should not affect backward compatibility), and since a new file format would require looking up data completely differently anyway, it might be worthwhile to take the opportunity to treat sattributes more like pattributes. In the current set-up they are very similar behind the scenes: the lexicon.idx file is largely the same as the .avx file and the .lexicon mimics the .avs file, the only real difference being that of course .corpus indicates positions and .rng ranges. However, internally they are treated very differently, and there is no range-based variant of .rvs for instance. But from the looks of it, there is little preventing sattributes from being treated mostly like pattributes. Of course, there would be major implications if you tried to implement that in the current CWB, but when making dramatic changes anyway, would it not be possible to look into that?

>>>> - ideally, the CQP tokens would directly point to indexes in the XML files to make it possible to efficiently extract the matching data directly from the XML files. An inelegant method would be to add two pattributes for this, but would there be any more elegant way to link tokens in CQP to ranges in external files?
>>
>> Not any that I can think of.
>
> Nor I.  But that's not surprising, given that XML itself doesn't have an elegant way of linking to external files and is forced to use XPointers or other verbose and horrible concoctions.
>
> You could store XML IDs of the relevant elements as p-attributes, or byte offsets into the XML files (for better efficiency and flexibility). None of these solutions is efficient in CWB 3; they'll be much better in CWB 4 with "raw string" and "integer" attribute types.

Keeping the IDs is what TEITOK (and CorpusWiki) have done from the start, and is why results from CQL queries link directly to their result in the XML file; however, when showing long lists of results, it would be very nice to be able to show the original XML context rather than the CQP output, since CQP does not do spacing, nor does it do typesetting. And every implementation I tried (including writing a dedicated app) still ends up being too slow for internet use: a list of 100 results takes several seconds to load, which is not acceptable. So what I was/am looking for is indeed a way to store byte offsets. But I'll then just put these either in a CQP pattribute or in an external index (potentially using the CWB format for coherence).
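
As a rough illustration of the byte-offset idea, an external index could hold one (begin, end) byte span per corpus position, so that the XML context of a match is a single substring lookup. This is only a sketch under stated assumptions: TokenSpan and xml_context are hypothetical names, the XML file is assumed to be already in memory, and nothing here reflects how TEITOK or CWB actually store such an index.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte span [begin, end) of one token inside the XML file.
// Assumption: the index holds one span per corpus position, in order.
struct TokenSpan {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: return the raw XML covering the tokens
// first..last (inclusive), or an empty string if the request does not
// fit the index or the file.
std::string xml_context(const std::string& xml,
                        const std::vector<TokenSpan>& index,
                        std::size_t first, std::size_t last) {
    if (first > last || last >= index.size()) return "";
    const std::size_t begin = index[first].begin;
    const std::size_t end = index[last].end;
    if (begin > end || end > xml.size()) return "";
    return xml.substr(begin, end - begin);
}
```

Because the slice is taken from the original file, spacing and markup between the tokens are preserved, which is the property missing from the CQP output described above.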




------------------------------


End of CWB Digest, Vol 106, Issue 18
************************************
