[CWB] Escape "<" and ">" symbols

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Feb 23 16:00:10 CET 2018


Hi Mansur,

To supplement Stefan’s reply…

>>: This suggests that you have CQP installed, but in a "private" path that's only visible to your user account and not to the Web server running CQPweb.  You may also need to configure CQPweb and set appropriate paths there.

Specifically – set the configuration variable $path_to_cwb. See admin manual page 24. This tells CQPweb where to find the CQP executable.

On export corpus -

>> AFAIK, only users with the "full access privilege" are allowed to download a corpus.  So if you want to disable downloads, simply keep to "normal access".

This is correct. Manual p 81.

>>6) When I press 'Show frequency information' I get:
Error # 1146: Table 'cqpweb_db.freq_corpus_smi_word' doesn't exist
Do I need to generate it somehow manually?

If you have not set up the frequency list, you can’t view it! See the “Manage frequency lists” option.

But note that frequency list setup requires the CQP / CWB executables, so it won’t work until you have fixed the executables problem.

>>8) What does mean all those 'Cannot be calculated'. What should I do to fix it?

The comments in brackets explain why each one cannot be calculated.

N of texts requires the text metadata to have been set up (either by adding metadata or creating a “minimalist” table.)
N of tokens is not set up until text metadata and frequency lists are generated.
N of types relies on the frequency lists.
Type token ratio relies on the previous 2.
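
As a rough illustration of how these figures depend on one another, here is a minimal Python sketch. It assumes the frequency list is simply a mapping from word form to count, and that the type-token ratio is types divided by tokens; the function name and the data layout are hypothetical, and CQPweb's own computation may differ in detail.

```python
from collections import Counter

def corpus_stats(tokens):
    """Return (n_tokens, n_types, type_token_ratio) for a list of word forms."""
    freq = Counter(tokens)          # stand-in for the frequency list
    n_tokens = sum(freq.values())   # total number of running words
    n_types = len(freq)             # number of distinct word forms
    # The ratio needs both of the previous figures, hence the dependency.
    ttr = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ttr
```

Without the frequency list there is nothing to count types from, which is why the last two figures stay unavailable until it has been built.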

>> [Wed Feb 21 20:48:48.580421 2018] [php7:warn] [pid 5262:tid 139681043830528] [client 127.0.0.1:59340] PHP Warning:  chmod(): Operation not permitted in /var/www/htdocs/cqpweb/lib/admin-install.inc.php on line 605, referer: http://localhost/cqpweb/adm/index.php?thisF=installCorpusIndexed&uT=y

This suggests you have a permissions problem.  The system is trying to call chmod() but is not allowed to. Possibly, the username your web server runs under does not have the necessary permissions for the web directory. See manual p 12.

>> [Wed Feb 21 20:50:04.431408 2018] [php7:warn] [pid 5262:tid 139679844263680] [client 127.0.0.1:59348] PHP Warning:  array_unshift() expects parameter 1 to be array, string given in /var/www/htdocs/cqpweb/lib/ceql.inc.php on line 260, referer: http://localhost/cqpweb/smi/index.php?thisQ=search&uT=y

This one is due to a bug, thanks for spotting it. I have fixed it.

>>2) After that I used the CQPweb, CQL search worked fine, but simple search didn't work:
>>Can't locate ../lib/perl/cqpwebCEQL.pm at - line 2.

My guess is that this is the same permissions issue. Your web server can’t locate the file EITHER because it does not have the right permissions for that file or its containing directory, OR because an issue relating to file permissions has stopped it running in the right location, resulting in the relative path being incorrect.

The easiest way to fix permissions globally is to move ownership of the CQPweb folder and all its tree to the username your web server runs under.

This relative path is a reference to the CQPweb internal code, so it should not need an entry in $perl_extra_directories (such entries should be used to locate non-CQPweb modules).
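
The ownership change suggested above could be sketched in Python as follows. This is only an illustration under assumptions: the helper name is made up, the user and group your web server runs under depend on your system, and changing ownership to another account normally requires root. A recursive chown from the shell does the same job.

```python
import os
import shutil

def chown_tree(root, user, group):
    """Recursively give user:group ownership of root and everything below it.

    user and group may be names or numeric ids. Symbolic links are followed.
    Returns the number of paths whose ownership was set.
    """
    shutil.chown(root, user=user, group=group)
    changed = 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            shutil.chown(os.path.join(dirpath, name), user=user, group=group)
            changed += 1
    return changed

# Hypothetical usage (run as root; "www-data" is an assumed account name):
# chown_tree("/var/www/htdocs/cqpweb", "www-data", "www-data")
```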

@ Stefan
>>         @Andrew: has this bug been fixed in the latest CQPweb code?

Tags are, to the best of my knowledge, HTML-escaped in the present code. HOWEVER, use of <…> may still confuse the code used to extract corpus XML for visualisation purposes.

@ Mansur
It really is best not to use < and > in tags!

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of mansur
Sent: 22 February 2018 09:55
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Escape "<" and ">" symbols

Hello, Stefan!
Thank you so much for the answers and advice! They clarified many things for me.

> You may also need to configure CQPweb and set appropriate paths there.
Could you, please, explain how I can do that?
Thank you!
Best,
Mansur

On 22 February 2018 at 11:52, Stefan Evert <stefanML at collocations.de> wrote:
Dear Mansur,

most of the remaining issues are related to CQPweb, so Andrew will be in a much better position to answer them and help you with the debugging.  Some of them are clearly (mis-)configuration issues, e.g. the failure to locate the CEQL backend that is part of CQPweb or the failure to run CQP.

Are you working with an up-to-date version of CQPweb checked out from the SVN repository?


> 3) After rebooting computer any search does not work at all:
> ERROR: CQP backend startup failed; the reported CQP version [] could not be parsed.
> But from the command line I can perform search with 'cqp -e' and it seems to be working, at least I can see search results.

This suggests that you have CQP installed, but in a "private" path that's only visible to your user account and not to the Web server running CQPweb.  You may also need to configure CQPweb and set appropriate paths there.

> 4) Is it possible to choose ranges of periods in search according to the 'date'?
> <text id="" date=?????>

I think Andrew is working on support for date attributes in CQPweb.

In plain CQP, there are two ways of doing date searches:

a) The reasonable way: Store your dates in a simple standard format – I prefer ISO YYYY-MM-DD, so alphabetical and chronological sort order are the same – and then construct regular expressions for your suitable date ranges, e.g. in the global constraint of a CQP query:

        … :: match.text_date = "2011-03.*";  # anything in March 2011

        … :: match.text_date = "1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))";  # 12 Jan 1990 .. 24 Mar 1990

b) The "I'm a Unix hacker" way: convert your dates to 32-bit integers and use numeric comparisons.  The obvious choice would be consecutive numbers for days (or even seconds as in Unix timestamps), but conversion from/to human-readable dates will be complicated.  However, you could encode the ISO format above _without_ the hyphens to get 8-digit numbers, e.g.

        <text id="…" date="20180222">

and then cast to integers for numerical comparisons:

        … :: int(match.text_date) >= 19900112 & int(match.text_date) <= 19900324;

Nice trick, isn't it?
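
To see that the two approaches select the same dates, here is a small Python sketch that checks them against every day of 1990. It is not CQP code: the regular expression is copied from the query in (a), the integer bounds from the query in (b), and the helper names are made up. It assumes CQP matches a regular expression against the whole attribute value, which is modelled here with `fullmatch`.

```python
import re
from datetime import date, timedelta

# Pattern for 12 Jan 1990 .. 24 Mar 1990, copied from approach (a).
RANGE_RE = re.compile(r"1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))")

def in_range_regex(iso_date):
    """Approach (a): regular expression over the ISO string (whole-string match)."""
    return RANGE_RE.fullmatch(iso_date) is not None

def in_range_int(iso_date):
    """Approach (b): drop the hyphens and compare the 8-digit number."""
    n = int(iso_date.replace("-", ""))
    return 19900112 <= n <= 19900324

def all_days(year):
    """Yield every calendar day of the given year as YYYY-MM-DD."""
    d = date(year, 1, 1)
    while d.year == year:
        yield d.isoformat()
        d += timedelta(days=1)

def disagreements(year=1990):
    """Days on which the two approaches give different answers."""
    return [d for d in all_days(year) if in_range_regex(d) != in_range_int(d)]
```

The regular expression would also accept impossible strings such as 1990-01-39, so it relies on the dates in the corpus being valid; the integer comparison has no such blind spot.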

> 5) When I press 'Show tags' button I get
> 2012_ нче_ елда_ республикада_ 55_ мең_ 839_ бала_ дөньяга_ килгән_ ._
> but no tags.

That's because CQPweb failed to do proper HTML-escaping for the annotation strings (which is not only inconvenient but also a security risk).

        @Andrew: has this bug been fixed in the latest CQPweb code?

I've been bitten by similar issues before and would recommend avoiding HTML metacharacters (and other funny things) in annotation strings.  Better recode to something like

        n:sg:px3sp:nom

or even

        |n|sg|px3sp|nom|

so you can use the "contains" operator in searches.
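
A recoding step like this could be run over the tags before encoding the corpus. The following Python sketch assumes the original tags look like `<n><sg><px3sp><nom>`; the function name and the `style` parameter are made up for illustration.

```python
import re

# One morphological feature wrapped in angle brackets, e.g. "<nom>".
TAG_RE = re.compile(r"<([^<>]+)>")

def recode(tag_string, style="colon"):
    """Rewrite "<n><sg><px3sp><nom>" without HTML metacharacters.

    style="colon" gives "n:sg:px3sp:nom";
    style="pipe" gives "|n|sg|px3sp|nom|".
    """
    parts = TAG_RE.findall(tag_string)
    if style == "colon":
        return ":".join(parts)
    if style == "pipe":
        # Leading and trailing bars included; no features yields a bare "|".
        return "|" + "".join(p + "|" for p in parts)
    raise ValueError("unknown style: " + style)
```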

> I think it is maybe because I didn't replace "<" and ">" in my morphological tags to their XML entities yet. Please, correct me if I'm wrong.

That won't help!  With -x, cwb-encode will decode the XML entities in your input file and you'll end up with < and > in the indexed corpus.  You could encode without the -x flag, but then your annotation strings will be

        &lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;

which happens to display nicely only until HTML escaping in CQPweb is fixed – and you will have to search for

        [pos = ".*&lt;nom&gt;.*"]

instead of

        [pos = ".*<nom>.*"]
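
The interaction between entity decoding at encoding time and HTML escaping at display time can be modelled in a few lines of Python. This is only a model using the standard `html` module, not the actual code of cwb-encode or CQPweb; the function names and flags are made up.

```python
from html import escape, unescape

# Annotation as written in the input file, with XML entities.
SOURCE = "&lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;"

def indexed_value(annotation, decode_entities):
    """What ends up in the index: decoded (as with -x) or stored verbatim."""
    return unescape(annotation) if decode_entities else annotation

def html_output(indexed, escaping_fixed):
    """What gets sent to the browser, with or without proper HTML escaping."""
    return escape(indexed, quote=False) if escaping_fixed else indexed
```

With decoding, the index holds `<n><sg><px3sp><nom>`, and unescaped output is read by the browser as markup, so nothing is shown. Without decoding, the index holds the entity string itself; unescaped output happens to render as the intended tags, but once escaping is applied the `&` is escaped too and the browser shows the literal `&lt;n&gt;…` text.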

> 7) I also saw the button 'Export corpus -> Export whole corpus'. Does that mean that users can download the whole corpus? Is it possible to turn it off somehow?

AFAIK, only users with the "full access privilege" are allowed to download a corpus.  So if you want to disable downloads, simply keep to "normal access".


Best,
Stefan

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb
