[CWB] Escape "<" and ">" symbols

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Mar 5 13:53:23 CET 2018


If you use | then you can treat the attribute as a feature set. This might be useful. You can see a description of what feature sets allow you to do in the encoding tutorial.

If you don’t care about it being a feature set, then you can use any character. People often do use : as a joiner, but there is no reason not to use ; or , instead if that makes more sense for your purposes. I’d suggest not using . because it is a regular expression metacharacter.
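
A minimal sketch of why "." is awkward as a joiner. It uses Python's `re` module as a stand-in for CQP's regular expressions and assumes whole-string matching, as CQP does for attribute values; the tag strings and the `matches` helper are made up for illustration.

```python
import re

# Hypothetical tag strings, joined with "." and with ":" respectively.
dotted = "n.sg.px3sp.nom"
coloned = "n:sg:px3sp:nom"

def matches(pattern: str, value: str) -> bool:
    """Whole-string match, mimicking how CQP matches attribute values."""
    return re.fullmatch(pattern, value) is not None

# An unescaped "." matches any character, so the pattern is too permissive.
print(matches(r"n.sg.*", dotted))        # True
print(matches(r"n.sg.*", "nxsg-foo"))    # True (probably unintended)

# Escaping fixes it, at the cost of noisier queries.
print(matches(r"n\.sg.*", "nxsg-foo"))   # False

# With ":" as the joiner, no escaping is needed.
print(matches(r"n:sg:.*", coloned))      # True
```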

best

Andrew.


From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of mansur
Sent: 05 March 2018 12:46
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Escape "<" and ">" symbols

Hello, Stefan, Andrew and others!!!
You advised using a tagging style like:

n:sg:px3sp:nom
or
n|sg|px3sp|nom
Is there any particular reason to use ":" or "|" instead of "<" or ">"? Is it possible to use "," (comma)? What do you usually use in your projects?

Thank you!
With best wishes,
Mansur

On 22 February 2018 at 11:52, Stefan Evert <stefanML at collocations.de> wrote:
Dear Mansur,

most of the remaining issues are related to CQPweb, so Andrew will be in a much better position to answer them and help you with the debugging.  Some of them are clearly (mis-)configuration issues, e.g. the failure to locate the CEQL backend that is part of CQPweb or the failure to run CQP.

Are you working with an up-to-date version of CQPweb checked out from the SVN repository?


> 3) After rebooting computer any search does not work at all:
> ERROR: CQP backend startup failed; the reported CQP version [] could not be parsed.
> But from the command line I can perform search with 'cqp -e' and it seems to be working, at least I can see search results.

This suggests that you have CQP installed, but in a "private" path that's only visible to your user account and not to the Web server running CQPweb.  You may also need to configure CQPweb and set appropriate paths there.

> 4) Is it possible to choose ranges of periods in search according to the 'date'?
> <text id="" date=?????>

I think Andrew is working on support for date attributes in CQPweb.

In plain CQP, there are two ways of doing date searches:

a) The reasonable way: Store your dates in a simple standard format (I prefer ISO YYYY-MM-DD, so that alphabetical and chronological sort order are the same) and then construct regular expressions for the date ranges you need, e.g. in the global constraint of a CQP query:

        … :: match.text_date = "2011-03.*";  # anything in March 2011

        … :: match.text_date = "1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))";  # 12 Jan 1990 .. 24 Mar 1990

b) The "I'm a Unix hacker" way: convert your dates to 32-bit integers and use numeric comparisons.  The obvious choice would be consecutive numbers for days (or even seconds, as in Unix timestamps), but conversion from/to human-readable dates will be complicated.  However, you could encode the ISO format above _without_ the hyphens to get 8-digit numbers, e.g.

        <text id="…" date="20180222">

and then cast to integers for numerical comparisons:

        … :: int(match.text_date) >= 19900112 & int(match.text_date) <= 19900324;

Nice trick, isn't it?
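
As a sanity check on the two approaches, here is a small Python sketch (not CQP) that runs the regular expression from (a) and the integer comparison from (b) over every day of 1990 and confirms they select the same dates. Python's `re.fullmatch` stands in for CQP's whole-value matching, and the helper names (`date_to_int`, `regex_hit`, `int_hit`, `all_days`) are made up for illustration. The same `date_to_int` conversion could be used when preparing the `date` attribute for encoding.

```python
import re
from datetime import date, timedelta

# Pattern from option (a): 12 Jan 1990 .. 24 Mar 1990.
PATTERN = re.compile(r"1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))")

def date_to_int(iso_date):
    """Turn 'YYYY-MM-DD' into the integer YYYYMMDD (raises ValueError on bad dates)."""
    parsed = date.fromisoformat(iso_date)
    return parsed.year * 10000 + parsed.month * 100 + parsed.day

def regex_hit(iso_date):
    """Option (a): the whole date string must match the pattern."""
    return PATTERN.fullmatch(iso_date) is not None

def int_hit(iso_date, start=19900112, end=19900324):
    """Option (b): numeric comparison on the 8-digit form."""
    return start <= date_to_int(iso_date) <= end

def all_days(year):
    """Yield every calendar day of the given year in ISO format."""
    day = date(year, 1, 1)
    while day.year == year:
        yield day.isoformat()
        day += timedelta(days=1)

days = list(all_days(1990))
regex_hits = [d for d in days if regex_hit(d)]
int_hits = [d for d in days if int_hit(d)]

print(regex_hits == int_hits)                            # True
print(regex_hits[0], regex_hits[-1], len(regex_hits))    # 1990-01-12 1990-03-24 72
print(date_to_int("2018-02-22"))                         # 20180222
```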

> 5) When I press 'Show tags' button I get
> 2012_ нче_ елда_ республикада_ 55_ мең_ 839_ бала_ дөньяга_ килгән_ ._
> but no tags.

That's because CQPweb failed to do proper HTML-escaping for the annotation strings (which is not only inconvenient but also a security risk).

        @Andrew: has this bug been fixed in the latest CQPweb code?

I've been bitten by similar issues before and would recommend avoiding HTML metacharacters (and other funny things) in annotation strings.  Better recode to something like

        n:sg:px3sp:nom

or even

        |n|sg|px3sp|nom|

so you can use the "contains" operator in searches.
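
A rough Python sketch of such a recoding step, to be run on the input before cwb-encode. It assumes the existing annotations look like `<n><sg><px3sp><nom>` (inferred from this thread); the function names are hypothetical, and writing an empty feature set as a single "|" is also an assumption here.

```python
import re

# One feature per pair of angle brackets, e.g. "<n><sg><px3sp><nom>".
TAG = re.compile(r"<([^<>]+)>")

def recode_joined(annotation, joiner=":"):
    """'<n><sg>' -> 'n:sg' (plain string with a joiner character)."""
    return joiner.join(TAG.findall(annotation))

def recode_feature_set(annotation):
    """'<n><sg>' -> '|n|sg|' (feature-set notation)."""
    features = TAG.findall(annotation)
    if not features:
        return "|"  # assumption: empty set written as a single "|"
    return "|" + "|".join(features) + "|"

print(recode_joined("<n><sg><px3sp><nom>"))       # n:sg:px3sp:nom
print(recode_feature_set("<n><sg><px3sp><nom>"))  # |n|sg|px3sp|nom|
```

With the second form, and the attribute treated as a feature set, a search for nominative forms could then be written as [pos contains "nom"] rather than with a regular expression.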

> I think it is maybe because I didn't replace "<" and ">" in my morphological tags to their XML entities yet. Please, correct me if I'm wrong.

That won't help!  With -x, cwb-encode will decode the XML entities in your input file and you'll end up with < and > in the indexed corpus.  You could encode without the -x flag, but then your annotation strings will be

        &lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;

which happens to display nicely only until HTML escaping in CQPweb is fixed – and you will have to search for

        [pos = ".*&lt;nom&gt;.*"]

instead of

        [pos = ".*<nom>.*"]
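
The round trip can be illustrated in Python. This is only an analogy: `html.unescape` stands in for the entity decoding that cwb-encode performs with -x, and `html.escape` for the escaping that a Web front end would need to apply before display.

```python
import html

# Annotation as it would appear in an XML-escaped input file.
raw = "&lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;"

# Decoding the entities gives back the literal angle brackets.
decoded = html.unescape(raw)
print(decoded)             # <n><sg><px3sp><nom>

# A browser would treat "<n>" etc. as unknown HTML tags and render nothing,
# unless the string is escaped again before being put into the page.
escaped = html.escape(decoded)
print(escaped == raw)      # True
```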

> 7) I also saw the button 'Export corpus -> Export whole corpus'. Does that mean that users can download the whole corpus? Is it possible to turn it off somehow?

AFAIK, only users with the "full access privilege" are allowed to download a corpus.  So if you want to disable downloads, simply keep to "normal access".


Best,
Stefan

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb
