[CWB] Escape "<" and ">" symbols

Stefan Evert stefanML at collocations.de
Thu Feb 22 09:52:24 CET 2018


Dear Mansur,

Most of the remaining issues are related to CQPweb, so Andrew will be in a much better position to answer them and help you with the debugging.  Some of them are clearly (mis-)configuration issues, e.g. the failure to locate the CEQL backend that is part of CQPweb or the failure to run CQP.

Are you working with an up-to-date version of CQPweb checked out from the SVN repository?


> 3) After rebooting computer any search does not work at all:
> ERROR: CQP backend startup failed; the reported CQP version [] could not be parsed.
> But from the command line I can perform search with 'cqp -e' and it seems to be working, at least I can see search results.

This suggests that you have CQP installed, but in a "private" path that's only visible to your user account and not to the Web server running CQPweb.  You may also need to configure CQPweb and set appropriate paths there.

> 4) Is it possible to choose ranges of periods in search according to the 'date'?
> <text id="" date=?????>

I think Andrew is working on support for date attributes in CQPweb.

In plain CQP, there are two ways of doing date searches:

a) The reasonable way: Store your dates in a simple standard format – I prefer ISO YYYY-MM-DD, so alphabetical and chronological sort order are the same – and then construct regular expressions for the date ranges you need, e.g. in the global constraint of a CQP query:

	… :: match.text_date = "2011-03.*";  # anything in March 2011

	… :: match.text_date = "1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))";  # 12 Jan 1990 .. 24 Mar 1990

b) The "I'm a Unix hacker" way: convert your dates to 32-bit integers and use numeric comparisons.  The obvious choice would be consecutive day numbers (or even seconds, as in Unix timestamps), but conversion to and from human-readable dates would be complicated.  However, you could encode the ISO format above _without_ the hyphens to get 8-digit numbers, e.g.

	<text id="…" date="20180222">

and then cast to integers for numerical comparisons:

	… :: int(match.text_date) >= 19900112 & int(match.text_date) <= 19900324;

Nice trick, isn't it?
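
As a quick sanity check outside CQP, the following Python sketch confirms that the regular expression from (a) and the integer comparison from (b) select exactly the same days of 1990.  It assumes that Python's re.fullmatch is a fair stand-in for CQP's whole-string regex matching; the helper names as_int and days_of_year are made up for this illustration.

```python
import re
from datetime import date, timedelta

# Pattern copied from approach (a). CQP matches a regex against the whole
# attribute value, so re.fullmatch is used as the closest Python equivalent.
RANGE_RE = re.compile(r"1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))")


def as_int(day):
    """Encode a date as the 8-digit integer YYYYMMDD, as in approach (b)."""
    return day.year * 10000 + day.month * 100 + day.day


def days_of_year(year):
    """Yield every calendar day of the given year."""
    day = date(year, 1, 1)
    while day.year == year:
        yield day
        day += timedelta(days=1)


# Days selected by the regular expression on the ISO string.
by_regex = [d for d in days_of_year(1990) if RANGE_RE.fullmatch(d.isoformat())]
# Days selected by the numeric comparison on the hyphen-free encoding.
by_int = [d for d in days_of_year(1990) if 19900112 <= as_int(d) <= 19900324]

print(by_regex == by_int)                        # True
print(by_regex[0], by_regex[-1], len(by_regex))  # 1990-01-12 1990-03-24 72
```

A check like this is cheap to run whenever a hand-written date regex gets more complicated than a single month.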

> 5) When I press 'Show tags' button I get
> 2012_ нче_ елда_ республикада_ 55_ мең_ 839_ бала_ дөньяга_ килгән_ ._
> but no tags.

That's because CQPweb failed to do proper HTML-escaping for the annotation strings (which is not only inconvenient but also a security risk).

	@Andrew: has this bug been fixed in the latest CQPweb code?

I've been bitten by similar issues before and would recommend avoiding HTML metacharacters (and other funny things) in annotation strings.  Better recode to something like

	n:sg:px3sp:nom

or even

	|n|sg|px3sp|nom|

so you can use the "contains" operator in searches.
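
The recoding could be done with a small preprocessing script before running cwb-encode.  The Python sketch below is only an illustration: it assumes that the current annotation strings look like <n><sg><px3sp><nom>, and the function name recode and its parameters are hypothetical.

```python
import re

# One angle-bracketed tag such as "<sg>"; the group captures the bare label.
TAG = re.compile(r"<([^<>]+)>")


def recode(annotation, sep=":", enclose=False):
    """Rewrite "<n><sg>..." as "n:sg:..." or, with enclose=True, "|n|sg|...|"."""
    joined = sep.join(TAG.findall(annotation))
    return f"{sep}{joined}{sep}" if enclose else joined


print(recode("<n><sg><px3sp><nom>"))                         # n:sg:px3sp:nom
print(recode("<n><sg><px3sp><nom>", sep="|", enclose=True))  # |n|sg|px3sp|nom|
```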

> I think it is maybe because I didn't replace "<" and ">" in my morphological tags to their XML entities yet. Please, correct me if I'm wrong.

That won't help!  With -x, cwb-encode will decode the XML entities in your input file and you'll end up with < and > in the indexed corpus.  You could encode without the -x flag, but then your annotation strings will be

	&lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;

which displays nicely only as long as the HTML-escaping bug in CQPweb remains unfixed – and you will have to search for

	[pos = ".*&lt;nom&gt;.*"]

instead of

	[pos = ".*<nom>.*"]
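
The difference between the two situations can be illustrated in Python.  This is not how cwb-encode or CQPweb are implemented; it merely uses the standard html module to show what decoding entities (as with -x) and escaping for HTML display do to such a string.

```python
import html

# Annotation string as it would appear in an XML-escaped input file.
raw = "&lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;"

# Decoding the entities yields literal angle brackets, which is what ends up
# in the index when entities are decoded at encoding time.
decoded = html.unescape(raw)
print(decoded)  # <n><sg><px3sp><nom>

# A web front end has to escape the string again before putting it into a
# page; otherwise the browser treats the tags as (unknown) HTML elements.
escaped = html.escape(decoded)
print(escaped == raw)  # True
```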

> 7) I also saw the button 'Export corpus -> Export whole corpus'. Does that mean that users can download the whole corpus? Is it possible to turn it off somehow?

AFAIK, only users with the "full access privilege" are allowed to download a corpus.  So if you want to disable downloads, simply keep to "normal access".


Best,
Stefan


