[CWB] Escape "<" and ">" symbols

mansur 6688000 at gmail.com
Mon Mar 5 13:45:32 CET 2018


Hello, Stefan, Andrew and others!!!

You advised using a tagging style like:

n:sg:px3sp:nom
or
n|sg|px3sp|nom

Is there any particular reason to use ":" or "|" instead of "<" or ">"? Is it
possible to use "," (comma)? What do you usually use in your projects?

Thank you!
With best wishes,
Mansur


On 22 February 2018 at 11:52, Stefan Evert <stefanML at collocations.de> wrote:

> Dear Mansur,
>
> most of the remaining issues are related to CQPweb, so Andrew will be in a
> much better position to answer them and help you with the debugging.  Some
> of them are clearly (mis-)configuration issues, e.g. the failure to locate
> the CEQL backend that is part of CQPweb or the failure to run CQP.
>
> Are you working with an up-to-date version of CQPweb checked out from the
> SVN repository?
>
>
> > 3) After rebooting the computer, no search works at all:
> > ERROR: CQP backend startup failed; the reported CQP version [] could not
> > be parsed.
> > But from the command line I can perform searches with 'cqp -e' and it seems
> > to be working, at least I can see search results.
>
> This suggests that you have CQP installed, but in a "private" path that's
> only visible to your user account and not to the Web server running
> CQPweb.  You may also need to configure CQPweb and set appropriate paths
> there.
>
> > 4) Is it possible to choose ranges of periods in search according to the
> > 'date'?
> > <text id="" date=?????>
>
> I think Andrew is working on support for date attributes in CQPweb.
>
> In plain CQP, there are two ways of doing date searches:
>
> a) The reasonable way: Store your dates in a simple standard format – I
> prefer ISO YYYY-MM-DD, so alphabetical and chronological sort order are the
> same – and then construct regular expressions for the date ranges you
> need, e.g. in the global constraint of a CQP query:
>
>         … :: match.text_date = "2011-03.*";  # anything in March 2011
>
>         … :: match.text_date = "1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))";  # 12 Jan 1990 .. 24 Mar 1990
>
> b) The "I'm a Unix hacker" way: convert your dates to 32-bit integers and
> use numeric comparisons.  The obvious choice would be consecutive numbers
> for days (or even seconds as in Unix timestamps), but conversion from/to
> human-readable dates will be complicated.  However, you could encode the
> ISO format above _without_ the hyphens to get 8-digit numbers, e.g.
>
>         <text id="…" date="20180222">
>
> and then cast to integers for numerical comparisons:
>
>         … :: int(match.text_date) >= 19900112 & int(match.text_date) <= 19900324;
>
> Nice trick, isn't it?
>
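As a quick sanity check on the two approaches above, here is a minimal Python sketch (not CQP itself) that runs the regular expression from (a) and the integer comparison from (b) over every calendar day from 1989 to 1991 and lists the days on which they disagree. It assumes that CQP matches a regular expression against the whole attribute value, which is emulated here with re.fullmatch; the function names are made up for illustration.

```python
import re
from datetime import date, timedelta

# Regular expression from approach (a): 12 Jan 1990 .. 24 Mar 1990
RANGE_RE = re.compile(r"1990-(01-(1[2-9]|[23]\d)|02-.*|03-([0-1]\d|2[0-4]))")


def in_range_regex(iso_date: str) -> bool:
    """Approach (a): whole-string regex match on a YYYY-MM-DD value."""
    return RANGE_RE.fullmatch(iso_date) is not None


def in_range_int(iso_date: str) -> bool:
    """Approach (b): drop the hyphens and compare as 8-digit integers."""
    number = int(iso_date.replace("-", ""))
    return 19900112 <= number <= 19900324


def all_days(start: date, end: date):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def disagreements() -> list:
    """Return the ISO dates on which the two approaches give different answers."""
    return [
        day.isoformat()
        for day in all_days(date(1989, 1, 1), date(1991, 12, 31))
        if in_range_regex(day.isoformat()) != in_range_int(day.isoformat())
    ]


if __name__ == "__main__":
    print(disagreements())
```

An empty list means both formulations select the same days. The integer version is much easier to adapt to a different range, since only the two bounds change.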
> > 5) When I press 'Show tags' button I get
> > 2012_ нче_ елда_ республикада_ 55_ мең_ 839_ бала_ дөньяга_ килгән_ ._
> > but no tags.
>
> That's because CQPweb failed to do proper HTML-escaping for the annotation
> strings (which is not only inconvenient but also a security risk).
>
>         @Andrew: has this bug been fixed in the latest CQPweb code?
>
> I've been bitten by similar issues before and would recommend avoiding
> HTML metacharacters (and other funny things) in annotation strings.  Better
> recode to something like
>
>         n:sg:px3sp:nom
>
> or even
>
>         |n|sg|px3sp|nom|
>
> so you can use the "contains" operator in searches.
>
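A minimal sketch of such a recoding step, assuming the current tags look like <n><sg><px3sp><nom> (as the pos queries in this thread suggest) and that the recoding is done by a preprocessing script before the corpus is encoded. The function name and parameters are illustrative only and not part of CWB.

```python
import re

# One feature per pair of angle brackets, e.g. "<n><sg>" -> ["n", "sg"]
TAG_RE = re.compile(r"<([^<>]+)>")


def recode_tags(annotation: str, sep: str = ":", enclose: bool = False) -> str:
    """Rewrite '<a><b><c>' as 'a:b:c' or, with enclose=True, as '|a|b|c|'."""
    features = TAG_RE.findall(annotation)
    if not features:
        # Leave values without angle-bracket tags untouched
        return annotation
    joined = sep.join(features)
    return f"{sep}{joined}{sep}" if enclose else joined


if __name__ == "__main__":
    print(recode_tags("<n><sg><px3sp><nom>"))
    print(recode_tags("<n><sg><px3sp><nom>", sep="|", enclose=True))
```

The second call produces the form with leading and trailing "|", which is the variant recommended above for use with the "contains" operator.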
> > I think it is maybe because I didn't replace "<" and ">" in my
> > morphological tags to their XML entities yet. Please, correct me if I'm
> > wrong.
>
> That won't help!  With -x, cwb-encode will decode the XML entities in your
> input file and you'll end up with < and > in the indexed corpus.  You could
> encode without the -x flag, but then your annotation strings will be
>
>         &lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;
>
> which happens to display nicely only until HTML escaping in CQPweb is
> fixed – and you will have to search for
>
>         [pos = ".*&lt;nom&gt;.*"]
>
> instead of
>
>         [pos = ".*<nom>.*"]
>
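To make the difference between the two encodings concrete, here is a small Python sketch using the standard html module. It only mimics what entity decoding and HTML escaping do to the strings; it is not how cwb-encode or CQPweb are implemented, and the function names are made up for illustration.

```python
import html

# Annotation string as it would appear in the input file with XML entities
RAW = "&lt;n&gt;&lt;sg&gt;&lt;px3sp&gt;&lt;nom&gt;"


def decode_entities(value: str) -> str:
    """Mimic entity decoding: '&lt;n&gt;' becomes '<n>'."""
    return html.unescape(value)


def escape_for_html(value: str) -> str:
    """Mimic a display layer that escapes HTML metacharacters properly."""
    return html.escape(value)


if __name__ == "__main__":
    decoded = decode_entities(RAW)
    print(decoded)                   # stored value if entities are decoded
    print(escape_for_html(decoded))  # what a correct display layer sends to the browser
    print(escape_for_html(RAW))      # undecoded value after correct escaping
```

The last line shows why keeping the entities in the corpus is only a temporary workaround: once the display layer escapes properly, the "&" itself is escaped and the browser shows the literal entity text instead of angle brackets.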
> > 7) I also saw the button 'Export corpus -> Export whole corpus'. Does
> > that mean that users can download the whole corpus? Is it possible to turn
> > it off somehow?
>
> AFAIK, only users with the "full access privilege" are allowed to download
> a corpus.  So if you want to disable downloads, simply keep to "normal
> access".
>
>
> Best,
> Stefan
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>