[CWB] Finding bad non-category-handle values

Scott Sadowsky ssadowsky at gmail.com
Sun Sep 25 04:25:48 CEST 2016


On Sat, Sep 24, 2016 at 1:25 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> One possibility, given no results for a non-handle character, is that the
> bad values are empty strings – e.g. if there exist in the original data
> instances of <*text*> that did not have a *source*.
>
> [...]
>
> Incidentally, I’ve just checked in an amendment to the code which fixes a
> bug (identified while answering your question!) where a too-low maximum
> handle length was imposed, and also changes the “because there are
> non-category-handle values in the CWB index” error message to actually say
> what the bad value was. So, if you are using the bleeding edge code, you
> can *svn up*, try to change datatype again, and find out what the problem
> is that way.
>

Great! As a result of your update, I've figured out that the problem is one
or more empty values:

The datatype of text_source cannot be changed to [classification], because
there are non-category-handle values in the CWB index; the first non-handle
value found in the index is [] .


Also, the following search returns many hits:

<text_source="">[];
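For anyone puzzled as to why the earlier ".*[^a-zA-Z0-9_].*" query found
nothing, here is a minimal sketch in Python (not CQPweb code). It assumes a
valid handle is a non-empty run of letters, digits and underscores, and it
ignores the maximum handle length mentioned above, since that limit is not
given in this thread. An empty string contains no character at all, so a
pattern that requires at least one bad character can never match it:

```python
import re

# Same pattern as the earlier CQP query: "contains at least one character
# that is not a letter, digit or underscore".
HAS_BAD_CHAR = re.compile(r".*[^a-zA-Z0-9_].*", re.DOTALL)

# ASSUMPTION: a valid category handle is one or more of [a-zA-Z0-9_].
# Any maximum length CQPweb imposes is deliberately left out here.
IS_HANDLE = re.compile(r"[a-zA-Z0-9_]+")


def diagnose(value: str) -> str:
    """Classify a value as 'ok', 'bad character' or 'empty'."""
    if IS_HANDLE.fullmatch(value):
        return "ok"
    if HAS_BAD_CHAR.fullmatch(value):
        return "bad character"
    # Neither a handle nor a string with a bad character: only "" is left.
    return "empty"


# Hypothetical example values, not taken from the real corpus.
for value in ["news_2016", "", "el país"]:
    print(repr(value), "->", diagnose(value))
```

The empty string falls through both tests, which is consistent with the
bad-character query returning 0 hits while CQPweb still rejects the
attribute.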


So, for the benefit of anyone else who runs into this, I did the following
query:

A = <text_source="">[];


And then I performed various and sundry queries like this until I was able
to figure out what set of texts caused the problem:

tabulate A match text_whatever;
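If the tabulate output is long, it can help to tally it outside CQP. The
following is only a sketch in Python: it assumes the output has been saved or
pasted as plain text with one line per match and TAB-separated columns, and
the identifiers in the sample are made up for illustration.

```python
from collections import Counter


def tally_first_column(tabulate_output: str) -> Counter:
    """Count how often each value occurs in the first TAB-separated column."""
    counts = Counter()
    for line in tabulate_output.splitlines():
        # An empty line is kept on purpose: it stands for an empty value.
        first = line.split("\t")[0]
        counts[first] += 1
    return counts


# Hypothetical output of a query like "tabulate A match text_whatever;".
sample = "text_001\ntext_001\ntext_007\n"

for value, n in tally_first_column(sample).most_common():
    print(value, n)
```

Each distinct value is listed once with its number of matches, which makes
the set of affected texts easier to see than scrolling through raw output.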

The odd thing is that none of the texts identified this way had an empty
value for the field that was causing the problem. Many of the values were
fairly long, and some had two consecutive underscores as a separator. I
wonder whether in either of these cases the value would be changed to an
empty string.

Cheers,
Scott



> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Scott Sadowsky
>
> *Sent:* 24 September 2016 16:48
> *To:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] Finding bad non-category-handle values
>
>
>
> On Sat, Sep 24, 2016 at 3:07 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> Hi Andrew,
>
>
>
> Try a CQP query for
>
>
>
> <whichever_att=".*[^a-zA-Z0-9_].*">[]
>
>
>
> The s-attribute in question is *text_source*, so I ran the following in
> CQP:
>
>
>
> <text_source=".*[^a-zA-Z0-9_].*">[]
>
>
>
> And it produced 0 hits. Same happens with this:
>
>
>
> <text_source=".*[^a-z0-9_].*">[]
>
>
>
> This would seem to indicate that all the values of *text_source* are
> licit, but CQPweb disagrees.
>
>
>
>
>
> and then  tabulate *match whichever_att* ?
>
>
>
> This just gives me an error:
>
>
>
> tabulate match source_text ?;
>
> CQP Error:
>
>             CQP Syntax Error: syntax error, unexpected FIELD, expecting
> ID or NQRID
>
>             tabulate match  <--
>
> Synchronizing to end of line ...
>
>
>
> Cheers,
>
> Scott
>
>
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Scott Sadowsky
> *Sent:* 24 September 2016 04:10
> *To:* Open source development of the Corpus WorkBench
> *Cc:* Open source development of the Corpus WorkBench
> *Subject:* [CWB] Finding bad non-category-handle values
>
>
>
> I'm attempting to import a corpus into CQPweb, and when I try to change
> one of the s-attributes from "free text" to "classification", I get the
> following error:
>
>
>
> *The datatype of text_source cannot be changed to [classification],
> because there are non-category-handle values in the CWB index.*
>
>
>
> I understand this to mean that in one or more values of text_source,
> there's a character that's not a-z or _. My question is simply how do I get
> a list of these values in order to figure out which one is causing the
> problem and then fix it?
>
>
>
> Thanks in advance!
>
> Scott
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb