[CWB] Finding bad non-category-handle values

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Sep 25 04:34:32 CEST 2016


>> Many of them were fairly long, and some had two underscores together as a separator. I wonder if in either of these cases the values would be changed to an empty string.

No, definitely not. There must be some other issue explaining why the value in the s-attribute index is an empty string. Perhaps something in the format of your input file?

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 25 September 2016 03:26
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Finding bad non-category-handle values

On Sat, Sep 24, 2016 at 1:25 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

> One possibility, given no results for a non-handle character, is that the bad values are empty strings – e.g. if there exist in the original data instances of <text> that did not have a source.
[...]
> Incidentally, I’ve just checked in an amendment to the code which fixes a bug (identified while answering your question!) where a too-low maximum handle length was imposed, and also changes the “because there are non-category-handle values in the CWB index” error message to actually say what the bad value was. So, if you are using the bleeding edge code, you can svn up, try to change datatype again, and find out what the problem is that way.

Great! As a result of your update, I've figured out that the problem is one or more empty values:

The datatype of text_source cannot be changed to [classification], because there are non-category-handle values in the CWB index; the first non-handle value found in the index is [] .

Also, the following search returns many hits:

<text_source="">[];

So, for the benefit of anyone else who runs into this, I did the following query:

A = <text_source="">[];

And then I performed various and sundry queries like this until I was able to figure out what set of texts caused the problem:

tabulate A match text_whatever;
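For anyone who prefers to check the values outside CQP, here is a minimal sketch of the same hunt in Python. It assumes the s-attribute has been dumped as tab-separated lines of the form start, end, value (which is what I understand cwb-s-decode to print for an attribute with values; treat that format as an assumption and check it against your own output). The function name find_bad_regions is made up for illustration, and the handle pattern (letters, digits and underscore only) is taken from the query suggested earlier in this thread; the maximum handle length that CQPweb also enforces is not checked here.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed definition of a category handle: one or more letters, digits
# or underscores. An empty string therefore fails the check.
HANDLE = re.compile(r"[A-Za-z0-9_]+")


def find_bad_regions(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, value) for every region whose value is not a handle.

    Each input line is assumed to look like "start<TAB>end<TAB>value".
    A missing third field is treated as an empty value.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        start, end = int(fields[0]), int(fields[1])
        value = fields[2] if len(fields) > 2 else ""
        if HANDLE.fullmatch(value) is None:
            yield start, end, value


# Hypothetical sample data: one good value, one empty, one with a space.
sample = ["0\t9\tnews_2016\n", "10\t19\t\n", "20\t29\tbad value\n"]
for start, end, value in find_bad_regions(sample):
    print(f"{start}\t{end}\t[{value}]")
```

To run it on a real corpus, pass sys.stdin to find_bad_regions instead of the sample list and pipe the decoded attribute into the script. The corpus positions it reports can then be used to locate the offending texts.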

The odd thing is that none of the tagged texts had empty values for the field that was causing the problem. Many of them were fairly long, and some had two underscores together as a separator. I wonder if in either of these cases the values would be changed to an empty string.

Cheers,
Scott



From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 24 September 2016 16:48
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Finding bad non-category-handle values

On Sat, Sep 24, 2016 at 3:07 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

Hi Andrew,

> Try a CQP query for
>
> <whichever_att=".*[^a-zA-Z0-9_].*">[]

The s-attribute in question is text_source, so I ran the following in CQP:

<text_source=".*[^a-zA-Z0-9_].*">[]

And it produced 0 hits. Same happens with this:

<text_source=".*[^a-z0-9_].*">[]

This would seem to indicate that all the values of text_source are licit, but CQPweb disagrees.
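A zero-hit result from that query does not by itself show that every value is a valid handle. The pattern needs at least one character to match, so an empty value is never flagged by it. The sketch below illustrates this in Python, on the assumption that CQP matches the regular expression against the whole attribute value (hence fullmatch). The function names are invented for illustration, and "valid handle" is assumed to mean one or more letters, digits or underscores.

```python
import re

# Same pattern as in the CQP query: some character outside the handle set.
NON_HANDLE_CHAR = re.compile(r".*[^a-zA-Z0-9_].*")

# Assumed definition of a valid handle: at least one allowed character.
VALID_HANDLE = re.compile(r"[a-zA-Z0-9_]+")


def flagged_by_query(value: str) -> bool:
    """True if the negative character-class pattern matches the whole value."""
    return NON_HANDLE_CHAR.fullmatch(value) is not None


def is_valid_handle(value: str) -> bool:
    """True if the value consists only of handle characters and is non-empty."""
    return VALID_HANDLE.fullmatch(value) is not None


# Hypothetical example values.
for value in ["news_2016", "bad value", ""]:
    print(repr(value), flagged_by_query(value), is_valid_handle(value))
```

The empty string is the one case where both functions return False: the query does not flag it, yet it is not a usable handle.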


> and then tabulate match whichever_att ?

This just gives me an error:

tabulate match source_text ?;
CQP Error:
            CQP Syntax Error: syntax error, unexpected FIELD, expecting ID or NQRID
            tabulate match  <--
Synchronizing to end of line ...

Cheers,
Scott


From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 24 September 2016 04:10
To: Open source development of the Corpus WorkBench
Cc: Open source development of the Corpus WorkBench
Subject: [CWB] Finding bad non-category-handle values

I'm attempting to import a corpus into CQPweb, and when I try to change one of the s-attributes from "free text" to "classification", I get the following error:

The datatype of text_source cannot be changed to [classification], because there are non-category-handle values in the CWB index.

I understand this to mean that in one or more values of text_source, there's a character that's not a-z or _. My question is simply how do I get a list of these values in order to figure out which one is causing the problem and then fix it?

Thanks in advance!
Scott

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

