[CWB] frequency list and change of classification fail in CQPweb

Wed Mar 1 00:05:23 CET 2017

Hi Stéphie, Thilo, & everyone

Thilo’s answer covered the main point of Stéphie’s query, but I have some (belated!) follow-up remarks on these two posts from last week, to clarify and expand on the issues raised.

Re Stéphie’s query 1.

>> Error # 1062: Duplicate entry '' for key 'PRIMARY'

Thilo accurately points out that the “duplicate entry” message means there is a non-unique text_id. But note that the error actually tells you what the duplicate value is – it is definitely the empty string – as shown by the two quote marks with nothing between them in the error message quoted above.

So, Stéphie, your corpus contains AT LEAST two <text> elements that either have no id, or where the id is an empty string. This is not valid in CQPweb. Every text must have an ID, and the ID must be unique. You need to delete your corpus, fix the input files, then start again.
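One way to find the offending <text> elements before re-indexing is a quick scan of the input file. What follows is only a rough sketch in Python, not anything that ships with CWB or CQPweb. The function name find_bad_text_ids is made up for illustration, and it assumes that every <text ...> start tag sits on a line of its own and that the id attribute uses double quotes.

```python
import re
from collections import Counter

# Assumption: each <text ...> start tag is on its own line, as is usual in .vrt files.
TEXT_TAG = re.compile(r'^<text(\s[^>]*)?>\s*$')
# Assumption: the id attribute is written as id="..." with double quotes.
ID_ATTR = re.compile(r'\sid="([^"]*)"')


def find_bad_text_ids(lines):
    """Scan an iterable of lines.

    Returns a pair: the line numbers of <text> tags whose id is missing or
    empty, and a sorted list of id values that occur more than once.
    """
    counts = Counter()
    missing_or_empty = []
    for lineno, line in enumerate(lines, start=1):
        tag = TEXT_TAG.match(line)
        if tag is None:
            continue  # not a <text> start tag
        attrs = tag.group(1) or ""
        found = ID_ATTR.search(attrs)
        if found is None or found.group(1) == "":
            missing_or_empty.append(lineno)
        else:
            counts[found.group(1)] += 1
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return missing_or_empty, duplicates
```

To run it on a real file, open the file with encoding="utf-8" and pass the file object straight in; it is read line by line, so a large corpus file should not need to fit in memory.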

Re Stéphie’s query 2.

>> The datatype of text_year cannot be changed to [classification], because there are non-category-handle values in the CWB index; the first non-handle value found in the index is [] .

Once again, the error message tells you what the first non-handle value is: it’s an empty string (nothing between the [] delimiters = empty string). So the corpus has at least one <text> with no year= , or with a year= whose value is empty.

The other thing that can cause a value to be judged non-handle is a bad character. The “good” characters are A-Z, a-z, 0-9 and underscore. The “bad” characters are everything else. (Note: these are the same rules as for variable names in C/C++/Java/JavaScript/etc.; in regex, \w = good character and \W = bad character). This is explained in the admin manual on p 50 / 51: http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf But, as the error messages indicate, we are dealing in these particular cases with empty strings rather than non-handle characters.
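To make the character rule concrete, here is a minimal Python sketch of a check you could apply to attribute values before indexing. It is my own illustration rather than CQPweb’s actual code, and the name is_handle is hypothetical. It covers only the two conditions discussed here (the allowed characters, and the value not being empty); any further constraints CQPweb may apply are not checked.

```python
import re

# ASCII letters, digits and underscore only. The "+" requires at least one
# character, so the empty string is rejected too.
HANDLE = re.compile(r'[A-Za-z0-9_]+')


def is_handle(value):
    """True if value is non-empty and contains only A-Z, a-z, 0-9 and underscore."""
    return HANDLE.fullmatch(value) is not None
```

So a value such as 1998 passes, whereas an empty value, or one containing a hyphen or an accented letter, does not.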

So, re: Thilo’s comment,

>> For the feature wishlist: for debugging a corpus file it would be very helpful if CQPWeb gave hints as to where in a file it encountered errors – e.g. the first duplicate text id, the encountered non-category-handle value or an error position when building metadata from xml. But I should also say that I have no clue as to the amount of work this would cause or the challenges involved...

... I have in fact already done this for XML datatype changes and non-handles. I’ve not done it for metadata input files, but that is definitely on the list. (Basically it will involve a first pass over the input file to validate it against the declared metadata structure. But installation is a two-step procedure already for other reasons, so this is eminently doable. Alas, time, time!)

HOWEVER, it’s worth noting that because “text” is a special entity in CQPweb, with its own metadata separate from the index (for counterintuitive historical reasons), changing the datatype of text_year to classification won’t actually do what you want it to do. What you probably want is to import the text metadata table from XML, which will create a text metadata field called (in this case) “year” from the text_year s-attribute.

I know this is not exactly the clearest thing – it does keep tripping people up, but I can’t think of a simple way to explain it!

RE Stéphie’s debugging.

>>>>
Steps undertaken so far:
- checked similar posts in mailing list (Scott Sadowsky (Sep 2016) had a similar problem Nr. 2. but no definite solution was given)
- I wanted to check my database in a similar way
$ <text_year="">[];
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '<text_year="">[]' at line 1
 <<<

You’ve misread that post – the query <text_year="">[];  is not a query to run in the database, it’s a query to run in command-line CQP. It will work there. It won’t work in MySQL, as you’ve found.

>>>>
Steps undertaken so far:
- checked XML validity of input .vrt-file: xmllint shows 2 type of errors which should not be an issue (error ‘huge text node’ and multiple errors for …&lt/…&gt not followed by ;)
 <<<

A vrt file is only pseudo-XML, as it is possible to use tag structures that are illegal XML but perfectly acceptable to CWB. (An example: XML demands perfect nesting and a root element; CWB doesn’t, there need be no root and nesting is irrelevant). Ergo, xmllint is not really a suitable tool.

However, one thing it has picked out that I would heartily recommend fixing is those entities. CQPweb always uses CWB’s “(pseudo-)XML-aware” encoding mode, which interprets the built-in entities back to their underlying characters. So &lt and &gt DEFINITELY ought to be properly terminated (as &lt; and &gt;)!
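If the unterminated entities are numerous, they can be repaired mechanically. The snippet below is only a sketch in Python, with a function name (terminate_lt_gt) invented for the purpose. It assumes that every &lt or &gt not followed by a semicolon is meant to be the built-in entity, so check a sample of the output by eye before re-indexing in case your data contains other strings that start the same way.

```python
import re

# Match "&lt" or "&gt" when the next character is not ";".
# Assumption: any such match is a truncated built-in entity.
UNTERMINATED = re.compile(r'&(lt|gt)(?!;)')


def terminate_lt_gt(text):
    """Add the missing semicolon to unterminated &lt / &gt entities."""
    return UNTERMINATED.sub(r'&\1;', text)
```

Entities that are already correctly terminated are left alone, so running the function twice over the same text gives the same result as running it once.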

Re Stéphie’s last question,

>>>- Also, I think it is strange that the XML elements <pb n=”NUMBER”></pb> (element between <text> … </text>) is classified by CWB as a ‘free text’ element while all other elements (e.g. footnote) are correctly not classified as such. Is this a further sign that something is messed up in my data file?

The CQPweb data types of the XML attributes in your corpus depend on what you declared at time of indexing in the “s-attributes” stage of the setup – CQPweb does not guess datatypes for you.

That said, “free text” is what you probably want for pb_n , isn’t it? The page number is certainly not a classification that could be used to segment the corpus; it would not make sense to say “find me all words within that part of the corpus where pb_n is equal to 21”. So free text is likely the right way to go here.

(Incidentally, “free text” just means that CQPweb makes no additional assumptions about the content, e.g. the extra assumptions that attach to the “classification” datatype are not made. You can of course apply whatever rules about the content you like for your own purposes, e.g. that n is a number.)

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Thilo Wiertz
Sent: 21 February 2017 15:56
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] frequency list and change of classification fail in CQPweb

Dear Stéphie,

I've struggled quite a bit with similar errors while preparing a corpus generated from online news articles. 1. is clearly a sign that there is a non-unique (or possibly empty) entry in one of the <text id=...> entries. (ID is the primary key for text identification so that's what "Duplicate entry '' for key 'PRIMARY'" is about. In my case there were a few empty string or 0 values from erroneous html to xml conversions.)

I've also encountered 2. without being able to find the supposedly non-classification entry. I remember I suspected some characters in meta tags to cause some trouble (e.g. '_' or '-' in category values?) and special/invalid characters in the text, but did not systematically check. Is it possible that the xmllint errors you find point to unclosed tags or "<" ">" characters in the text that may cause problems?

My solution was to rebuild the xml from scratch and run each string and element through checks and conditions that make absolutely sure that no empty, invalid or, in case of the text ids, duplicate values get written into the xml files. For the feature wishlist: for debugging a corpus file it would be very helpful if CQPWeb gave hints as to where in a file it encountered errors – e.g. the first duplicate text id, the encountered non-category-handle value or an error position when building metadata from xml. But I should also say that I have no clue as to the amount of work this would cause or the challenges involved...

Best,
Thilo

On 21.02.2017 at 14:40, Lehner Stéphanie (lehs) <lehs at zhaw.ch> wrote:

Dear CWB community

I am currently attempting to set up a corpus of about 50’000’000 tokens in CQPweb. Unfortunately, the following 2 commands both fail; as a CWB newbie, I am a bit at a loss as to how I can find and tackle the root cause of these issues.

1. I cannot create frequency lists: ‘Generate CWB text-position records’ fails with the following error message:

A MySQL query did not run successfully!
Original query: insert into ___temp_cqp_text_positions_for_bge_1875_2015_de (text_id, cqp_begin, cqp_end) VALUES ('BGE08965', 38695514, 38706960),('BGE08966', 38706961, 38707670),('BGE08967', 38707671, 38709579),('BGE08968', 38709580, 38711971),('BGE08969', 38711972, 38715398), [shortened, it keeps on listing every single following BGE i.e. text ID in the corpus] /* from User: lehs_admin | Function: populate_corpus_cqp_positions() | 2017-Feb-13 15:06:54 */
Error # 1062: Duplicate entry '' for key 'PRIMARY'

2. I would like to change the datatype of some XML to classification. I am aware that they need to meet handle criteria; also, I don’t think I have any empty

The datatype of text_year cannot be changed to [classification], because there are non-category-handle values in the CWB index; the first non-handle value found in the index is [] .

Infos:
CWB Version: Release 3.5 (Alpha)
CQPweb code: 3.2.26, Revision 924
Database: 3.2.25
Ubuntu X64 16.04
PHP 7 , apache2
(VirtualBox)

Steps undertaken so far:
- checked similar posts in mailing list (Scott Sadowsky (Sep 2016) had a similar problem Nr. 2. but no definite solution was given)
- I wanted to check my database in a similar way
$ <text_year="">[];
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '<text_year="">[]' at line 1
- mysqlcheck -c says OK to database

- checked XML validity of input .vrt-file: xmllint shows 2 type of errors which should not be an issue (error ‘huge text node’ and multiple errors for …&lt/…&gt not followed by ;)

- Also, I think it is strange that the XML elements <pb n=”NUMBER”></pb> (element between <text> … </text>) is classified by CWB as a ‘free text’ element while all other elements (e.g. footnote) are correctly not classified as such. Is this a further sign that something is messed up in my data file?

Thank you kindly for your support!
Best
Stéphie

***
output of cwb-describe-corpus:

============================================================
Corpus: BGE_1875_2015_DE
============================================================

description:
registry file:  /usr/local/cwb-3.4.10/share/cwb/registry/bge_1875_2015_de
home directory: /usr/local/corpora/bge_1875_2015_de/
info file:      /usr/local/corpora/bge_1875_2015_de/.info
encoding:       utf8
size (tokens):  49086350

  3 positional attributes:
      word            pos             lemma

22 structural attributes:
      body            p               pb              pb_n
      head            footnote        text            text_id
      text_author     text_title      text_source     text_page
      text_topics     text_subtopics  text_language   text_date
      text_description                text_type       text_file
      text_year       text_decade     text_url

  0 alignment  attributes:

