[CWB] frequency list and change of classification fail in CQPweb

Thilo Wiertz thilo.wiertz at geographie.uni-freiburg.de
Tue Feb 21 16:56:12 CET 2017


Dear Stéphie,

I struggled quite a bit with similar errors while preparing a corpus generated from online news articles. Error 1 is clearly a sign that one of the <text id=...> entries is non-unique (or possibly empty). The ID is the primary key for text identification, which is what "Duplicate entry '' for key 'PRIMARY'" refers to; the '' in the message suggests that the duplicated value is an empty string. In my case there were a few empty-string or 0 values resulting from erroneous HTML-to-XML conversions.

I also encountered error 2 without being able to find the supposedly non-classification entry. I remember suspecting that some characters in the metadata tags caused trouble (e.g. '_' or '-' in category values?), as well as special or invalid characters in the text, but I did not check systematically. Could the xmllint errors you found point to unclosed tags or to "<" / ">" characters in the text that may cause problems?
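A systematic check of the kind I never got around to could look roughly like the sketch below. It scans the <text ...> tags of a .vrt file and reports every value of one attribute that is missing, empty, or not a plain handle. Note the assumptions: I take "handle" to mean ASCII letters, digits and underscores only (please verify this against the CQPweb documentation), the function name find_non_handle_values is made up for illustration, and each <text> tag is expected on a single line with double-quoted attributes.

```python
import re

# Assumption: a category handle consists only of ASCII letters, digits and underscores.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")
# Matches an opening <text ...> tag and captures its attribute string.
TEXT_TAG_RE = re.compile(r"<text\b([^>]*)>")
# Matches name="value" pairs inside the tag (double-quoted values only).
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def find_non_handle_values(lines, attribute):
    """Yield (line_number, value) for each <text> tag whose `attribute`
    is missing, empty, or contains characters outside the assumed handle set."""
    for line_no, line in enumerate(lines, start=1):
        match = TEXT_TAG_RE.match(line.strip())
        if match is None:
            continue  # token line, closing tag, or a different element
        attrs = dict(ATTR_RE.findall(match.group(1)))
        value = attrs.get(attribute, "")  # a missing attribute counts as empty
        if HANDLE_RE.fullmatch(value) is None:
            yield line_no, value
```

Called with an open file handle (opened as UTF-8) and the attribute name as it appears in the tag, e.g. "year", this gives the line number of each suspicious value, so the first hit should correspond to the value CQPweb complains about.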

My solution was to rebuild the XML from scratch and run each string and element through checks that make absolutely sure that no empty, invalid or, in the case of the text IDs, duplicate values get written into the XML files. For the feature wishlist: for debugging a corpus file it would be very helpful if CQPweb gave hints as to where in a file it encountered errors, e.g. the first duplicate text ID, the non-category-handle value it encountered, or an error position when building metadata from XML. But I should also say that I have no idea how much work this would cause or what challenges are involved...
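To illustrate the text ID part of those checks, here is a minimal sketch that reports empty and duplicate IDs together with their line numbers. It is not the code I actually used; the function name check_text_ids is hypothetical, and it assumes one <text ...> tag per line with a double-quoted id attribute.

```python
import re
from collections import defaultdict

# Captures the value of the id attribute of an opening <text ...> tag.
TEXT_ID_RE = re.compile(r'<text\b[^>]*?\bid\s*=\s*"([^"]*)"')
# Recognises any opening <text ...> tag, with or without an id.
TEXT_OPEN_RE = re.compile(r"<text\b")


def check_text_ids(lines):
    """Return (empty, duplicates).

    empty      -- line numbers of <text> tags with a missing or empty id
    duplicates -- dict mapping each id used more than once to its line numbers
    """
    empty = []
    seen = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        stripped = line.strip()
        if TEXT_OPEN_RE.match(stripped) is None:
            continue  # not an opening <text> tag
        match = TEXT_ID_RE.match(stripped)
        text_id = match.group(1).strip() if match else ""
        if text_id == "":
            empty.append(line_no)
        else:
            seen[text_id].append(line_no)
    duplicates = {tid: nums for tid, nums in seen.items() if len(nums) > 1}
    return empty, duplicates
```

Running this over the .vrt file before indexing should show directly which lines would trigger the "Duplicate entry" error, instead of having to guess from the MySQL message.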

Best,
Thilo


> On 21.02.2017 at 14:40, Lehner Stéphanie (lehs) <lehs at zhaw.ch> wrote:
> 
> Dear CWB community
>  
> I am currently attempting to set up a corpus of about 50’000’000 tokens in CQPweb. Unfortunately, the following 2 commands both fail; as a CWB newbie, I am a bit at a loss as to how I can find and tackle the root cause of these issues.
>  
> 1. I cannot create frequency lists: ‘Generate CWB text-position records’ fails with the following error message:
>  
> A MySQL query did not run successfully!
> Original query: insert into ___temp_cqp_text_positions_for_bge_1875_2015_de (text_id, cqp_begin, cqp_end) VALUES ('BGE08965', 38695514, 38706960),('BGE08966', 38706961, 38707670),('BGE08967', 38707671, 38709579),('BGE08968', 38709580, 38711971),('BGE08969', 38711972, 38715398), [shortened, it keeps on listing every single following BGE i.e. text ID in the corpus] /* from User: lehs_admin | Function: populate_corpus_cqp_positions() | 2017-Feb-13 15:06:54 */
> Error # 1062: Duplicate entry '' for key 'PRIMARY'
>  
> 2. I would like to change the datatype of some XML to classification. I am aware that they need to meet handle criteria; also, I don’t think I have any empty
>  
> The datatype of text_year cannot be changed to [classification], because there are non-category-handle values in the CWB index; the first non-handle value found in the index is [] .
>  
> Infos: 
> CWB Version: Release 3.5 (Alpha)
> CQPweb code: 3.2.26, Revision 924
> Database: 3.2.25
> Ubuntu X64 16.04
> PHP 7 , apache2
> (VirtualBox)
>  
> Steps undertaken so far:
> - checked similar posts in mailing list (Scott Sadowsky (Sep 2016) had a similar problem Nr. 2. but no definite solution was given)
> - I wanted to check my database in a similar way
> $ <text_year="">[];
> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '<text_year="">[]' at line 1
> - mysqlcheck -c reports OK for the database
>  
> - checked XML validity of input .vrt-file: xmllint shows 2 types of errors which should not be an issue (error ‘huge text node’ and multiple errors for …&lt/…&gt not followed by ;)
>  
> - Also, I think it is strange that the XML element <pb n="NUMBER"></pb> (an element between <text> … </text>) is classified by CWB as a ‘free text’ element while all other elements (e.g. footnote) are correctly not classified as such. Is this a further sign that something is messed up in my data file?
>  
>  
> Thank you kindly for your support!
> Best
> Stéphie
>  
>  
> ***
> output of cwb-describe-corpus:
>  
> ============================================================
> Corpus: BGE_1875_2015_DE
> ============================================================
>  
> description:   
> registry file:  /usr/local/cwb-3.4.10/share/cwb/registry/bge_1875_2015_de
> home directory: /usr/local/corpora/bge_1875_2015_de/
> info file:      /usr/local/corpora/bge_1875_2015_de/.info
> encoding:       utf8
> size (tokens):  49086350
>  
>   3 positional attributes:
>       word            pos             lemma          
>  
> 22 structural attributes:
>       body            p               pb              pb_n           
>       head            footnote        text            text_id        
>       text_author     text_title      text_source     text_page      
>       text_topics     text_subtopics  text_language   text_date      
>       text_description                text_type       text_file      
>       text_year       text_decade     text_url       
>  
>   0 alignment  attributes:
>  
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it <mailto:CWB at sslmit.unibo.it>
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb <http://liste.sslmit.unibo.it/mailman/listinfo/cwb>