[CWB] frequency list and change of classification fail in CQPweb
Lehner Stéphanie (lehs)
lehs at zhaw.ch
Tue Feb 21 14:40:02 CET 2017
Dear CWB community
I am currently attempting to set up a corpus of about 50'000'000 tokens in CQPweb. Unfortunately, the following two operations both fail; as a CWB newbie, I am at a bit of a loss as to how to find and tackle the root cause of these issues.
1. I cannot create frequency lists: 'Generate CWB text-position records' fails with the following error message:
A MySQL query did not run successfully!
Original query: insert into ___temp_cqp_text_positions_for_bge_1875_2015_de (text_id, cqp_begin, cqp_end) VALUES ('BGE08965', 38695514, 38706960),('BGE08966', 38706961, 38707670),('BGE08967', 38707671, 38709579),('BGE08968', 38709580, 38711971),('BGE08969', 38711972, 38715398), [shortened, it keeps on listing every single following BGE i.e. text ID in the corpus] /* from User: lehs_admin | Function: populate_corpus_cqp_positions() | 2017-Feb-13 15:06:54 */
Error # 1062: Duplicate entry '' for key 'PRIMARY'
2. I would like to change the datatype of some XML attributes to classification. I am aware that their values need to meet the handle criteria; also, I don't think I have any empty values. Nevertheless, I get this message:
The datatype of text_year cannot be changed to [classification], because there are non-category-handle values in the CWB index; the first non-handle value found in the index is [] .
System information:
CWB Version: Release 3.5 (Alpha)
CQPweb code: 3.2.26, Revision 924
Database: 3.2.25
Ubuntu X64 16.04
PHP 7, apache2
(VirtualBox)
Steps undertaken so far:
- checked similar posts on the mailing list (Scott Sadowsky (Sep 2016) had a problem similar to no. 2, but no definite solution was given)
- I tried to check my database in a similar way, but the query fails:
$ <text_year="">[];
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '<text_year="">[]' at line 1
- mysqlcheck -c says OK to database
- checked the XML validity of the input .vrt file: xmllint reports two types of errors, which should not be an issue (the error 'huge text node' and multiple errors for ...</...> not followed by ;)
- Also, I find it strange that the XML element <pb n="NUMBER"></pb> (which occurs between <text> ... </text>) is classified by CWB as a 'free text' element, while all other elements (e.g. footnote) are correctly not classified as such. Is this a further sign that something is messed up in my data file?
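Both error messages mention an empty value ('' and []), so one way to rule out problems in the input is to scan the <text ...> start tags of the .vrt file directly. The following is only a minimal sketch in Python, not part of CWB or CQPweb. It assumes that the text-level attributes are written as id="..." and year="..." on the <text> tag, and that a "handle" means a non-empty string of ASCII letters, digits and underscores; the function name check_text_tags is made up for this illustration.

```python
import re
from collections import Counter

# Assumption: a handle is a non-empty run of ASCII letters, digits and underscores.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")
# Matches name="value" pairs inside a start tag.
ATTR_RE = re.compile(r'([\w:-]+)="([^"]*)"')


def check_text_tags(lines, attrs=("id", "year")):
    """Scan VRT lines and inspect every <text ...> start tag.

    Returns (problems, duplicates):
      problems   - list of (line number, attribute name, offending value)
                   for values that are missing, empty or not handle-like
      duplicates - sorted list of id values that occur more than once
    """
    problems = []
    id_counts = Counter()
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not (line.startswith("<text ") or line == "<text>"):
            continue  # token lines, other tags, closing tags
        found = dict(ATTR_RE.findall(line))
        id_counts[found.get("id", "")] += 1
        for name in attrs:
            value = found.get(name, "")  # a missing attribute counts as empty
            if not HANDLE_RE.fullmatch(value):
                problems.append((lineno, name, value))
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    return problems, duplicates


if __name__ == "__main__":
    # Hypothetical file name; replace with the real .vrt input file.
    with open("bge_1875_2015_de.vrt", encoding="utf-8") as fh:
        problems, duplicates = check_text_tags(fh)
    for lineno, name, value in problems:
        print(f"line {lineno}: attribute {name!r} has non-handle value {value!r}")
    for text_id in duplicates:
        print(f"duplicate text id: {text_id!r}")
```

If this reports nothing, the empty values are probably introduced at a later stage (encoding or indexing) rather than in the .vrt file itself.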
Thank you kindly for your support!
Best
Stéphie
***
output of cwb-describe-corpus:
============================================================
Corpus: BGE_1875_2015_DE
============================================================
description:
registry file: /usr/local/cwb-3.4.10/share/cwb/registry/bge_1875_2015_de
home directory: /usr/local/corpora/bge_1875_2015_de/
info file: /usr/local/corpora/bge_1875_2015_de/.info
encoding: utf8
size (tokens): 49086350
3 positional attributes:
word pos lemma
22 structural attributes:
body p pb pb_n
head footnote text text_id
text_author text_title text_source text_page
text_topics text_subtopics text_language text_date
text_description text_type text_file
text_year text_decade text_url
0 alignment attributes: