[CWB] frequency list and change of classification fail in CQPweb

Lehner Stéphanie (lehs) lehs at zhaw.ch
Tue Feb 21 14:40:02 CET 2017


Dear CWB community

I am currently attempting to set up a corpus of about 50'000'000 tokens in CQPweb. Unfortunately, the following 2 commands both fail; as a CWB newbie, I am a bit at a loss as to how to find and tackle the root cause of these issues.

1. I cannot create frequency lists: 'Generate CWB text-position records' fails with the following error message:

A MySQL query did not run successfully!
Original query: insert into ___temp_cqp_text_positions_for_bge_1875_2015_de (text_id, cqp_begin, cqp_end) VALUES ('BGE08965', 38695514, 38706960),('BGE08966', 38706961, 38707670),('BGE08967', 38707671, 38709579),('BGE08968', 38709580, 38711971),('BGE08969', 38711972, 38715398), [shortened, it keeps on listing every single following BGE i.e. text ID in the corpus] /* from User: lehs_admin | Function: populate_corpus_cqp_positions() | 2017-Feb-13 15:06:54 */
Error # 1062: Duplicate entry '' for key 'PRIMARY'

2. I would like to change the datatype of some XML attributes to classification. I am aware that their values need to meet the handle criteria; also, I don't think I have any empty values. Still, I get this message:

The datatype of text_year cannot be changed to [classification], because there are non-category-handle values in the CWB index; the first non-handle value found in the index is [] .
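
Both error messages mention an empty value ('' for the primary key in no. 1, [] for text_year in no. 2), so one check that might help is to scan the input .vrt file for <text> tags with a missing, empty or repeated id, or with an empty or otherwise non-handle year. The following is only a minimal sketch under several assumptions that the post does not confirm: that the attributes are written as `<text id="..." year="...">` with double quotes, that a "handle" means ASCII letters, digits and underscores only, and that the function name `check_text_tags` is a hypothetical helper, not part of CWB or CQPweb.

```python
import re
from collections import Counter

# Assumption: opening tags look like <text id="..." year="..."> with double quotes.
TEXT_TAG = re.compile(r'^<text\b([^>]*)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')
# Assumption: a handle consists only of ASCII letters, digits and underscores.
HANDLE = re.compile(r'[A-Za-z0-9_]+')


def check_text_tags(lines, handle_attrs=("year",)):
    """Scan VRT lines and report suspicious <text> attributes.

    Returns a dict with:
      empty_ids     -- 1-based line numbers of <text> tags with a missing or empty id
      duplicate_ids -- sorted ids that occur on more than one <text> tag
      non_handles   -- (line number, attribute, value) for values that are not handles
    """
    empty_ids, non_handles = [], []
    id_counts = Counter()
    for lineno, line in enumerate(lines, start=1):
        match = TEXT_TAG.match(line)
        if match is None:
            continue  # token line, closing tag, or some other element
        attrs = dict(ATTR.findall(match.group(1)))
        text_id = attrs.get("id", "")
        if text_id:
            id_counts[text_id] += 1
        else:
            empty_ids.append(lineno)
        for name in handle_attrs:
            value = attrs.get(name, "")
            if HANDLE.fullmatch(value) is None:
                non_handles.append((lineno, name, value))
    duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)
    return {
        "empty_ids": empty_ids,
        "duplicate_ids": duplicate_ids,
        "non_handles": non_handles,
    }


# Made-up sample data, only to show the shape of the report.
sample = [
    '<text id="BGE08965" year="1990">',
    'Das\tART\tdie',
    '</text>',
    '<text id="" year="1991">',
    '</text>',
    '<text id="BGE08965" year="">',
    '</text>',
]
report = check_text_tags(sample)
print(report)
```

To run this on a real file, the opened file object can be passed instead of the sample list, since the function only iterates over lines. Any line number that shows up under `empty_ids` or `non_handles` would be a candidate for the empty value named in the error messages, although other causes are possible.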

System information:
CWB Version: Release 3.5 (Alpha)
CQPweb code: 3.2.26, Revision 924
Database: 3.2.25
Ubuntu X64 16.04
PHP 7, apache2
(VirtualBox)

Steps undertaken so far:
- checked similar posts in the mailing list (Scott Sadowsky (Sep 2016) had a problem similar to no. 2, but no definite solution was given)
- I wanted to check my database in a similar way, but the query fails:
$ <text_year="">[];
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '<text_year="">[]' at line 1
- mysqlcheck -c reports the database as OK

- checked XML validity of the input .vrt file: xmllint shows 2 types of errors, which should not be an issue (error 'huge text node' and multiple errors for ...&lt/...&gt not followed by ;)

- Also, I think it is strange that the XML element <pb n="NUMBER"></pb> (an element between <text> ... </text>) is classified by CWB as a 'free text' element, while all other elements (e.g. footnote) are correctly not classified as such. Is this a further sign that something is messed up in my data file?
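
Regarding the xmllint errors about &lt / &gt not followed by ;, a short script can list where such unterminated entity references occur, so that they can be inspected or repaired before re-encoding. This is a sketch only: the regular expression is my own approximation of an XML entity or character reference, and `unterminated_entities` is a hypothetical helper name.

```python
import re

# '&' plus an entity name or numeric reference that is NOT closed by ';'.
# The lookahead also rejects further name characters, so that the regex
# cannot backtrack and report a prefix of a well-formed reference.
UNTERMINATED = re.compile(
    r'&(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*)(?![A-Za-z0-9;])'
)


def unterminated_entities(lines):
    """Return (line number, matched text) for each reference lacking ';'."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in UNTERMINATED.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Made-up sample lines: only the second one contains unterminated references.
sample = [
    'A &lt; B',
    'tag &lt/p&gt here',
    'R&amp;D &#60; ok',
]
hits = unterminated_entities(sample)
print(hits)
```

A bare & that is not followed by a name is deliberately not reported here, since the xmllint message quoted above is about references that lack the closing semicolon.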


Thank you kindly for your support!
Best
Stéphie


***
output of cwb-describe-corpus:

============================================================
Corpus: BGE_1875_2015_DE
============================================================

description:
registry file:  /usr/local/cwb-3.4.10/share/cwb/registry/bge_1875_2015_de
home directory: /usr/local/corpora/bge_1875_2015_de/
info file:      /usr/local/corpora/bge_1875_2015_de/.info
encoding:       utf8
size (tokens):  49086350

  3 positional attributes:
      word            pos             lemma

 22 structural attributes:
      body            p               pb              pb_n
      head            footnote        text            text_id
      text_author     text_title      text_source     text_page
      text_topics     text_subtopics  text_language   text_date
      text_description                text_type       text_file
      text_year       text_decade     text_url

  0 alignment  attributes:
