[CWB] Format for metadata files?

Hardie, Andrew a.hardie at lancaster.ac.uk
Sat Dec 3 21:18:29 CET 2016


Weird, I can't fathom this one. The error arises from the following check on the database:

  select distinct `text_id` from text_metadata_for_$corpus where `text_id` REGEXP '[^A-Za-z0-9_]'

I can't see any reason why the text IDs listed as erroneous in your output would be matched by that regex. Are there, perhaps, rogue non-printing characters in the file? (e.g. a Unicode byte order mark at the start of each line?)

If you open the file in a regex-capable text editor, you might be able to find the problem using the regex above, i.e.

[^A-Za-z0-9_]
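
If no such editor is to hand, the same hunt could be done with a short Python script. This is only a rough sketch, not anything CQPweb itself runs: it assumes the metadata file is UTF-8 and tab-separated with the text ID in the first column, and the function names are made up for illustration. It prints the code point of every character in an ID that falls outside the permitted set, so invisible characters show up explicitly.

```python
import re

# Same character class as the database check: anything that is not
# an unaccented letter, a digit, or an underscore.
BAD_CHAR = re.compile(r"[^A-Za-z0-9_]")


def find_bad_ids(lines, sep="\t"):
    """Return (line_number, text_id, offending_chars) for each suspect ID.

    Assumes the text ID is the first field on each line and that
    fields are separated by `sep` (a tab by default).
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text_id = line.rstrip("\r\n").split(sep, 1)[0]
        bad = BAD_CHAR.findall(text_id)
        if bad:
            problems.append((number, text_id, bad))
    return problems


def check_file(path):
    """Print every suspect ID in the file with the code points found."""
    # Plain "utf-8" (not "utf-8-sig") so that a byte order mark is
    # kept in the text and therefore reported rather than hidden.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, text_id, bad in find_bad_ids(handle):
            codes = ", ".join(f"U+{ord(ch):04X}" for ch in bad)
            print(f"line {number}: {text_id!r} contains {codes}")
```

Calling check_file("jeunesse.meta") would then list any offending lines. If every ID is reported as containing a space, the columns are probably separated by spaces rather than tabs, which is worth knowing in its own right.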

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Graham Ranger -- UAPV
Sent: 03 December 2016 17:19
To: Open source development of the Corpus WorkBench
Subject: [CWB] Format for metadata files?

Hello,
I'm getting the following error message when I try to load the metadata 
file for a corpus:

The data source you specified for the text metadata contains 
badly-formatted text ID codes, as follows: <strong> 
'assollant_rose_d_amour'; 'bruno_le_tour_de_la_france'; 
'bruyere_l_epee_de_charlemagne'; 'daudet_lettres_de_mon_moulin'; 
'malot_sans_famille'; 'marcel_les_petits_vagabonds'; 
'robida_les_assieges_de_compiegne'; 'segur_malheurs_de_sophie'; 
'segur_un_bon_petit_diable'; 'verne_cinq_semaines_en_ballon'; 
'verne_le_tour_du_monde'; 'zola_nouveaux_contes_a_ninon';</strong> 
(text ids can only contain unaccented letters, numbers, and underscore).

The metadata is in a file called jeunesse.meta in which each line begins 
with the text id of the texts in the corpus.
Inside the metadata file, the lines read as follows:

assollant_rose_d_amour    alfred_assollant    rose_d_amour 1889    1850_1899    roman    avance
bruno_le_tour_de_la_france    bruno    le_tour_de_la_france 1877    1850-1899    manuel_scolaire    elementaire
etc.

with text id, author, title, date, period, genre and level.

I can't see what is wrong with the file: the error message suggests that 
it's formatted as <strong>, but it's just plain text!
Thanks as always for any help.
Best,
Graham.
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb