[CWB] Format for metadata files?
Hardie, Andrew
a.hardie at lancaster.ac.uk
Sat Dec 3 21:18:29 CET 2016
Weird, I can't fathom this one. The error arises from the following check on the database:
select distinct `text_id` from text_metadata_for_$corpus where `text_id` REGEXP '[^A-Za-z0-9_]'
I can't see any reason why the text IDs listed as erroneous in your output would be matched by that regex. Are there, perhaps, rogue non-printing characters in the file? (e.g. a Unicode byte order mark at the start of each line?)
If you open the file in a regex-capable text editor, you might be able to find the problem using the regex above, i.e.
[^A-Za-z0-9_]
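If a text editor doesn't show anything, the same check can be run outside the database. What follows is only a rough sketch, not anything CQPweb itself does: it assumes the metadata file is UTF-8 and tab-separated with the text ID in the first field, and the function names are made up for illustration. It prints the code point of every character in an ID that the regex above would match, so that invisible characters become visible.

```python
import re

# Same character class as the database check: anything that is not an
# unaccented letter, a digit, or an underscore.
BAD_CHAR = re.compile(r"[^A-Za-z0-9_]")


def find_bad_text_ids(lines):
    """Return (line_number, text_id, code_points) for each suspect ID.

    Assumption: the text ID is the first tab-separated field on each line.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text_id = line.rstrip("\r\n").split("\t")[0]
        bad = BAD_CHAR.findall(text_id)
        if bad:
            # Report code points so non-printing characters can be identified.
            codes = ["U+%04X" % ord(ch) for ch in bad]
            problems.append((number, text_id, codes))
    return problems


def check_file(path):
    # "utf-8" (rather than "utf-8-sig") is deliberate: it leaves a byte
    # order mark in place so that it is reported instead of silently dropped.
    with open(path, encoding="utf-8") as handle:
        return find_bad_text_ids(handle)


if __name__ == "__main__":
    for number, text_id, codes in check_file("jeunesse.meta"):
        print(number, repr(text_id), " ".join(codes))
```

A byte order mark would show up as U+FEFF. If the fields are in fact separated by spaces rather than tabs, the whole line is treated as the ID and U+0020 is reported.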
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Graham Ranger -- UAPV
Sent: 03 December 2016 17:19
To: Open source development of the Corpus WorkBench
Subject: [CWB] Format for metadata files?
Hello,
I'm getting the following error message when I try to load the metadata
file for a corpus:
The data source you specified for the text metadata contains
badly-formatted text ID codes, as follows: <strong>
'assollant_rose_d_amour'; 'bruno_le_tour_de_la_france';
'bruyere_l_epee_de_charlemagne'; 'daudet_lettres_de_mon_moulin';
'malot_sans_famille'; 'marcel_les_petits_vagabonds';
'robida_les_assieges_de_compiegne'; 'segur_malheurs_de_sophie';
'segur_un_bon_petit_diable'; 'verne_cinq_semaines_en_ballon';
'verne_le_tour_du_monde'; 'zola_nouveaux_contes_a_ninon';</strong>
(text ids can only contain unaccented letters, numbers, and underscore).
The metadata is in a file called jeunesse.meta in which each line begins
with the text id of the texts in the corpus.
Inside the metadata file, the lines read as follows:
assollant_rose_d_amour alfred_assollant rose_d_amour 1889
1850_1899 roman avance
bruno_le_tour_de_la_france bruno le_tour_de_la_france 1877
1850-1899 manuel_scolaire elementaire
etc.
with text id, author, title, date, period, genre and level.
I can't see what is wrong with the file: the error message suggests that
it's formatted as <strong>, but it's just plain text!
Thanks as always for any help.
Best,
Graham.