[CWB] How to properly encode XML-like tokens?

Hardie, Andrew a.hardie at lancaster.ac.uk
Sat Jan 14 20:04:11 CET 2012


Hi Richard,

cwb-encode only decodes the standard XML entities: gt, lt, amp, quot and apos. Anything else is not part of XML per se but is defined by some particular DTD. 

If you want characters to be represented as single characters rather than entities in the index, you need to represent them as single characters in the input data. This shouldn't be a problem in UTF8 mode even if you have really weird characters. If the Apache function you're using doesn't do what you want, then you need to find a new function!
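
As a minimal sketch of the point above (in Python for brevity; the same logic ports directly to Java), the following escapes only the five entities that cwb-encode decodes and leaves every other character, including umlauts, untouched. The helper name escape_for_cwb is hypothetical and not part of CWB, and the sketch assumes the corpus is encoded with the -x option.

```python
# Sketch only: escape exactly the five standard XML entities
# (amp, lt, gt, quot, apos) and nothing else.
# escape_for_cwb is a hypothetical helper name, not part of CWB.

_ENTITIES = [
    ("&", "&amp;"),   # must come first so the entities below are not escaped twice
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
    ("'", "&apos;"),
]


def escape_for_cwb(token: str) -> str:
    """Return the token with the five standard XML entities escaped."""
    for char, entity in _ENTITIES:
        token = token.replace(char, entity)
    return token


print(escape_for_cwb("<text>"))   # &lt;text&gt;
print(escape_for_cwb("&lt;"))     # &amp;lt;
print(escape_for_cwb("Müller"))   # Müller
```

Non-ASCII characters pass through as single characters, so no unsupported entities end up in the input.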

All p-attributes are encoded in exactly the same way.
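
Since every p-attribute is treated alike, each column of a token line would need the same escaping. A rough sketch, assuming the usual one-token-per-line input with tab-separated columns; escape_token_line is a hypothetical name, and the standard-library function xml.sax.saxutils.escape does the actual work:

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the quote entities are added here
# so that all five standard XML entities are covered.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_token_line(columns):
    """Escape every p-attribute value and join them into one tab-separated line.

    Hypothetical helper for illustration; not part of CWB.
    """
    return "\t".join(escape(value, QUOTES) for value in columns)


# Word form, part-of-speech tag and lemma for one token.
print(escape_token_line(["<RLS>", "NN", "<rls>"]))
```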

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Richard Eckart de Castilho
Sent: 14 January 2012 18:26
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] How to properly encode XML-like tokens?

Thank you for your support, Andrew.

I am now escaping the tokens and that seems to work fine; I no longer get any of these error messages. But it seems that not all XML entities are properly unescaped during indexing. For example, the off-the-shelf escaping method that I use (Apache Commons Lang StringEscapeUtils.escapeXml(String)) also escapes non-ASCII characters like German umlauts, and I currently end up with XML entities in the index.

Which XML entities are supported by cwb-encode?
What fields do I need to escape? Only the word or also the attribute values?

Best,

-- Richard

Am 13.01.2012 um 10:43 schrieb Hardie, Andrew:

> As standard for XML:
> 
> &amp;lt;
> 
> best
> 
> Andrew.
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Richard Eckart de Castilho
> Sent: 13 January 2012 08:29
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] How to properly encode XML-like tokens?
> 
> Hello Andrew,
> 
> That sounds like what I need. But how would I escape a literal "&lt;" so it doesn't become a "<" in the index?
> 
> Best,
> 
> -- Richard
> 
> Am 13.01.2012 um 02:03 schrieb Hardie, Andrew:
> 
>> Hi Richard,
>> 
>> Yes indeed there is a way to do this: &lt;text&gt; . The entities will be replaced by literal characters in the index iff you use the -x option with cwb-encode.
>> 
>> best
>> 
>> Andrew.
>> 
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Richard Eckart de Castilho
>> Sent: 12 January 2012 23:39
>> To: Open source development of the Corpus WorkBench
>> Subject: [CWB] How to properly encode XML-like tokens?
>> 
>> Hello,
>> 
>> I would like to know if there is a proper way to encode corpora with arbitrary tokens, in particular ones that look like XML.
>> For example, if I have a real token <RLS> in my corpus, I get messages like these:
>> 
>> 	s-attribute <RLS> not declared, inserted literally (input line #80094547, warning issued only once).
>> 
>> In this case it is merely an aesthetic problem, but I also sometimes have tokens that are identical to structural tags, e.g. <text>.
>> 
>> Is there some way I can escape such XML-like tokens in the input to cwb-encode, so that such messages are avoided but the tokens are still properly indexed and searchable as "<text>"?
>> 
>> Best regards,
>> 
>> -- Richard
>> 
>> --
>> -------------------------------------------------------------------
>> Richard Eckart de Castilho
>> Technical Lead
>> Ubiquitous Knowledge Processing Lab (UKP-TUD) 
>> FB 20 Computer Science Department      
>> Technische Universität Darmstadt
>> Hochschulstr. 10, D-64289 Darmstadt, Germany
>> phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
>> eckartde at tk.informatik.tu-darmstadt.de
>> www.ukp.tu-darmstadt.de
>> Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
>> ------------------------------------------------------------------- 

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

