[CWB] How to properly encode XML-like tokens?

Richard Eckart de Castilho eckart at ukp.informatik.tu-darmstadt.de
Sat Jan 14 19:25:43 CET 2012


Thank you for your support, Andrew.

I am now escaping the tokens and that seems to work fine; I no longer get any of these error messages. However, it seems that not all XML entities are unescaped during indexing. For example, the off-the-shelf escaping method that I use (Apache Commons Lang StringEscapeUtils.escapeXml(String)) also escapes non-ASCII characters such as German umlauts, so I currently end up with XML entities in the index.
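
One possible way around the umlaut problem is to escape only the characters that matter for tag detection, instead of using a general-purpose XML escaper. The Java sketch below is an illustration under stated assumptions, not a confirmed recipe: the names CwbTokenEscaper and escapeToken are hypothetical, and it assumes that cwb-encode -x decodes at least &lt;, &gt; and &amp;, which should be checked against the CWB documentation.

```java
public class CwbTokenEscaper {

    /**
     * Escapes only '&', '<' and '>' in a token.
     * All other characters, including non-ASCII ones such as umlauts,
     * are copied through unchanged.
     * Assumption: cwb-encode -x turns these three entities back into
     * literal characters in the index.
     */
    public static String escapeToken(String token) {
        StringBuilder sb = new StringBuilder(token.length() + 8);
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break; // must be escaped so a literal "&lt;" survives
                case '<': sb.append("&lt;");  break;
                case '>': sb.append("&gt;");  break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeToken("<text>")); // prints &lt;text&gt;
        System.out.println(escapeToken("&lt;"));   // prints &amp;lt;
    }
}
```

Because '&' is itself escaped, a token that literally reads "&lt;" is written as "&amp;lt;" and should come back as "&lt;" rather than "<" after decoding, which matches the standard XML behaviour described in the quoted reply below.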

Which XML entities are supported by cwb-encode?
What fields do I need to escape? Only the word or also the attribute values?

Best,

-- Richard

On 13.01.2012 at 10:43, Hardie, Andrew wrote:

> As standard for XML:
> 
> &amp;lt;
> 
> best
> 
> Andrew.
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Richard Eckart de Castilho
> Sent: 13 January 2012 08:29
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] How to properly encode XML-like tokens?
> 
> Hello Andrew,
> 
> That sounds like what I need. But how would I escape a literal "&lt;" so that it doesn't become a "<" in the index?
> 
> Best,
> 
> -- Richard
> 
> On 13.01.2012 at 02:03, Hardie, Andrew wrote:
> 
>> Hi Richard,
>> 
>> Yes indeed there is a way to do this: &lt;text&gt; . The entities will be replaced by literal characters in the index iff you use the -x option with cwb-encode.
>> 
>> best
>> 
>> Andrew.
>> 
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Richard Eckart de Castilho
>> Sent: 12 January 2012 23:39
>> To: Open source development of the Corpus WorkBench
>> Subject: [CWB] How to properly encode XML-like tokens?
>> 
>> Hello,
>> 
>> I would like to know if there is a proper way to encode corpora with arbitrary tokens, in particular tokens that look like XML.
>> For example, if I have a real token <RLS> in my corpus, I get messages like these:
>> 
>> 	s-attribute <RLS> not declared, inserted literally (input line #80094547, warning issued only once).
>> 
>> In this case it is merely an aesthetic problem, but I also sometimes have tokens that are identical to structural tags, e.g. <text>.
>> 
>> Is there some way I can escape such XML-like tokens in the input to cwb-encode, so that such messages are avoided but the tokens are still properly indexed and searchable as "<text>"?
>> 
>> Best regards,
>> 
>> -- Richard
>> 
>> --
>> -------------------------------------------------------------------
>> Richard Eckart de Castilho
>> Technical Lead
>> Ubiquitous Knowledge Processing Lab (UKP-TUD) 
>> FB 20 Computer Science Department      
>> Technische Universität Darmstadt
>> Hochschulstr. 10, D-64289 Darmstadt, Germany
>> phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
>> eckartde at tk.informatik.tu-darmstadt.de
>> www.ukp.tu-darmstadt.de
>> Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
>> ------------------------------------------------------------------- 


