[CWB] charset parameter in registry produces cqp errors

Stefan Evert stefan.evert at uos.de
Mon Jul 30 21:11:11 CEST 2007



Hi Dieter, hi everyone!

> this is my first mail to this list so please forgive me if this has
> already been answered before,

Don't worry, I don't think we've had that question before, and it's a  
good opportunity to talk about it.

> and many thanks for this great piece of
> software!
>
> I got a problem concerning the "charset" parameter in the registry
> files: When I enable the "charset = iso-8859-2" option (aka remove the
> trailing "##::") then I get the following error in the cqp logs:
>

That's a misunderstanding: the parameter hasn't been commented out.
Rather, "##::" is a special type of comment that allows you to
embed arbitrary metadata (in the form of key-value pairs) in a
registry file.  I introduced that syntax when I started building new
CQP versions but wasn't really sure whether they would be stable, so
I needed extensions that were fully backwards compatible (so that the
older, stable version of CQP would still accept the new registry
syntax).
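
As a rough illustration of why this is backwards compatible, the sketch
below reads "##::" lines as key-value pairs, while a reader that knows
nothing about them simply skips them as "#" comments.  It is not CQP's
actual registry parser; the function names and the sample fragment are
hypothetical, and the exact line shape (optional quotes around the
value) is an assumption based on the charset example in this message.

```python
import re

# Assumed shape of a metadata line: ##:: key = "value" (quotes optional).
META_LINE = re.compile(r'^##::\s*([A-Za-z_][\w-]*)\s*=\s*"?([^"]*?)"?\s*$')


def read_registry_metadata(text):
    """Collect key-value pairs from '##::' comment lines."""
    props = {}
    for line in text.splitlines():
        match = META_LINE.match(line.strip())
        if match:
            props[match.group(1)] = match.group(2)
    return props


def strip_comments(text):
    """What a reader unaware of '##::' sees: every '#' line is a comment."""
    return [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


# Hypothetical registry fragment, for illustration only.
sample = 'ID demo\n##:: charset = "latin2"\n# an ordinary comment\n'

print(read_registry_metadata(sample))  # metadata-aware reader
print(strip_comments(sample))          # older reader: metadata is invisible
```

Both readers accept the same file, which is the point of hiding the
metadata inside a comment.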

All you have to do is keep the "##::" and change the charset value to  
"latin2" (CQP won't understand iso-8859-2), like so:

##:: charset = "latin2"

Unfortunately ...

> 2) Is the charset parameter still only "informational" like Stefan wrote
> in http://devel.sslmit.unibo.it/pipermail/cwb/2007-February/000065.html

... it's still purely informational, as you suspected.  The intention
is that future CWB versions will honour the charset setting and
adjust the %c and %d operators accordingly (and deactivate the
special latin1-specific LaTeX escapes allowed in strings).  I don't
think there's much more that could be done sensibly (CQP doesn't use
locale-specific collations when sorting, and I'm not really keen to
get into that mess ...).
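
To see why %c (ignore case) and %d (ignore diacritics) depend on the
charset, consider that the same byte stands for different letters in
latin1 and latin2.  The sketch below is only an illustration in Python
using Unicode decomposition; it is not how CQP implements these
operators, and the function name fold() is made up for this example.

```python
import unicodedata


def fold(raw, charset, ignore_case=False, ignore_diacritics=False):
    """Decode bytes in the given charset, then apply the requested folding."""
    text = raw.decode(charset)
    if ignore_diacritics:
        # Decompose each character, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if ignore_case:
        text = text.lower()
    return text


# Byte 0xC8 is C-with-caron in latin2 but E-with-grave in latin1,
# so the folded result depends on which charset is declared.
print(fold(b"\xc8", "latin2", ignore_case=True, ignore_diacritics=True))
print(fold(b"\xc8", "latin1", ignore_case=True, ignore_diacritics=True))
```

Folding with the wrong table would therefore match the wrong base
letter.  Note that this decomposition trick does not cover letters
without a canonical decomposition, such as the Polish l with stroke.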

Actually, this is something that volunteers who want to get started  
on CQP hacking could contribute very easily, as it only requires  
encapsulated changes to a small set of functions in a single file.   
Any takers?  I'll be happy to point you towards the relevant bits of  
code and explain the architecture of the character mapping system. :o)

Best to all of you,
Stefan

PS: On a bright note, I feel rather confident at the moment that I'll
be ready to release the source code on sf.net some time in August ...


------------------------------http://cwb.sourceforge.net/----------

[no corpus]> DEWAC;
DEWAC> [pos="ART"] ([pos="ADV|ADJD"]? [pos="ADJA"] ",|und|oder"?)*  
[pos="NN"];
140400753 matches. Use 'cat' to show.

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]




