[CWB] [PATCH] cwb-encode checking registry directory

Stefan Evert stefanML at collocations.de
Wed Feb 10 00:30:24 CET 2010


Hello!

Thanks for the patch.  I had to (or wanted to) change a couple of things
this time.

> - moved the code that checks corpus directory earlier (where I added it
> was too late on some cases)
>
> - added code to check if the registry directory exists (if not,
> complain and abort)

That's a misunderstanding: -R specifies the full path (directory +  
filename) of the registry entry to be created, not just the registry  
directory.  So registry_file must _not_ be a directory, and in normal  
usage it will often not exist.  I've moved the test up to the filename  
validity check, where I temporarily shorten the string to the  
directory part only.
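
As a rough illustration only (this is not the committed code), the
"temporarily shorten the string" idea could be sketched in C as below.
The helper name registry_dir_exists and its return convention are
assumptions made for this example; the sketch also assumes '/' as the
path separator and treats a path without any '/' as living in the
current directory.

```c
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper, NOT the actual cwb-encode implementation.
 * Checks whether the directory part of a registry file path exists.
 * Returns 1 if it exists and is a directory, 0 otherwise.
 * The path buffer is modified temporarily and restored before return. */
static int registry_dir_exists(char *path)
{
  struct stat sb;
  char *slash = strrchr(path, '/');   /* last path separator, if any */
  int ok;

  if (slash == NULL || slash == path)
    return 1;   /* assumption: no directory part, or file directly under "/" */

  *slash = '\0';                      /* shorten to the directory part only */
  ok = (stat(path, &sb) == 0 && S_ISDIR(sb.st_mode));
  *slash = '/';                       /* restore the full path */
  return ok;
}
```

Note that the file named by the full path is never tested itself, which
matches the point above that the registry entry will often not exist yet.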

> - added code to check if the registry directory (the last portion, that
> is) just includes lowercase letters, digits or underscores (let me know
> to enlarge this set)

I've made the check less strict, as the CWB traditionally allowed  
almost everything (except uppercase letters) in the registry filenames  
and I don't want to break backward compatibility.  cwb-encode only  
aborts if it detects an uppercase letter (or some other characters  
known to be problematic); but it will issue a warning if the filename  
is not in canonical format (only a-z, 0-9, _ and -).
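
A minimal sketch of this two-level check, again in C and again not the
committed code: the enum, the function name check_registry_name and the
handling of an empty filename are assumptions for illustration.  The
"other characters known to be problematic" are not listed in this
message, so the sketch only rejects uppercase ASCII letters.

```c
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum name_status { NAME_OK, NAME_WARN, NAME_REJECT };

/* Hypothetical helper, NOT the actual cwb-encode implementation.
 * Looks only at the filename part of the registry path:
 *   - any uppercase ASCII letter       -> NAME_REJECT (caller aborts)
 *   - anything outside a-z, 0-9, _, -  -> NAME_WARN   (caller warns)
 *   - otherwise                        -> NAME_OK
 */
static enum name_status check_registry_name(const char *path)
{
  const char *slash = strrchr(path, '/');
  const char *name = (slash != NULL) ? slash + 1 : path;
  const char *p;
  enum name_status status = NAME_OK;

  if (*name == '\0')
    return NAME_REJECT;   /* assumption: an empty filename is an error */

  for (p = name; *p != '\0'; p++) {
    char c = *p;
    if (c >= 'A' && c <= 'Z')
      return NAME_REJECT;               /* uppercase is always fatal */
    if (!((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
          c == '_' || c == '-'))
      status = NAME_WARN;               /* non-canonical, keep scanning */
  }
  return status;
}
```

The scan continues after a non-canonical character so that an uppercase
letter later in the name still leads to rejection rather than a warning.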

> - added code to check if the registry directory was supplied (if not,
> complain and abort)

I've removed this check, also for backward compatibility.  Some users  
may have build systems that generate the registry entry beforehand and  
then run cwb-encode without -R.  This certainly makes sense if you  
want to add information that cwb-encode doesn't generate, such as  
charset (until recently), alignment attributes, etc.

Patch with changes has been committed.

Cheers,
Stefan




