[CWB] CQPweb: Suggestions for improvement (categories, POST)

Noah Bubenhofer bubenhofer at cl.uzh.ch
Thu Mar 9 08:51:50 CET 2017


>>>> 
> But perhaps it's more of a design issue? It would be very helpful to be more flexible concerning category values. We always use this very nice function of getting the metadata from the XML. But then the documents do not always meet CQPweb's rules for handles. Maybe a solution would be to let CQPweb internally convert the values into hashes and keep a table in the database storing the value-hash pairs?
> <<<
> 
> Alas, no. The solution is to preprocess your input files to make sure any XML attributes you want to use as classification fields contain only valid category handles...

Sorry, but this is not a solution; it's a workaround. Just to give you an impression of our workflow: we index our corpora directly in CWB with cwb-encode and normally work on the corpora with CWB itself. All our metadata is in the XML headers of the texts. As you know, in CWB, using the CQP group statement, it is no problem to have the whole range of characters, blanks etc. in the values of the categories used for grouping.

Sometimes we then also index the CWB corpus in CQPweb to have a nice interface for accessing it. That is when we run into all the problems with category values that are perfectly valid as XML attribute values and can also be used in CWB. Sure, I could then reprocess the corpus, but I want to avoid that. It would be great to come as close as possible to the ideal that if a corpus works in CWB, it also works in CQPweb.

Also think of non-English corpora, where it is very common to have accented characters, umlauts etc. in category values.
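
To make the hash idea from my earlier mail concrete, here is a minimal sketch. It is written in Python purely for illustration and is not CQPweb code (CQPweb itself is PHP). The function names, the "h_" prefix and the 16-character truncation are assumptions made for this example, and the handle rule is reduced to the A-Za-z0-9_ character set mentioned in this thread, ignoring any length limit.

```python
import hashlib
import re

# Handle rule as described in this thread: only A-Za-z0-9_ is allowed.
# (Assumption: length limits and other constraints are ignored here.)
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_handle(value: str) -> bool:
    """True if the value consists solely of handle-safe characters."""
    return HANDLE_RE.fullmatch(value) is not None


def value_to_handle(value: str, length: int = 16) -> str:
    """Derive a handle-safe identifier from an arbitrary category value."""
    digest = hashlib.sha1(value.encode("utf-8")).hexdigest()
    # Hypothetical "h_" prefix: marks generated handles and avoids a leading digit.
    return "h_" + digest[:length]


def build_handle_table(values):
    """Return {handle: original value}, i.e. the pairs a database table would store."""
    table = {}
    for value in values:
        # Values that are already valid handles are kept unchanged.
        handle = value if is_valid_handle(value) else value_to_handle(value)
        # A truncated hash could collide in theory; fail loudly instead of merging categories.
        if handle in table and table[handle] != value:
            raise ValueError(f"handle collision for {value!r}")
        table[handle] = value
    return table


if __name__ == "__main__":
    table = build_handle_table(["news", "Neue Zürcher Zeitung", "prose/fiction"])
    for handle, original in table.items():
        print(handle, "->", original)
```

With something along these lines, the strict handle rules could stay untouched internally, while the interface looks up the original value for display.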

I'm fully aware that improving the category handling in CQPweb is quite a lot of work, and I really appreciate the great effort you have put, and continue to put, into developing CQPweb. But I'm a bit unhappy to see this issue classed as a non-issue (unless I'm overlooking something that could ease our workflow).

best,
Noah


> 
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Noah Bubenhofer
> Sent: 08 March 2017 19:03
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] CQPweb: Suggestions for improvement (categories, POST)
> 
>> On 08.03.2017 at 18:03, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
>> 
>>>>> 1. The rules for handles used to build categories for the metadata from the XML text values are very strict.
>> 
>> This is 100% intentional and by design. Handles may have to be used in contexts where characters outside A-Za-z0-9_ can break parsing and stop the handle from being identified properly (e.g. in the internal serialisation of query restrictions, query postprocesses, and other things).
>> 
>>>> I didn't experience any problems with this solution.
>> 
>> More precisely, you haven't experienced any problems *yet*.
>> 
>> Use this solution at your own risk. I am not going to change the rules for handles.
> 
> OK, I see. Well, we've been working for several years with "hacked" versions of CQPweb without any problems. But of course, it's a bit risky...
> 
> But perhaps it's more of a design issue? It would be very helpful to be more flexible concerning category values. We always use this very nice function of getting the metadata from the XML. But then the documents do not always meet CQPweb's rules for handles. Maybe a solution would be to let CQPweb internally convert the values into hashes and keep a table in the database storing the value-hash pairs?
> 
>>>> Creating the metadata table from corpus XML annotations does not work if there are a lot of different xml annotations......... I think the better solution would be to use POST instead of GET in the form
>> 
>> Quite correct. I do use POST for some forms, but I'd not realised this was one where it would be useful. I've made the change in my copy; it'll be committed anon.
> 
> great, thanks a lot!
> 
> Best,
> Noah
> 
> 
>> 
>>>> And I changed all the necessary $_GET strings in metadata-admin.inc.php to $_POST.
>> 
>> Not necessary. See lines 798-801 of environment.inc.php
>> 
>> best
>> 
>> Andrew.
>> 
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Noah Bubenhofer
>> Sent: 08 March 2017 16:48
>> To: Open source development of the Corpus WorkBench
>> Subject: [CWB] CQPweb: Suggestions for improvement (categories, POST)
>> 
>> Hi,
>> 
>> After installing revision 931 of CQPweb, and with some experience of older CQPweb versions, I have the following suggestions for improving the software:
>> 
>> 1. The rules for handles used to build categories for the metadata from the XML text values are very strict. I have changed the following:
>> 
>> xml.inc.php:439:
>>       $test = '|^[\w _äöüÄÖÜßáàâéèêíìîóòôúùûçÇ\/\-]{0,' . $maxbytes . '}$|';
>> 
>> admin-lib.inc.php:656:
>>       $result = do_mysql_query("select distinct `$field` from text_metadata_for_$corpus where `$field` REGEXP '[^A-Za-z0-9_ äöüÄÖÜßáàâéèêíìîóòôúùûçÇ\-\/]'");
>> 
>> I didn't experience any problems with this solution. There might be a more elegant solution than naming all the accented characters.
>> 
>> 2. Creating the metadata table from corpus XML annotations does not work if there are a lot of different XML annotations. This is due to the very long URI that results from the GET form. Of course it is possible to reconfigure the web server to allow longer URIs, but I think the better solution would be to use POST instead of GET in the form.
>> 
>> I changed:
>> indexforms-admin.inc.php:830
>> <form action="metadata-admin.php" method="post">
>> 
>> And I changed all the necessary $_GET strings in metadata-admin.inc.php to $_POST.
>> 
>> Best,
>> Noah
>> 
>> 
>> 
>> 
>> Universität Zürich
>> Institut für Computerlinguistik
>> Projekt "Visual Linguistics"
>> 
>> Andreasstrasse 15
>> CH-8050 Zürich
>> 
>> www.bubenhofer.com
>> www.visual-linguistics.net
>> bubenhofer at cl.uzh.ch (PGP-Schlüssel vorhanden)
>> Tel. +41 44 635 67 18
>> Büro 2.18
>> 
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
> 

Universität Zürich
Institut für Computerlinguistik
Projekt "Visual Linguistics"

Andreasstrasse 15
CH-8050 Zürich

www.bubenhofer.com
www.visual-linguistics.net
bubenhofer at cl.uzh.ch (PGP-Schlüssel vorhanden)
Tel. +41 44 635 67 18
Büro 2.18
