[CWB] Suggestion: user intervention in constructing an index

Mon Mar 26 01:06:38 CEST 2018

Hi Ciarán,

David has explained a number of the points while I’ve not been able to reply, so I shall try not to repeat things he’s already said while replying now!

>> It will avoid having a permanent multi-column file outside the corpus, but won't the multiple columns still exist internally in some form within the corpus?  :-(

Yes, but it has to. If you want to store more than one item of separately-searchable information about each token – in this case, your word/demut combination – then you have to have multiple attributes.

If you want to avoid at all costs multiple attributes being stored under the hood then… you don’t want to use CWB! (Or Manatee, since that works on precisely the same principle.)

In re,
>> ... you can address the second point (of rendering) by writing a display program which lays things out to your liking using one of the interface libraries, i.e. the CWB-Perl modules or the cqp.inc.php module from CQPweb. Or, if you prefer, just write your rendering script to pipe text in and out of a cqp slave instance (which is what the Perl and PHP libraries do behind the scenes).

>> I'm not sure whether these two things (the additional binary attribute, and CWB-Perl) are two independent suggestions, or two aspects of the same suggestion.

They are two aspects of the same suggestion as they are two different ways of putting a custom interface layer in between CQP and the user. You can either use the libraries, or control CQP in slave mode directly. Either would make it possible for you to apply modifications to the CQP display if you wanted to do so.

>> Where can I get info about binary p-attributes?

By a “binary p-attribute” I simply mean a p-attribute which only contains 2 distinct values. It does not matter what they are: T/F, 1/0, whatever.

You would access such an attribute by turning on its display in the CQP concordance. Each word would then be followed by a value that shows whether it has an orthographic space after it or not. For instance:

>>This/1 is/0 n’t/1 funny/0 .

… and then you program your intermediate script to convert that to

>>This isn’t funny.

i.e. making “isn’t” appear without an internal orthographic space, even though it is 2 tokens in the index, and removing the orthographic space before punctuation.
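To make the conversion step concrete, here is a minimal sketch in Python of what such an intermediate script might do with one line of concordance output. It assumes the token and the flag are joined by "/" as in the example above, that 1 means "space after" and 0 means "no space after", and that a token with no flag (like the final "." in the example) takes no following space. The function name render is purely illustrative; it is not part of CWB, CWB-Perl or BNCweb.

```python
def render(tagged_line, sep="/"):
    """Rebuild running text from 'token/flag' items.

    Assumption: flag "1" = orthographic space after the token,
    flag "0" = no space after the token.
    """
    pieces = []
    for item in tagged_line.split():
        token, found, flag = item.rpartition(sep)
        if not found:
            # No separator in this item (e.g. the bare "." in the example):
            # treat the whole item as the token and assume no space after it.
            token, flag = item, "0"
        pieces.append(token)
        if flag == "1":
            pieces.append(" ")
    # Drop a trailing space left behind by a final token flagged "1".
    return "".join(pieces).rstrip()


print(render("This/1 is/0 n’t/1 funny/0 ."))  # This isn’t funny.
```

Splitting on the last "/" rather than the first means a token that itself contains a slash still keeps its flag. A real script would also have to cope with whatever else the concordance line contains (match markers, corpus positions and so on), which this sketch ignores.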

This is approximately how BNCweb does it (I’m writing from memory so I may not have the detail right, but this is the principle.)

>> If I need to use CWB-Perl, or if using it would make things easier, I notice that the README in CWB-Perl 2.2.102 mentions "cwb-config", but https://github.com/cran/rcqp/blob/master/src/cwb/man/cwb-config.pod says that cwb-config is not yet available for Windows.

cwb-config is very Unix-specific; Unix build systems use these kinds of programs to find out what kind of system they are running on. I haven’t worked out what – if anything – would be the equivalent on Windows. If/when I do I’ll add it!

Also, compilation of the Perl modules on Windows is something I haven’t sussed out yet. The documentation will be updated when I do. (Incidentally, though, the version of CWB hosted within the rcqp repo does not seem to be remotely up to date – the commits are flagged as 6 years old. So that copy of the man page will probably stay as-is for ever!)

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
Sent: 21 March 2018 10:38
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Suggestion: user intervention in constructing an index

Thanks again Andrew.
>> I am not comfortable with the idea of storing two columns to hold things which (unlike with normal lemmatisation) can be automatically generated from one column — during the indexing process, if access by a user-supplied script were usable there, acting on the text shown in column 1 to produce what is shown in column 2.

But as I’ve explained, there is already a way to do that if you don’t want a permanent multi-column file – just put your user script into a pipeline with cwb-encode on the end, i.e.:

    cat one-col-file | column-transform-script | cwb-encode [options]

OK, I had thought your pipeline suggestion applied only to your first answer (transforming "word"), but I see now that it can apply to the second answer too (transform "word" and add "lemma"). Pipelining is not something I have worked with in Windows/DOS, but I assume it will be feasible.
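As a rough illustration of what the column-transform-script in that pipeline could look like, here is a minimal sketch in Python. It reads one token per line and writes two tab-separated columns (the original word, then the derived form) ready to be piped on to cwb-encode. Everything named here is an assumption for illustration: demut() is only a placeholder that lowercases the word, since the real demutation rules are not given in this thread, and lines beginning with "<" are assumed to be XML tags that should pass through unchanged.

```python
import sys


def demut(word):
    # Placeholder only: stands in for the real, user-supplied
    # transformation of column 1 into column 2.
    return word.lower()


def transform(lines):
    """Turn one-column input lines into 'word<TAB>derived' lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("<"):
            # Blank lines and (assumed) XML tag lines are passed through as-is.
            yield line
        else:
            yield f"{line}\t{demut(line)}"


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    # Filter mode: one-column text in, two-column text out.
    for out in transform(stream_in):
        stream_out.write(out + "\n")
```

Calling main() at the bottom of the file turns this into the filter in the middle of the pipeline, so the two-column form only ever exists in the stream between the script and cwb-encode, never as a file on disk.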

It will avoid having a permanent multi-column file outside the corpus, but won't the multiple columns still exist internally in some form within the corpus?  :-(

Some display systems like BNCweb remove non-original orthographic spaces from the CQP concordance. (BNCweb does this by having an additional binary p-attribute storing the “orthographic-space-after” data) ...

...  you can address the second point (of rendering) by writing a display program which lays things out to your liking using one of the interface libraries i.e. the CWB-Perl modules or the cqp.inc.php module from CQPweb.  Or, if you prefer, just write your rendering script to pipe text in and out of a cqp slave instance (which is what the Perl and PHP libraries do behind the scenes).

I'm not sure whether these two things (the additional binary attribute, and CWB-Perl) are two independent suggestions, or two aspects of the same suggestion.

I'm definitely interested in copying the BNCweb idea.  Where can I get info about binary p-attributes?  Where should I look to find out about reading this attribute from a script or program?

If I need to use CWB-Perl, or if using it would make things easier, I notice that the README in CWB-Perl 2.2.102 mentions "cwb-config", but https://github.com/cran/rcqp/blob/master/src/cwb/man/cwb-config.pod says that cwb-config is not yet available for Windows.

Regards,
Ciarán.