[CWB] [cwb:feature-requests] #48 CQPweb: add control for rendering of spacing around punctuation

Stefan Evert schtepf at users.sf.net
Sat Jul 1 15:00:10 CEST 2017


I would strongly argue in favour of the BNCweb solution! I think that corpus designers should be encouraged to preserve the original whitespace in the tokenization phase – good tokenizers have an option to do this.
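
To make the idea of preserving whitespace at tokenization time concrete, here is a minimal Python sketch. It is an illustration only, not BNCweb's or CQPweb's actual code: the tokenizer regex, the function names and the `space_after` flag are all assumptions made for the example. Each token is stored together with a flag saying whether whitespace followed it in the source text, and that flag could be written out as an extra column in the vertical-format input and encoded as an additional p-attribute.

```python
import re

# Hypothetical toy tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize_with_spacing(text):
    """Return (token, space_after) pairs.

    space_after is "1" if the token was followed by whitespace in the
    original text and "0" otherwise (assumed encoding, for illustration).
    """
    pairs = []
    for m in TOKEN_RE.finditer(text):
        end = m.end()
        followed = end < len(text) and text[end].isspace()
        pairs.append((m.group(), "1" if followed else "0"))
    return pairs

def to_vertical(pairs):
    """One token per line, with the spacing flag as a second tab-separated column."""
    return "\n".join(f"{tok}\t{flag}" for tok, flag in pairs)

def render(pairs):
    """Rebuild the display string, inserting a space only where the flag says so."""
    out = []
    for tok, flag in pairs:
        out.append(tok)
        if flag == "1":
            out.append(" ")
    return "".join(out).rstrip()
```

With this approach the rendering step needs no language-specific rules: for single-spaced input, `render(tokenize_with_spacing(text))` gives back the original spacing, including around quote marks.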


---

**[feature-requests:#48] CQPweb: add control for rendering of spacing around punctuation**

**Status:** open
**Group:** TODO-4.0
**Labels:** CQPweb 
**Created:** Tue Dec 25, 2012 06:35 AM UTC by Andrew Hardie
**Last Updated:** Tue Dec 25, 2012 06:35 AM UTC
**Owner:** Andrew Hardie


Currently, punctuation marks are always displayed with a space on either side in concordance and extended-context view (because they are indexed as separate tokens).

It would be possible, however, to add functionality to have punctuation marks display without the bounding spaces. There are two possible ways to do this:

(1) the way BNCweb does it - by having separate p-attributes encoding whether or not there is a space adjacent. This increases the complexity of setup and requires more disk space.

(2) an alternative which would use less disk space but would make concordance rendering take (milliseconds) longer - allow a regex to be specified and make the addition of an intervening space conditional on the regex being (not) matched. For example, \W+ could be used for English, so that any token made up solely of punctuation marks would be attached to the preceding token. This would allow the visualisation to be correct in most but not all cases (opening quote marks, for example, would be attached to the wrong side). Separate regexes could control the space before and the space after; this would also allow Chinese data to be rendered without any intervening spaces, for instance, by setting the regex to .+.
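
Option (2) can be sketched in a few lines of Python. This is a hedged illustration under assumptions, not CQPweb code: the function name and the `no_space_before` / `no_space_after` parameters are made up for the example, and whole-token matching is assumed so that only tokens consisting entirely of matching characters are affected.

```python
import re

def render_tokens(tokens, no_space_before=None, no_space_after=None):
    """Join tokens with spaces, except where a regex suppresses the space.

    no_space_before: if a token fully matches, no space is put in front of it.
    no_space_after:  if a token fully matches, no space is put after it.
    Both parameter names are hypothetical.
    """
    before = re.compile(no_space_before) if no_space_before else None
    after = re.compile(no_space_after) if no_space_after else None
    out = []
    for i, tok in enumerate(tokens):
        if i > 0:
            prev = tokens[i - 1]
            suppress = (
                (before is not None and before.fullmatch(tok) is not None)
                or (after is not None and after.fullmatch(prev) is not None)
            )
            if not suppress:
                out.append(" ")
        out.append(tok)
    return "".join(out)

english = render_tokens(["Hello", ",", "world", "!"], no_space_before=r"\W+")
quoted = render_tokens(["He", "said", '"', "no", '"', "."], no_space_before=r"\W+")
chinese = render_tokens(["我", "喜欢", "语料库"], no_space_before=r".+")
```

Here `english` comes out as `Hello, world!` and `chinese` has no spaces at all, while `quoted` comes out as `He said" no".`, which shows the quote-mark limitation mentioned above: a single regex cannot tell an opening quote from a closing one.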

This is currently not a high-priority feature, as it is merely cosmetic.

