[CWB] [ cwb-Feature Requests-2806338 ] CQPweb: XML support

SourceForge.net noreply at sourceforge.net
Mon Jun 15 01:58:19 CEST 2009


Feature Requests item #2806338, was opened at 2009-06-14 23:58
Message generated for change (Tracker Item Submitted) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722306&aid=2806338&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQPweb
Group: None
Status: Open
Priority: 5
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: CQPweb: XML support

Initial Comment:
This is the big enhancement for version 3.0: many, MANY users have asked for it.

Just as the "text-based restrictions" parallel the "written text restrictions" in BNCweb, so the "XML-based restrictions" will need to parallel the "utterance-by-speaker-type" system in BNCweb.

Each XML span (i.e. s-attribute) which is to be covered in this way (and note, not all of the XML in a given corpus needs to be) will need to be identified by the combination of (a) an element name and (b) some given attribute. Its "id" in the database will then look a bit like this:

xml_metadata_for_CORPUSNAME [parallel to text_metadata_for_CORPUSNAME]
id          gender   class     ...      CQPbegin   CQPend
-----------------------------------------------------------
u|who|S933  m        AB        ...      \d\d\d\d   \d\d\d\d
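A minimal sketch of how such a row might be assembled, assuming the id is simply element name, attribute name and attribute value joined with "|" as in the example row above. The function names and the corpus positions used here are hypothetical; none of this is existing CQPweb code.

```python
def make_xml_id(element, attribute, value):
    """Join element name, attribute name and attribute value with '|'."""
    return "|".join((element, attribute, value))


def make_row(element, attribute, value, metadata, cqp_begin, cqp_end):
    """Build one row of a (hypothetical) xml_metadata_for_CORPUSNAME table."""
    row = {"id": make_xml_id(element, attribute, value)}
    row.update(metadata)            # e.g. gender, class, ...
    row["CQPbegin"] = cqp_begin     # first corpus position of the span
    row["CQPend"] = cqp_end         # last corpus position of the span
    return row


# Illustrative values only: the positions 1000 and 1042 are made up.
example = make_row("u", "who", "S933", {"gender": "m", "class": "AB"}, 1000, 1042)
```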

Note, however, that this kind of "natural" system for XML identifiers won't work, because the XML segment is not *uniquely* identified: the same "who" value recurs on every utterance by that speaker. Two solutions:
(1) allow CQPbegin and CQPend to contain *multiple* cwb-indexes
(2) enforce uniqueness of XML elements - so "who" could not be used for u, but "id" could be.

Neither of these is entirely satisfactory and this needs careful thinking about.
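
To make the two options concrete, here is a rough sketch using made-up spans. Under option (1) each id maps to a list of (begin, end) ranges; under option (2) an attribute is accepted only if every id built from it occurs exactly once. The data, the function names and the "element|attribute|value" id format are illustrative assumptions, not a settled design.

```python
from collections import defaultdict

# Hypothetical spans: (element, attributes, begin, end) in corpus positions.
SPANS = [
    ("u", {"id": "u1", "who": "S933"}, 0, 9),
    ("u", {"id": "u2", "who": "S934"}, 10, 24),
    ("u", {"id": "u3", "who": "S933"}, 25, 31),
]


def group_ranges(spans, element, attribute):
    """Option (1): map each id to *all* the ranges it covers."""
    ranges = defaultdict(list)
    for elem, attrs, begin, end in spans:
        if elem == element and attribute in attrs:
            key = f"{elem}|{attribute}|{attrs[attribute]}"
            ranges[key].append((begin, end))
    return dict(ranges)


def is_unique(spans, element, attribute):
    """Option (2): True only if no id built from this attribute repeats."""
    grouped = group_ranges(spans, element, attribute)
    return all(len(r) == 1 for r in grouped.values())
```

With this data, "who" yields two ranges for speaker S933 and so fails the uniqueness check for u, whereas "id" passes it.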

Also note that every different s-attribute will require (a) a different set of CWB-frequency indexes and (b) a separate set of frequency tables. This function will be **VERY** hungry for disk space.
