[CWB] Field Word Data (ELAN)

Stefan Evert stefanML at collocations.de
Mon Dec 12 11:17:57 CET 2016


Depending on how you want to use the corpus, it might also make sense to split the text into morphemes as tokens and use an s-attribute to identify complete words.  This is unnatural if you write CQP queries directly, and it wouldn't play well with CQPweb's sorting and collocation functions, but if you design your own Web interface, much of the complexity can be hidden.

In your example, this encoding would look as follows:

<s trans="The pirate has a beard">
<w orth="pirat-a">
pirat	pirate	NOUN
a	NOM	NOM
</w>
<w orth="barb-am">
barb	beard	NOUN
am	ACC	ACC
</w>
<w orth="hab-et">
hab	have	VERB
et	3SG	3SG
</w>
</s>
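
If the glosses are already aligned morpheme by morpheme, producing this layout is mostly mechanical. Below is a minimal Python sketch, not part of CWB: the function name to_morpheme_vrt and the input structure (a list of words, each a list of morph/gloss/category triples) are assumptions made purely for illustration.

```python
from xml.sax.saxutils import quoteattr


def to_morpheme_vrt(translation, words):
    """Build vertical-format lines with one morpheme per token.

    `words` is a list of words; each word is a list of
    (morph, gloss, category) tuples, one tuple per morpheme.
    This input structure is a hypothetical one chosen for the sketch.
    """
    lines = [f"<s trans={quoteattr(translation)}>"]
    for morphemes in words:
        # Keep the full word form as an annotation of the <w> region.
        orth = "-".join(morph for morph, _, _ in morphemes)
        lines.append(f"<w orth={quoteattr(orth)}>")
        for morph, gloss, category in morphemes:
            # One token per line, columns separated by TAB.
            lines.append("\t".join((morph, gloss, category)))
        lines.append("</w>")
    lines.append("</s>")
    return "\n".join(lines)


sentence = [
    [("pirat", "pirate", "NOUN"), ("a", "NOM", "NOM")],
    [("barb", "beard", "NOUN"), ("am", "ACC", "ACC")],
    [("hab", "have", "VERB"), ("et", "3SG", "3SG")],
]
vrt = to_morpheme_vrt("The pirate has a beard", sentence)
print(vrt)
```

Assuming the second and third columns are declared as p-attributes named gloss and pos (hypothetical names), a CQP query along the lines of [gloss="3SG"] expand to w should then return the complete word containing the matching morpheme, regardless of its position in the word.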

In theory, the Ziggurat data model can deal with such multiple levels of tokenization much more naturally, but we don't envisage support at the CQP / CQPweb level (which would fundamentally change assumptions made by these tools).

Best,
Stefan


> On 10 Dec 2016, at 11:00, Ruprecht von Waldenfels <ruprecht.waldenfels at gmx.net> wrote:
> 
> I wonder how to deal with multiple lines of glossing that are dependent on each other, e.g.,
> 
> Pirat-a     barb-am    hab-et
> pirate-NOM  beard-ACC  have-3SG
> NOUN-NOM    NOUN-ACC   VERB-3SG
> "The pirate has a beard"
> 
> This is a silly example, of course, but it highlights the problem: in an ideal world, I would like to be able to query for word forms that involve a morpheme with the NOUN 'pirate', i.e., to utilize the alignment within the glosses. This could be done by adding a further p-attribute that offers a set, e.g.,
> 
> <s trans="The pirate has a beard">
> pirat-a  pirate-NOM  NOUN-NOM  |pirat:pirate:NOUN|a:NOM:NOM|
> barb-am  beard-ACC   NOUN-ACC  |barb:beard:NOUN|am:ACC:ACC|
> hab-et   have-3SG    VERB-3SG  |hab:have:VERB|et:3SG:3SG|
> </s>
> 
> This would allow me to easily search for, say, a morpheme 'et' that is a third person singular marker without having to specify its position in the glossed word form. I realize the third level is not very functional here, but it stands for the real possibility of multiple glosses that relate to each other.
> 
> None of these solutions is very elegant, it seems to me: they merely succeed in making searches possible, but I cannot think of a better way.
> 


