[CWB] Spanish TreeTagger

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon May 1 01:14:49 CEST 2017


>> but it also complicates things for users, who can never really know how to search for these things until they try and get 0 hits

Not just that – it can create other tangles for unaware users.

E.g., if they compare two corpora for keywords, one with multiwords merged into single tokens and one without, then the multiwords will always show up as keywords and their component forms as negative keywords.  I have seen this sort of thing reported as a major finding in more than one undergraduate essay!
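To make the effect concrete, here is a rough sketch with toy data, not a description of how any particular tool computes keywords. It assumes a log-likelihood keyness score, and the merged token name `on_the_contrary` and the function names are invented for illustration. The same text is tokenised twice, once with the multiword merged and once without, and the two versions are compared.

```python
import math
from collections import Counter

def log_likelihood(a, b, n1, n2):
    """Log-likelihood (G2) for a word occurring a times in a corpus of
    n1 tokens and b times in a corpus of n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected frequency in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected frequency in corpus 2
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(target, reference):
    """Return (word, signed G2) pairs, highest first. A negative score
    means the word is relatively under-used in the target corpus."""
    ft, fr = Counter(target), Counter(reference)
    nt, nr = len(target), len(reference)
    rows = []
    for word in set(ft) | set(fr):
        a, b = ft[word], fr[word]
        sign = 1 if a / nt >= b / nr else -1
        rows.append((word, sign * log_likelihood(a, b, nt, nr)))
    return sorted(rows, key=lambda row: -row[1])

# Toy data: identical text, two tokenisations (hypothetical example).
separate = ["on", "the", "contrary", "it", "works"] * 50
merged = ["on_the_contrary", "it", "works"] * 50

for word, score in keywords(merged, separate):
    print(f"{word:16s} {score:8.2f}")
```

In this toy comparison the merged form comes out as the strongest positive keyword and `on`, `the` and `contrary` as negative keywords, although the underlying text is identical. The difference comes purely from the tokenisation.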

I personally prefer to stick to keeping multiword annotation as s-attributes with the components as separate entries in the token stream – as in the BNC World Edition, in fact – which supports users who want to know about them without creating confusion for users who don’t.
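As a minimal sketch of that design (toy data only, not actual CWB code or BNC content): the token stream keeps every component as its own token, and the multiword units are stored separately as regions over token positions, which is roughly how an s-attribute behaves. The region list `mw_regions`, the tag value and the helper names are assumptions made up for this example.

```python
# Token stream: every component of the multiword stays a separate token.
tokens = ["but", "on", "the", "contrary", "it", "works"]

# Multiword units kept as regions over token positions (inclusive start
# and end), analogous to an s-attribute. The tag value is illustrative.
mw_regions = [(1, 3, "AV0")]

def find_word(word):
    """Ordinary token search: works the same whether or not the word
    happens to sit inside a multiword region."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def multiwords():
    """Search for users who do care about multiword units."""
    return [(" ".join(tokens[start:end + 1]), tag)
            for start, end, tag in mw_regions]

print(find_word("contrary"))
print(multiwords())
```

A user who has never heard of the multiword annotation still finds `contrary` with a plain word search, while a user who wants the units can ask for the regions.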

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 30 April 2017 21:17
To: andres at chandia.net; Open source development of the Corpus WorkBench
Subject: Re: [CWB] Spanish TreeTagger

On Sun, Apr 30, 2017 at 7:16 AM, "Andrés Chandía" <andres at chandia.net> wrote:

Sorry for sticking my nose into this, but as a Spanish speaker I don't see the point of having, at least in this case, a multi-word construction. In English you would say

"on the contrary"

Lots of taggers do this. Connexor does it to some extent, and FreeLing seems to interpret everything possible as a multi-word construction. I find that it has its uses, but it also complicates things for users, who can never really know how to search for these things until they try and get 0 hits.

Cheers,
Scott