[CWB] Spanish TreeTagger

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Apr 30 11:47:23 CEST 2017


Hi Simon,

this is a design issue with the regular expressions used in CQPweb. To explain: [UNREADABLE] appears in the display when the regex used to extract words from the CQP concordance is unable to parse a particular token in the concordance.

In this case, the reason is that the concordance contains

...por el contrario/ADV ...

which is split into 3 words

por
el
contrario/ADV

of which only the third is well-formed according to CQPweb's expectations (that each word-token will be followed by / and then a tag). So the first two render as [UNREADABLE].

The fundamental problem is that the space, which here occurs within tokens, is also used as the token-divider in CQP concordances. CQPweb shouldn't be breaking up "por el contrario" but it has to because space is the between-token character.
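
To illustrate the failure mode, here is a minimal Python sketch. It is not CQPweb's actual code: the pattern TOKEN_RE and the function parse_concordance_line are hypothetical names made up for this example, and the only assumption carried over from above is that each whitespace-separated piece is expected to look like word/TAG.

```python
import re

# Hypothetical pattern (an assumption, not CQPweb's real regex):
# a well-formed token is "word/TAG" with no whitespace inside it.
TOKEN_RE = re.compile(r"^(\S+)/([^/\s]+)$")

def parse_concordance_line(line):
    """Split a concordance line on whitespace and parse each piece as word/tag."""
    parsed = []
    for piece in line.split():
        match = TOKEN_RE.match(piece)
        if match:
            parsed.append((match.group(1), match.group(2)))
        else:
            # No "/TAG" part found, so the piece cannot be interpreted.
            parsed.append(("[UNREADABLE]", None))
    return parsed

print(parse_concordance_line("por el contrario/ADV"))
print(parse_concordance_line("por_el_contrario/ADV"))
```

With the space inside the token, the first two pieces have no tag and fall through to the [UNREADABLE] branch; a token without internal whitespace is parsed as one word-and-tag pair.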

And the *more* fundamental problem is that CQPweb is designed to work with the human-readable CQP concordance rather than with an unambiguously parseable representation of the concordance (e.g. XML). This is on the list to fix in CWB v4 when we revamp the CQP concordance print modes.

In the meantime, you can bodge this by replacing the space in multiword tokens in the input data with some other character, e.g. _, which would then give you

...por_el_contrario/ADV ...

which would, I believe, be correctly extracted as a single word-and-tag.
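
A rough Python sketch of that replacement, applied to the tagger output before encoding, might look like the following. It assumes the input is tab-separated with the word form in the first column (as in the TreeTagger line quoted below) and that lines without a tab, such as XML tags, should be left alone. The function name join_multiword_token and the joiner parameter are made up for this example.

```python
def join_multiword_token(line, joiner="_"):
    """Replace whitespace inside the word column of a tab-separated token line.

    Assumption: the first tab-separated column is the word form. Lines that
    contain no tab (e.g. XML tags or blank lines) are returned unchanged.
    """
    if "\t" not in line:
        return line
    word, rest = line.split("\t", 1)
    # Only the word column is touched; tag and lemma columns are kept as-is.
    return joiner.join(word.split()) + "\t" + rest

print(join_multiword_token("Por el contrario\tADV\tpor~el~contrario"))
```

Any character that cannot occur inside an ordinary token would do as the joiner; _ is just the one suggested above.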

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Meier-Vieracker, Simon
Sent: 30 April 2017 10:04
To: Open source development of the Corpus WorkBench
Subject: [CWB] Spanish TreeTagger

Sorry for posting a question that is not primarily about CQP but about the TreeTagger for Spanish texts:

When I use the script "tree-tagger-spanish", a list of multiword expressions is included in the tagging procedure, producing output such as

> Por el contrario	ADV	por~el~contrario

Because CQPweb has problems with this and displays it as "[UNREADABLE] [UNREADABLE] contrario", I wonder whether I should do normal tokenization instead. However, I am not sure whether users familiar with tagged Spanish texts will expect "por el contrario" as a multiword token (for my part, I don't speak Spanish).

Or is this a bug in my CQPweb v3.2.27?

Best, Simon


-------

Dr. Simon Meier

Technische Universität Berlin
Institut für Sprache und Kommunikation
Fachgebiet Allgemeine Linguistik
Sekretariat H42
Straße des 17. Juni 135, 10623 Berlin
+49 (0) 30 314 22323
simon.meier at tu-berlin.de
http://www.linguistik.tu-berlin.de/menue/mitarbeiterinnen/wiss_mitarbeiterinnen/simon_meier/



