[CWB] Spanish TreeTagger

Meier-Vieracker, Simon simon.meier at tu-berlin.de
Sun Apr 30 12:07:10 CEST 2017


Hi Andrew,

thanks for this explanation!

However, apart from the technical details, my question is a slightly different one: since I will not be using the Spanish corpora myself, I just want to be sure what prospective users will expect when they use a Spanish corpus tagged with the TreeTagger.

To me, the best and simplest solution seems to be normal tokenization with every word as one token, since what counts as a multiword expression is a rather arbitrary matter. But if tagging multiword expressions is standard in the Spanish-speaking community, I will bow to the majority…

Best, Simon

> On 30.04.2017 at 11:47, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> 
> Hi Simon,
> 
> this is a design issue with the regular expressions used in CQPweb. To explain: [UNREADABLE] appears in the display when the regex used to extract words from the CQP concordance is unable to parse a particular token in the concordance.
> 
> In this case, the reason is that the concordance contains
> 
> ...por el contrario/ADV ...
> 
> which is split into 3 words
> 
> por
> el
> contrario/ADV
> 
> of which only the third is well-formed according to CQPweb's expectations (that each word-token will be followed by / and then a tag). So the first two render as [UNREADABLE].
> 
> The fundamental problem is that the space, which here occurs within tokens, is also used as the token-divider in CQP concordances. CQPweb shouldn't be breaking up "por el contrario" but it has to because space is the between-token character.
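
The behaviour described above can be reproduced with a minimal Python sketch. This is only an illustration: the pattern `WORD_TAG` and the function `render_concordance` are hypothetical names, and the regex is an assumed approximation of the "word, then /, then tag" expectation, not CQPweb's actual code.

```python
import re

# Assumed approximation: a piece is well-formed if it looks like word/TAG.
# This is NOT the regex CQPweb really uses.
WORD_TAG = re.compile(r".+/[^/]+")

def render_concordance(line):
    """Split a concordance line on whitespace and render each piece.

    Pieces that do not look like word/TAG are shown as [UNREADABLE].
    """
    rendered = []
    for piece in line.split():
        if WORD_TAG.fullmatch(piece):
            rendered.append(piece)
        else:
            rendered.append("[UNREADABLE]")
    return rendered

print(render_concordance("por el contrario/ADV"))
```

Because the line is split on spaces before the pattern is applied, "por" and "el" arrive without a tag and are rejected, while "contrario/ADV" passes.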
> 
> And the *more* fundamental problem is that CQPweb is designed to work with the human-readable CQP concordance rather than with an unambiguously parseable representation of the concordance (e.g. XML). This is on the list to fix in CWB v4 when we revamp the CQP concordance print modes.
> 
> In the meantime, you can bodge this by replacing the spaces in multiword tokens in the input data with some other character, e.g. _, which would then give you
> 
> ... por_el_contrario/ADV ....
> 
> which would, I believe, be correctly extracted as a single word-and-tag.
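
A minimal Python sketch of this workaround, applied to the tagger output before it is indexed. It assumes the TreeTagger output is tab-separated with the token in the first column (as in the example line quoted further down), and that lines without a tab, such as XML tags, should be left untouched. The function name `join_multiword` is hypothetical.

```python
def join_multiword(line, joiner="_"):
    """Replace spaces inside the token column of one tab-separated line.

    Assumption: column 0 is the token, followed by tag and lemma.
    Lines with no tab (e.g. XML tags) are returned unchanged, since
    spaces there may separate attributes rather than words.
    """
    stripped = line.rstrip("\n")
    columns = stripped.split("\t")
    if len(columns) < 2:
        return stripped
    columns[0] = columns[0].replace(" ", joiner)
    return "\t".join(columns)

print(join_multiword("Por el contrario\tADV\tpor~el~contrario"))
```

Only the token column is changed; the tag and the lemma (which already uses ~ as a joiner) pass through as they are.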
> 
> best
> 
> Andrew.
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Meier-Vieracker, Simon
> Sent: 30 April 2017 10:04
> To: Open source development of the Corpus WorkBench
> Subject: [CWB] Spanish TreeTagger
> 
> Sorry for posting a question that is not primarily about CQP but about the TreeTagger for Spanish texts:
> 
> When the script "tree-tagger-spanish" is used, a list of multiword expressions is included in the tagging procedure, producing output such as
> 
>> Por el contrario	ADV	por~el~contrario
> 
> Since CQPweb has problems with this and displays it as "[UNREADABLE] [UNREADABLE] contrario", I wonder whether I should use normal tokenization instead. However, I am not sure whether users familiar with tagged Spanish texts will expect "por el contrario" as a multiword token (for my part, I don't speak Spanish).
> 
> Or is this a bug in my CQPweb v3.2.27?
> 
> Best, Simon
> 
> 
> -------
> 
> Dr. Simon Meier
> 
> Technische Universität Berlin
> Institut für Sprache und Kommunikation
> Fachgebiet Allgemeine Linguistik
> Sekretariat H42
> Straße des 17. Juni 135, 10623 Berlin
> +49 (0) 30 314 22323
> simon.meier at tu-berlin.de
> http://www.linguistik.tu-berlin.de/menue/mitarbeiterinnen/wiss_mitarbeiterinnen/simon_meier/
> 
> 
> 
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb






