[CWB] Spanish TreeTagger

"Andrés Chandía" andres at chandia.net
Sun Apr 30 12:16:46 CEST 2017



Sorry for sticking my nose into this, but as a Spanish speaker I don't see the point of
having a multiword token, at least in this case. In English you would say
"on the contrary", and I doubt that expression is treated as a multiword token there.
In case you do need it, I have seen and worked with some corpora that handle multiword
expressions at another level, namely structural tagging. For instance:

<s>
...
palabra
...
<expression>
por
el
contrario
</expression>
....
palabra
....
</s>
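
In case it is useful, here is a minimal sketch of how tagger output could be turned into
that kind of layout. It is only an illustration, not something taken from an existing
tool: it assumes tab-separated lines of the form token, tag, lemma (as in the example
Simon posted), the function name to_vertical is made up, and the sample line for
"palabra" with the tag NC is invented for the demonstration. Only the word forms are
written out, as in my example above.

```python
def to_vertical(tagged_lines, tag_name="expression"):
    """Write one word per line; wrap multiword tokens in a structural tag.

    Assumes each input line is "token<TAB>tag<TAB>lemma" and that a
    multiword token contains spaces inside its first field.
    """
    out = []
    for line in tagged_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        token = line.split("\t")[0]
        words = token.split()
        if len(words) > 1:
            # Multiword token: emit the words individually inside the tag.
            out.append(f"<{tag_name}>")
            out.extend(words)
            out.append(f"</{tag_name}>")
        else:
            out.append(token)
    return out


# Hypothetical sample input (the second line is invented for illustration).
sample = [
    "Por el contrario\tADV\tpor~el~contrario",
    "palabra\tNC\tpalabra",
]
print("\n".join(to_vertical(sample)))
```

With this approach every word stays a token of its own, and the multiword reading is
still available through the surrounding region.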

I hope my comment helps....
 

Hi Andrew,

thanks for this explanation! However, apart from the more technical details, my issue is
still a bit different: since I will not be using the Spanish corpora myself in the first
place, I just want to be sure what prospective users will expect when they use a Spanish
corpus tagged with the TreeTagger. To me the best and simplest solution seems to be normal
tokenization with every word as one token, since what counts as a multiword expression is a
rather arbitrary matter. But if tagging multiword expressions is standard in the
Spanish-speaking community, I will bow to the majority…

Best, Simon

> On 30.04.2017 at 11:47, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
>
> Hi Simon,
>
> this is a design issue with the regular expressions used in CQPweb. To explain:
> [UNREADABLE] appears in the display when the regex used to extract words from the CQP
> concordance is unable to parse a particular token in the concordance.
>
> In this case, the reason is that the concordance contains
>
> ...por el contrario/ADV ...
>
> which is split into 3 words
>
> por
> el
> contrario/ADV
>
> of which only the third is well-formed according to CQPweb's expectations (that each
> word-token will be followed by / and then a tag). So the first two render as [UNREADABLE].
>
> The fundamental problem is that the space, which here occurs within tokens, is also used
> as the token-divider in CQP concordances. CQPweb shouldn't be breaking up
> "por el contrario" but it has to because space is the between-token character.
>
> And the *more* fundamental problem is that CQPweb is designed to work with the
> human-readable CQP concordance rather than with an unambiguously parseable representation
> of the concordance (e.g. XML). This is on the list to fix in CWB v4 when we revamp the CQP
> concordance print modes.
>
> In the meantime, you can bodge this by replacing the space in multiword tokens in the
> input data with some other character e.g. _ which would then give you
>
> ... por_el_contrario/ADV ....
>
> which would, I believe, be correctly extracted as a single word-and-tag.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
> On Behalf Of Meier-Vieracker, Simon
> Sent: 30 April 2017 10:04
> To: Open source development of the Corpus WorkBench
> Subject: [CWB] Spanish TreeTagger
>
> Sorry for posting a question that concerns not CQP in the first place but the TreeTagger
> for Spanish texts:
>
> When the script "tree-tagger-spanish" is used, a list of multiword expressions is included
> in the tagging procedure, e.g. printing
>
>> Por el contrario        ADV        por~el~contrario
>
> Since CQPweb has problems with this and displays it as
> "[UNREADABLE] [UNREADABLE] contrario", I wonder if I should do a normal tokenization.
> However, I am not sure whether users familiar with tagged Spanish texts will expect
> "por el contrario" as a multiword token (for my part, I don't speak Spanish).
>
> Or is this a bug in my CQPweb v3.2.27?
>
> Best, Simon
>
> -------
>
> Dr. Simon Meier
>
> Technische Universität Berlin
> Institut für Sprache und Kommunikation
> Fachgebiet Allgemeine Linguistik
> Sekretariat H42
> Straße des 17. Juni 135, 10623 Berlin
> +49 (0) 30 314 22323
> simon.meier at tu-berlin.de
> http://www.linguistik.tu-berlin.de/menue/mitarbeiterinnen/wiss_mitarbeiterinnen/simon_meier/
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
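
For what it is worth, the parsing problem and the underscore workaround described in
Andrew's quoted message can be sketched roughly as follows. This is only an illustration
under assumptions: the pattern WORD_TAG is a made-up stand-in for "a word, then /, then a
tag" and is not CQPweb's actual regular expression, and the helper names are hypothetical.

```python
import re

# Hypothetical stand-in for the word/tag pattern; NOT CQPweb's real regex.
WORD_TAG = re.compile(r"^(.+)/([^/]+)$")


def parse_concordance(line):
    """Split a concordance line on whitespace and parse each chunk as word/tag."""
    parsed = []
    for chunk in line.split():
        match = WORD_TAG.match(chunk)
        if match:
            parsed.append((match.group(1), match.group(2)))
        else:
            # No "/tag" part: this is the case reported as [UNREADABLE].
            parsed.append("[UNREADABLE]")
    return parsed


def join_multiword(token, sep="_"):
    """Replace the spaces inside a multiword token with another character."""
    return sep.join(token.split())


# Space inside the token: the first two chunks carry no tag.
print(parse_concordance("por el contrario/ADV"))
# ['[UNREADABLE]', '[UNREADABLE]', ('contrario', 'ADV')]

# Same token after the replacement is applied to the input data.
print(parse_concordance(join_multiword("por el contrario") + "/ADV"))
# [('por_el_contrario', 'ADV')]
```

The replacement has to happen in the input data before the corpus is encoded, as Andrew
says; the sketch only shows why the joined form survives a whitespace split.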




_______________________

andrés chandía
