[CWB] Problems with TreeTagger Portuguese

Tue Aug 28 16:49:46 CEST 2018

Dear Simon,
If you want to do serious work with Portuguese, I suggest that you
also look at the work done by Eckhard Bick (PALAVRAS parser), and look
at the POS-tagged corpora available from Linguateca. You can contact
me off-list for more info on both.
Diana
Hardie, Andrew <a.hardie  lancaster.ac.uk> wrote on Tuesday,
28/08/2018 at 15:05:
>
> Hi Simon,
>
> The tagsets in your two examples are FreeLing-based and Universal Dependencies-based, respectively. Off the top of my head I do not know which is more widely used for Portuguese. The SketchEngine PtTenTen and Mark Davies' corpusdoportugues.org are both POS-tagged ... but each of those uses yet another tagset!
>
> In sum, I suggest that you are probably safe with either tagset, unless your users have prior experience of one or the other, in which case obviously prefer that one.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces  sslmit.unibo.it <cwb-bounces  sslmit.unibo.it> On Behalf Of Meier-Vieracker, Simon
> Sent: 28 August 2018 12:51
> To: Open source development of the Corpus WorkBench <cwb  sslmit.unibo.it>
> Subject: Re: [CWB] Problems with TreeTagger Portuguese
>
> Thanks, Stefan, for these hints!
>
> Both parameter files, portuguese-utf8.par and portuguese2-utf8.par, seem to produce good output after standard tokenization with utf8-tokenize.perl. However, the two tagsets are rather different (see below). Since I would like to provide the data to interested users in CQPweb, I wonder which tagset is more common and more suitable. At first glance, the second one seems to have a lower error rate, e.g. with proper nouns...
>
> So if anyone here is familiar with Portuguese and the standards of Portuguese corpus linguistics, I would be grateful for any advice.
>
> Best, Simon
>
> > Boa           AQ0     bom
> > tarde         NCFS    tarde
> > .             Fp      .
> > É             VMI     ser
> > este          PD0     este
> > o             DA0     o
> > dia           NCMS    dia
> > :             Fd      :
> > Benfica       NCMS    Benfica
> > e             CC      e
> > FC            NCMS    FC
> > Porto         VMI     portar
> > defrontam-se  NCFS    defrontam-se
> > a             SPS     a
> > partir        VMN     partir
> > das           SPS     de+as
> > 18            Z       @card@
> > horas         NCFP    hora
> > no            SP+DA   em+o
> > Estádio       NCMP    Estádio
> > da            SPS     de+a
> > Luz           NCMP    Luz
> > ,             Fc      ,
> > no            SP+DA   em+o
> > clássico      NCMS    clássico
> > que           PR0     que
> > vale          VMI     valer
> > a             DA0     o
> > liderança     NCFS    liderança
> > .             Fp      .
>
> and
>
> > Boa           ADJ.Fem.Sing        bom
> > tarde         NOUN.Fem.Sing       tarde
> > .             PUNCT.Sent          .
> > É             AUX.Fin.Sing        ser
> > este          DET.Masc.Sing       este
> > o             DET.Masc.Sing       o
> > dia           NOUN.Masc.Sing      dia
> > :             PUNCT.Colon         :
> > Benfica       PROPN.Masc.Sing     Benfica
> > e             CCONJ               e
> > FC            PROPN.Masc.Sing     FC
> > Porto         PROPN.Sing          Porto
> > defrontam-se  VERB.Fin.Sing       defrontam-se
> > a             ADP                 a
> > partir        NOUN                partir
> > das           ADP_DET.Fem.Plur    de_o
> > 18            NUM                 18
> > horas         NOUN.Fem.Plur       hora
> > no            ADP_DET.Masc.Sing   em_o
> > Estádio       PROPN.Masc.Sing     Estádio
> > da            ADP_DET.Fem.Sing    de_o
> > Luz           PROPN.Sing          Luz
> > ,             PUNCT.Comma         ,
> > no            ADP_DET.Masc.Sing   em_o
> > clássico      NOUN.Masc.Sing      clássico
> > que           PRON.Rel.Masc.Sing  que
> > vale          VERB.Fin.Sing       valer
> > a             DET.Fem.Sing        o
> > liderança     NOUN.Fem.Sing       liderança
> > .             PUNCT.Sent          .
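
Since the two tagsets cannot be compared tag by tag, one rough way to see where the two parameter files disagree is to compare their lemma columns. The following is only a sketch under stated assumptions: both outputs were saved as tab-separated files (token, tag, lemma) from the same tokenization, so that line N in one file corresponds to line N in the other. The function name lemma_diff and the file names are hypothetical.

```shell
#!/bin/sh
# lemma_diff FILE_A FILE_B  (hypothetical helper, not part of TreeTagger)
# Each file: one token per line, tab-separated columns token, tag, lemma.
# Prints token, tag A, lemma A, tag B, lemma B wherever the lemmas differ.
lemma_diff() {
    paste "$1" "$2" |
    awk -F '\t' -v OFS='\t' '
        # Lines without six fields (e.g. XML tags passed through by -sgml) are skipped.
        NF < 6   { next }
        # Stop if the two files are no longer aligned token by token.
        $1 != $4 { print "misaligned at line " NR > "/dev/stderr"; exit 1 }
        # Report tokens whose lemmas disagree.
        $3 != $6 { print $1, $2, $3, $5, $6 }
    '
}
```

Run on the two outputs quoted above (saved, say, as tagged1.txt and tagged2.txt), lemma_diff tagged1.txt tagged2.txt would list, among others, Porto (portar vs. Porto), 18 (@card@ vs. 18) and the contractions (de+as vs. de_o). Some of these are mere notational differences, so the list still needs to be read by hand.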
>
>
> ---
>
> > On 28.08.2018 at 12:58, Stefan Evert <stefanML  collocations.de> wrote:
> >
> > This shell script uses a very old tokenizer that doesn't understand XML tags at all (and which I used to have to work around 20 years ago …).
> >
> > Most other TreeTagger wrappers now use a Perl script
> >
> >       TOKENIZER=${CMD}/utf8-tokenize.perl
> >
> > instead of the flex tokenizer
> >
> >       TOKENIZER=${BIN}/separate-punctuation
> >
> > Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer?  It probably depends on whether you need the clitics- and contraction-splitting in
> >
> >       SPLITTER=${CMD}/portuguese-splitter.perl
> >
> > which the alternative parameter file doesn't seem to do.
> >
> > Best,
> > Stefan
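
For reference, the "tokenize with the Perl script, then tag with the alternative parameter file" route could be sketched roughly as follows. This is a minimal sketch, not the wrapper shipped with TreeTagger: the installation directory in TT_HOME and the function name tag_portuguese are assumptions, the parameter file name is the one mentioned elsewhere in this thread, and the tagger options are the ones used in the quoted script. No clitic or contraction splitting is done here.

```shell
#!/bin/sh
# Assumed installation layout; override these variables as needed.
: "${TT_HOME:=$HOME/TreeTagger}"
: "${TOKENIZER:=$TT_HOME/cmd/utf8-tokenize.perl}"
: "${TAGGER:=$TT_HOME/bin/tree-tagger}"
: "${PARFILE:=$TT_HOME/lib/portuguese2-utf8.par}"

# tag_portuguese [FILE...]  (hypothetical helper)
# Reads the given files (or standard input), writes token, tag, lemma lines.
tag_portuguese() {
    # tokenizing: one token per line
    "$TOKENIZER" "$@" |
    # remove empty lines
    grep -v '^$' |
    # tagging; -sgml makes the tagger pass XML tags through untagged
    "$TAGGER" -token -lemma -no-unknown -sgml "$PARFILE"
}
```

Typical use would be tag_portuguese corpus.txt > corpus.tagged, where corpus.txt is a placeholder for a UTF-8 text file.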
> >
> >
> >
> >
> >> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier  tu-berlin.de> wrote:
> >>
> >>
> >> I could tokenize first and then run the tree-tagger with the parameter file manually, of course, but I wonder whether Portuguese has clitics that require special tokenization...
> >>
> >> ---
> >>
> >> #!/bin/sh
> >>
> >> # Set these paths appropriately
> >>
> >> BIN=/Users/Simon/TreeTagger/bin
> >> CMD=/Users/Simon/TreeTagger/cmd
> >> LIB=/Users/Simon/TreeTagger/lib
> >>
> >> TOKENIZER=${BIN}/separate-punctuation
> >> SPLITTER=${CMD}/portuguese-splitter.perl
> >> TAGGER=${BIN}/tree-tagger
> >> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
> >> POST_TAGGING=${CMD}/portuguese-post-tagging
> >> PARFILE=${LIB}/portuguese-utf8.par
> >>
> >> # splitting
> >> $SPLITTER $* |
> >> # pre-tokenization
> >> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
> >> # tokenizing
> >> $TOKENIZER +1 +s +l $ABBR_LIST |
> >> # remove empty lines
> >> grep -v '^$' |
> >> # tagging
> >> $TAGGER $PARFILE -token -lemma -no-unknown -sgml |
> >> $POST_TAGGING -no
> >>
> >
> > _______________________________________________
> > CWB mailing list
> > CWB  sslmit.unibo.it
> > http://liste.sslmit.unibo.it/mailman/listinfo/cwb