[CWB] Problems with TreeTagger Portuguese
Diana Santos
dianamsmpsantos at gmail.com
Tue Aug 28 16:49:46 CEST 2018
Dear Simon,
If you want to do serious work with Portuguese, I suggest that you
also look at the work done by Eckhard Bick (PALAVRAS parser), and look
at the POS-tagged corpora available from Linguateca. You can contact
me off-list for more info on both.
Diana
Hardie, Andrew <a.hardie lancaster.ac.uk> wrote on Tuesday,
28/08/2018 at 15:05:
>
> Hi Simon,
>
> The tagsets in your two examples are FreeLing-based and Universal Dependencies-based, respectively. Off the top of my head, I do not know which is more widely used for Portuguese. The SketchEngine PtTenTen and Mark Davies' corpusdoportugues.org are both POS-tagged, but each of those uses yet another tagset!
>
> In sum I suggest that you are probably safe with either tagset, unless your users have prior experience of one or the other, in which case obviously prefer that one.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces sslmit.unibo.it <cwb-bounces sslmit.unibo.it> On Behalf Of Meier-Vieracker, Simon
> Sent: 28 August 2018 12:51
> To: Open source development of the Corpus WorkBench <cwb sslmit.unibo.it>
> Subject: Re: [CWB] Problems with TreeTagger Portuguese
>
> Thanks, Stefan, for these hints!
>
> Both parameter files, portuguese-utf8.par and portuguese2-utf8.par, seem to produce good output after standard tokenization with utf8-tokenize.perl. However, the two tagsets are rather different (see below). Since I would like to provide the data to interested users in CQPweb, I wonder which tagset is more common and more suitable. At first glance, the second one seems to have a lower error rate, e.g. with proper nouns...
>
> So if anyone here is familiar with Portuguese and the standards of Portuguese corpus linguistics, I would be grateful for any advice.
>
> Best, Simon
>
> > Boa AQ0 bom
> > tarde NCFS tarde
> > . Fp .
> > É VMI ser
> > este PD0 este
> > o DA0 o
> > dia NCMS dia
> > : Fd :
> > Benfica NCMS Benfica
> > e CC e
> > FC NCMS FC
> > Porto VMI portar
> > defrontam-se NCFS defrontam-se
> > a SPS a
> > partir VMN partir
> > das SPS de+as
> > 18 Z @card@
> > horas NCFP hora
> > no SP+DA em+o
> > Estádio NCMP Estádio
> > da SPS de+a
> > Luz NCMP Luz
> > , Fc ,
> > no SP+DA em+o
> > clássico NCMS clássico
> > que PR0 que
> > vale VMI valer
> > a DA0 o
> > liderança NCFS liderança
> > . Fp .
>
> and
>
> > Boa ADJ.Fem.Sing bom
> > tarde NOUN.Fem.Sing tarde
> > . PUNCT.Sent .
> > É AUX.Fin.Sing ser
> > este DET.Masc.Sing este
> > o DET.Masc.Sing o
> > dia NOUN.Masc.Sing dia
> > : PUNCT.Colon :
> > Benfica PROPN.Masc.Sing Benfica
> > e CCONJ e
> > FC PROPN.Masc.Sing FC
> > Porto PROPN.Sing Porto
> > defrontam-se VERB.Fin.Sing defrontam-se
> > a ADP a
> > partir NOUN partir
> > das ADP_DET.Fem.Plur de_o
> > 18 NUM 18
> > horas NOUN.Fem.Plur hora
> > no ADP_DET.Masc.Sing em_o
> > Estádio PROPN.Masc.Sing Estádio
> > da ADP_DET.Fem.Sing de_o
> > Luz PROPN.Sing Luz
> > , PUNCT.Comma ,
> > no ADP_DET.Masc.Sing em_o
> > clássico NOUN.Masc.Sing clássico
> > que PRON.Rel.Masc.Sing que
> > vale VERB.Fin.Sing valer
> > a DET.Fem.Sing o
> > liderança NOUN.Fem.Sing liderança
> > . PUNCT.Sent .
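
A quick way to see where the two parameter files disagree is to line the two outputs up token by token. The sketch below assumes that both runs used the same tokenization and that each line holds token, tag and lemma separated by whitespace. `compare_lemmas` is a hypothetical helper name, not part of TreeTagger or CWB, and the sample rows are copied from the two outputs above.

```shell
#!/bin/sh
# Sketch: line up two TreeTagger outputs (token, tag, lemma per line)
# and print the tokens whose lemmas differ between the two runs.
# compare_lemmas is a hypothetical helper, not part of TreeTagger or CWB.
compare_lemmas() {
    # paste joins line N of both files; awk then sees six fields per line:
    # $1-$3 come from the first file, $4-$6 from the second.
    paste "$1" "$2" | awk '
        $1 != $4 { printf "MISALIGNED\t%s\t%s\n", $1, $4; next }
        $3 != $6 { printf "%s\t%s/%s\t%s/%s\n", $1, $2, $3, $5, $6 }
    '
}

# Sample rows copied from the two outputs quoted in this thread.
tmp_a=$(mktemp)
tmp_b=$(mktemp)
cat > "$tmp_a" <<'EOF'
FC NCMS FC
Porto VMI portar
horas NCFP hora
EOF
cat > "$tmp_b" <<'EOF'
FC PROPN.Masc.Sing FC
Porto PROPN.Sing Porto
horas NOUN.Fem.Plur hora
EOF
compare_lemmas "$tmp_a" "$tmp_b"
rm -f "$tmp_a" "$tmp_b"
```

On the sample rows this reports only the `Porto` line, where the first parameter file lemmatizes the token as the verb `portar`. On the full outputs, purely notational differences (for example `de+as` versus `de_o`, or `@card@` versus `18`) would be listed as well, so the result still needs a manual review.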
>
>
> ---
>
> > On 28.08.2018 at 12:58, Stefan Evert <stefanML collocations.de> wrote:
> >
> > This shell script uses a very old tokenizer that doesn't understand XML tags at all (and which I used to have to work around 20 years ago …).
> >
> > Most other TreeTagger wrappers now use a Perl script
> >
> > TOKENIZER=${CMD}/utf8-tokenize.perl
> >
> > instead of the flex tokenizer
> >
> > TOKENIZER=${BIN}/separate-punctuation
> >
> > Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer? It probably depends on whether you need the clitics- and contraction-splitting in
> >
> > SPLITTER=${CMD}/portuguese-splitter.perl
> >
> > which the alternative parameter file doesn't seem to do.
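
Whether the missing splitter matters for a given corpus can be checked on the tagged output itself. The sketch below lists tokens that still end in a hyphenated enclitic pronoun, such as `defrontam-se`. The helper name `find_unsplit_clitics` and the pronoun list are assumptions made for illustration; the list is a heuristic and not exhaustive.

```shell
#!/bin/sh
# Sketch: list tokens in TreeTagger output that still carry a hyphenated
# enclitic pronoun, i.e. tokens the pipeline did not split.
# find_unsplit_clitics is a hypothetical helper; the pronoun list is a
# heuristic, not a complete inventory of Portuguese clitics.
find_unsplit_clitics() {
    # Reads the named files, or standard input when no file is given.
    awk 'tolower($1) ~ /-(me|te|se|nos|vos|lhe|lhes|o|a|os|as|lo|la|los|las)$/ { print $1 }' "$@"
}

# Sample rows in the token/tag/lemma layout used in this thread.
printf '%s\n' 'Porto PROPN.Sing Porto' \
              'defrontam-se VERB.Fin.Sing defrontam-se' \
              'a ADP a' | find_unsplit_clitics
```

If the list is long for a sample of the corpus, the splitter-based pipeline may be worth the extra effort; if it is short, the simpler tokenizer may be sufficient.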
> >
> > Best,
> > Stefan
> >
> >
> >
> >
> >> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier tu-berlin.de> wrote:
> >>
> >>
> >> I could tokenize first and then run the tree-tagger with the parameter file manually, of course, but I wonder whether Portuguese has clitics that require special tokenization...
> >>
> >> ---
> >>
> >> #!/bin/sh
> >>
> >> # Set these paths appropriately
> >>
> >> BIN=/Users/Simon/TreeTagger/bin
> >> CMD=/Users/Simon/TreeTagger/cmd
> >> LIB=/Users/Simon/TreeTagger/lib
> >>
> >> TOKENIZER=${BIN}/separate-punctuation
> >> SPLITTER=${CMD}/portuguese-splitter.perl
> >> TAGGER=${BIN}/tree-tagger
> >> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
> >> POST_TAGGING=${CMD}/portuguese-post-tagging
> >> PARFILE=${LIB}/portuguese-utf8.par
> >>
> >> # splitting
> >> $SPLITTER $* |
> >> # pre-tokenization
> >> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
> >> # tokenizing
> >> $TOKENIZER +1 +s +l $ABBR_LIST |
> >> # remove empty lines
> >> grep -v '^$' |
> >> # tagging
> >> $TAGGER $PARFILE -token -lemma -no-unknown -sgml |
> >> $POST_TAGGING -no
> >>
> >
> > _______________________________________________
> > CWB mailing list
> > CWB sslmit.unibo.it
> > http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>
>