[CWB] Problems with TreeTagger Portuguese
Hardie, Andrew
a.hardie at lancaster.ac.uk
Tue Aug 28 15:05:09 CEST 2018
Hi Simon,
The tagsets in your two examples are FreeLing-based and Universal Dependencies-based, respectively. Off the top of my head, I do not know which is more widely used for Portuguese. The SketchEngine PtTenTen and Mark Davies' corpusdoportugues.org are both POS-tagged, but each of those uses yet another tagset!
In sum, I suggest that you are probably safe with either tagset, unless your users have prior experience of one or the other, in which case you should obviously prefer that one.
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Meier-Vieracker, Simon
Sent: 28 August 2018 12:51
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Problems with TreeTagger Portuguese
Thanks, Stefan, for these hints!
Both parameter files, portuguese-utf8.par and portuguese2-utf8.par, seem to produce good output after standard tokenization with utf8-tokenize.perl. However, the two tagsets are rather different (see below). Since I would like to make the data available to interested users in CQPweb, I wonder which tagset is more common and more suitable. At first glance, the second one seems to make fewer errors, e.g. with proper nouns...
So if there is someone familiar with Portuguese and the standards of Portuguese corpus linguistics, I would be grateful for any advice.
Best, Simon
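A minimal sketch of the manual pipeline described above (standard tokenization with utf8-tokenize.perl, then tagging with either parameter file). The names TT_HOME and tag_portuguese are illustrative only, the paths mirror the original wrapper script and will need adjusting, and the option set is an assumption rather than the exact command that produced the samples:

```shell
#!/bin/sh
# Sketch only: paths follow the original wrapper script and must be adapted.
# TT_HOME and tag_portuguese are illustrative names, not part of TreeTagger.
TT_HOME=${TT_HOME:-/Users/Simon/TreeTagger}
TOKENIZER=${TOKENIZER:-$TT_HOME/cmd/utf8-tokenize.perl}
TAGGER=${TAGGER:-$TT_HOME/bin/tree-tagger}
ABBR_LIST=${ABBR_LIST:-$TT_HOME/lib/portuguese-abbreviations-utf8}

# tag_portuguese PARFILE [INPUT...]
# Tokenizes the input files (or stdin), drops empty lines,
# and tags the token stream with the given parameter file.
tag_portuguese() {
    parfile=$1
    shift
    "$TOKENIZER" -a "$ABBR_LIST" "$@" |
    grep -v '^$' |
    "$TAGGER" -token -lemma -sgml "$parfile"
}
```

Calling tag_portuguese with the path to portuguese-utf8.par or portuguese2-utf8.par, followed by a text file, would then write token, tag and lemma columns to standard output. Note that this sketch skips portuguese-splitter.perl entirely.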
> Boa AQ0 bom
> tarde NCFS tarde
> . Fp .
> É VMI ser
> este PD0 este
> o DA0 o
> dia NCMS dia
> : Fd :
> Benfica NCMS Benfica
> e CC e
> FC NCMS FC
> Porto VMI portar
> defrontam-se NCFS defrontam-se
> a SPS a
> partir VMN partir
> das SPS de+as
> 18 Z @card@
> horas NCFP hora
> no SP+DA em+o
> Estádio NCMP Estádio
> da SPS de+a
> Luz NCMP Luz
> , Fc ,
> no SP+DA em+o
> clássico NCMS clássico
> que PR0 que
> vale VMI valer
> a DA0 o
> liderança NCFS liderança
> . Fp .
and
> Boa ADJ.Fem.Sing bom
> tarde NOUN.Fem.Sing tarde
> . PUNCT.Sent .
> É AUX.Fin.Sing ser
> este DET.Masc.Sing este
> o DET.Masc.Sing o
> dia NOUN.Masc.Sing dia
> : PUNCT.Colon :
> Benfica PROPN.Masc.Sing Benfica
> e CCONJ e
> FC PROPN.Masc.Sing FC
> Porto PROPN.Sing Porto
> defrontam-se VERB.Fin.Sing defrontam-se
> a ADP a
> partir NOUN partir
> das ADP_DET.Fem.Plur de_o
> 18 NUM 18
> horas NOUN.Fem.Plur hora
> no ADP_DET.Masc.Sing em_o
> Estádio PROPN.Masc.Sing Estádio
> da ADP_DET.Fem.Sing de_o
> Luz PROPN.Sing Luz
> , PUNCT.Comma ,
> no ADP_DET.Masc.Sing em_o
> clássico NOUN.Masc.Sing clássico
> que PRON.Rel.Masc.Sing que
> vale VERB.Fin.Sing valer
> a DET.Fem.Sing o
> liderança NOUN.Fem.Sing liderança
> . PUNCT.Sent .
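To see where the two outputs diverge, the two tagged files can be compared line by line. A minimal sketch, assuming both files are tab-separated TreeTagger output (token, tag, lemma) produced from the same tokenization and containing no XML tags; compare_taggings is a hypothetical helper name:

```shell
#!/bin/sh
# compare_taggings FILE_A FILE_B
# Assumes each file has three tab-separated columns (token, tag, lemma)
# and that both files are aligned line by line.
# Prints token, tag A, lemma A, tag B, lemma B wherever the lemmas differ.
compare_taggings() {
    paste "$1" "$2" |
    awk -F '\t' '$3 != $6 { printf "%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $5, $6 }'
}
```

On the samples above, this would list for instance Porto (portar vs. Porto) and 18 (@card@ vs. 18), as well as the contractions, whose lemmas are written differently (de+as vs. de_o).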
---
> On 28.08.2018 at 12:58, Stefan Evert <stefanML at collocations.de> wrote:
>
> This shell script uses a very old tokenizer that doesn't understand XML tags at all (and which I used to have to work around 20 years ago …).
>
> Most other TreeTagger wrappers now use a Perl script
>
> TOKENIZER=${CMD}/utf8-tokenize.perl
>
> instead of the flex tokenizer
>
> TOKENIZER=${BIN}/separate-punctuation
>
> Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer? It probably depends on whether you need the clitics- and contraction-splitting in
>
> SPLITTER=${CMD}/portuguese-splitter.perl
>
> which the alternative parameter file doesn't seem to do.
>
> Best,
> Stefan
>
>> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier at tu-berlin.de> wrote:
>>
>>
>> I could tokenize first and then run the tree-tagger with the parameter file manually, of course, but I wonder whether Portuguese has some clitics that call for special tokenization...
>>
>> ---
>>
>> #!/bin/sh
>>
>> # Set these paths appropriately
>>
>> BIN=/Users/Simon/TreeTagger/bin
>> CMD=/Users/Simon/TreeTagger/cmd
>> LIB=/Users/Simon/TreeTagger/lib
>>
>> TOKENIZER=${BIN}/separate-punctuation
>> SPLITTER=${CMD}/portuguese-splitter.perl
>> TAGGER=${BIN}/tree-tagger
>> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
>> POST_TAGGING=${CMD}/portuguese-post-tagging
>> PARFILE=${LIB}/portuguese-utf8.par
>>
>> # splitting
>> $SPLITTER $* |
>> # pre-tokenization
>> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
>> # tokenizing
>> $TOKENIZER +1 +s +l $ABBR_LIST |
>> # remove empty lines
>> grep -v '^$' |
>> # tagging
>> $TAGGER $PARFILE -token -lemma -no-unknown -sgml |
>> $POST_TAGGING -no
>>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb