[CWB] Problems with TreeTagger Portuguese
Meier-Vieracker, Simon
simon.meier at tu-berlin.de
Tue Aug 28 13:51:00 CEST 2018
Thanks, Stefan, for these hints!
Both parameter files, portuguese-utf8.par and portuguese2-utf8.par, seem to produce good output after standard tokenization with utf8-tokenize.perl. However, the two tagsets are rather different (see below). Since I would like to make the data available to interested users in CQPweb, I wonder which tagset is more common and more suitable. At first glance, the second one seems to have a lower error rate, e.g. with proper nouns...
So if anyone here is familiar with Portuguese and the standards of Portuguese corpus linguistics, I would be grateful for any advice.
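For reference, the manual pipeline mentioned above (tokenize with utf8-tokenize.perl, then tag with one of the two parameter files) can be sketched roughly as follows. This is only a sketch under assumptions: the TT_HOME variable and the tag_pt function name are hypothetical, the tagger options are the ones from the wrapper script quoted further down, and the Portuguese splitter and post-tagging steps of that wrapper are deliberately left out.

```shell
#!/bin/sh
# Sketch only: TT_HOME and tag_pt are hypothetical names, not part of TreeTagger.
# TT_HOME is assumed to contain the usual bin/, cmd/ and lib/ directories.
TT_HOME=${TT_HOME:-$HOME/TreeTagger}

# tag_pt PARFILE [FILE...]
# Tokenize the input with the Perl tokenizer, then tag it with PARFILE.
# No clitic/contraction splitting and no post-tagging is done here.
tag_pt() {
    parfile=$1
    shift
    "$TT_HOME/cmd/utf8-tokenize.perl" "$@" |
    "$TT_HOME/bin/tree-tagger" "$parfile" -token -lemma -sgml
}
```

With this in place, something like `tag_pt "$TT_HOME/lib/portuguese2-utf8.par" input.txt > out2.vrt` would tag a file with the second parameter file (input.txt and out2.vrt are placeholder names).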
Best, Simon
> Boa AQ0 bom
> tarde NCFS tarde
> . Fp .
> É VMI ser
> este PD0 este
> o DA0 o
> dia NCMS dia
> : Fd :
> Benfica NCMS Benfica
> e CC e
> FC NCMS FC
> Porto VMI portar
> defrontam-se NCFS defrontam-se
> a SPS a
> partir VMN partir
> das SPS de+as
> 18 Z @card@
> horas NCFP hora
> no SP+DA em+o
> Estádio NCMP Estádio
> da SPS de+a
> Luz NCMP Luz
> , Fc ,
> no SP+DA em+o
> clássico NCMS clássico
> que PR0 que
> vale VMI valer
> a DA0 o
> liderança NCFS liderança
> . Fp .
and
> Boa ADJ.Fem.Sing bom
> tarde NOUN.Fem.Sing tarde
> . PUNCT.Sent .
> É AUX.Fin.Sing ser
> este DET.Masc.Sing este
> o DET.Masc.Sing o
> dia NOUN.Masc.Sing dia
> : PUNCT.Colon :
> Benfica PROPN.Masc.Sing Benfica
> e CCONJ e
> FC PROPN.Masc.Sing FC
> Porto PROPN.Sing Porto
> defrontam-se VERB.Fin.Sing defrontam-se
> a ADP a
> partir NOUN partir
> das ADP_DET.Fem.Plur de_o
> 18 NUM 18
> horas NOUN.Fem.Plur hora
> no ADP_DET.Masc.Sing em_o
> Estádio PROPN.Masc.Sing Estádio
> da ADP_DET.Fem.Sing de_o
> Luz PROPN.Sing Luz
> , PUNCT.Comma ,
> no ADP_DET.Masc.Sing em_o
> clássico NOUN.Masc.Sing clássico
> que PRON.Rel.Masc.Sing que
> vale VERB.Fin.Sing valer
> a DET.Fem.Sing o
> liderança NOUN.Fem.Sing liderança
> . PUNCT.Sent .
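To make the differences between the two outputs easier to inspect, the two tagged files can be put side by side. The following is a minimal sketch, assuming both files are tab-separated token/tag/lemma lines produced from the same tokenization, so that the rows align one to one; the function name diff_lemmas is hypothetical.

```shell
#!/bin/sh
# Sketch only: diff_lemmas is a hypothetical helper name.
# Both inputs are assumed to be tab-separated token/tag/lemma lines
# from the same tokenization, so line N of each file is the same token.

# diff_lemmas FILE1 FILE2
# Print token, tag/lemma from FILE1 and tag/lemma from FILE2
# for every row where the two lemmas differ.
diff_lemmas() {
    paste "$1" "$2" |
    awk -F '\t' '$3 != $6 { print $1 "\t" $2 "/" $3 "\t" $5 "/" $6 }'
}
```

On the two listings above, this would flag rows such as "Porto" (lemma portar vs. Porto), but also purely notational differences such as de+as vs. de_o, so the result still needs manual inspection.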
---
> On 28 Aug 2018, at 12:58, Stefan Evert <stefanML at collocations.de> wrote:
>
> This shell script uses a very old tokenizer that doesn't understand XML tags at all (and which I used to have to work around 20 years ago …).
>
> Most other TreeTagger wrappers now use a Perl script
>
> TOKENIZER=${CMD}/utf8-tokenize.perl
>
> instead of the flex tokenizer
>
> TOKENIZER=${BIN}/separate-punctuation
>
> Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer? It probably depends on whether you need the clitic and contraction splitting in
>
> SPLITTER=${CMD}/portuguese-splitter.perl
>
> which the alternative parameter file doesn't seem to do.
>
> Best,
> Stefan
>
>> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier at tu-berlin.de> wrote:
>>
>>
>> I could tokenize first and then run the tree-tagger with the parameter file manually, of course, but I wonder whether Portuguese has clitics that require special tokenization...
>>
>> ---
>>
>> #!/bin/sh
>>
>> # Set these paths appropriately
>>
>> BIN=/Users/Simon/TreeTagger/bin
>> CMD=/Users/Simon/TreeTagger/cmd
>> LIB=/Users/Simon/TreeTagger/lib
>>
>> TOKENIZER=${BIN}/separate-punctuation
>> SPLITTER=${CMD}/portuguese-splitter.perl
>> TAGGER=${BIN}/tree-tagger
>> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
>> POST_TAGGING=${CMD}/portuguese-post-tagging
>> PARFILE=${LIB}/portuguese-utf8.par
>>
>> # splitting
>> $SPLITTER $* |
>> # pre-tokenization
>> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
>> # tokenizing
>> $TOKENIZER +1 +s +l $ABBR_LIST |
>> # remove empty lines
>> grep -v '^$' |
>> # tagging
>> $TAGGER $PARFILE -token -lemma -no-unknown -sgml |
>> $POST_TAGGING -no
>>