[CWB] Problems with TreeTagger Portuguese

Tue Aug 28 13:51:00 CEST 2018

Thanks, Stefan, for these hints! 

Both parameter files, portuguese-utf8.par and portuguese2-utf8.par, seem to produce good output after standard tokenization with utf8-tokenize.perl. However, the two tagsets are rather different (see below). Since I would like to make the data available to interested users in CQPweb, I wonder which tagset is more common and more suitable. At first glance, the second one seems to make fewer errors, e.g. with proper nouns: the first tags "Porto" as a verb (VMI, lemma "portar"), whereas the second tags it as PROPN.

So if anyone here is familiar with Portuguese and the standards of Portuguese corpus linguistics, I would be grateful for any advice.

Best, Simon

> Boa	AQ0	bom
> tarde	NCFS	tarde
> .	Fp	.
> É	VMI	ser
> este	PD0	este
> o	DA0	o
> dia	NCMS	dia
> :	Fd	:
> Benfica	NCMS	Benfica
> e	CC	e
> FC	NCMS	FC
> Porto	VMI	portar
> defrontam-se	NCFS	defrontam-se
> a	SPS	a
> partir	VMN	partir
> das	SPS	de+as
> 18	Z	@card@
> horas	NCFP	hora
> no	SP+DA	em+o
> Estádio	NCMP	Estádio
> da	SPS	de+a
> Luz	NCMP	Luz
> ,	Fc	,
> no	SP+DA	em+o
> clássico	NCMS	clássico
> que	PR0	que
> vale	VMI	valer
> a	DA0	o
> liderança	NCFS	liderança
> .	Fp	.

and

> Boa	ADJ.Fem.Sing	bom
> tarde	NOUN.Fem.Sing	tarde
> .	PUNCT.Sent	.
> É	AUX.Fin.Sing	ser
> este	DET.Masc.Sing	este
> o	DET.Masc.Sing	o
> dia	NOUN.Masc.Sing	dia
> :	PUNCT.Colon	:
> Benfica	PROPN.Masc.Sing	Benfica
> e	CCONJ	e
> FC	PROPN.Masc.Sing	FC
> Porto	PROPN.Sing	Porto
> defrontam-se	VERB.Fin.Sing	defrontam-se
> a	ADP	a
> partir	NOUN	partir
> das	ADP_DET.Fem.Plur	de_o
> 18	NUM	18
> horas	NOUN.Fem.Plur	hora
> no	ADP_DET.Masc.Sing	em_o
> Estádio	PROPN.Masc.Sing	Estádio
> da	ADP_DET.Fem.Sing	de_o
> Luz	PROPN.Sing	Luz
> ,	PUNCT.Comma	,
> no	ADP_DET.Masc.Sing	em_o
> clássico	NOUN.Masc.Sing	clássico
> que	PRON.Rel.Masc.Sing	que
> vale	VERB.Fin.Sing	valer
> a	DET.Fem.Sing	o
> liderança	NOUN.Fem.Sing	liderança
> .	PUNCT.Sent	.
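
To compare the two outputs token by token, something like the following might help. It is only a minimal sketch: the function name compare_tags is made up for illustration, and it assumes that both outputs were saved as plain tab-separated files (token, tag, lemma, without the "> " quoting used above) and that both were tokenized identically, so that the lines match up.

```shell
#!/bin/sh
# Hypothetical helper, not part of TreeTagger.
# Usage: compare_tags FILE1 FILE2
# Prints: token, tag from FILE1, tag from FILE2, and either "=" if the lemmas
# agree or "lemma1/lemma2" if they differ.
compare_tags() {
    # paste joins the files line by line with a tab, so fields 1-3 come
    # from FILE1 and fields 4-6 from FILE2.
    paste "$1" "$2" |
    awk -F'\t' -v OFS='\t' '
        $1 != $4 {
            # Tokens differ: the files are not aligned line by line.
            print "misaligned at line " NR ": " $1 " vs " $4 > "/dev/stderr"
            exit 1
        }
        {
            if ($3 == $6) lemma = "="
            else          lemma = $3 "/" $6
            print $1, $2, $5, lemma
        }'
}
```

For the token "Porto" in the outputs above, this would print the two tags VMI and PROPN.Sing side by side, followed by "portar/Porto".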

---

> On 28.08.2018 at 12:58, Stefan Evert <stefanML at collocations.de> wrote:
> 
> This shell script uses a very old tokenizer that doesn't understand XML tags at all (and which I used to have to work around 20 years ago …).
> 
> Most other TreeTagger wrappers now use a Perl script
> 
> 	TOKENIZER=${CMD}/utf8-tokenize.perl 
> 
> instead of the flex tokenizer
> 
> 	TOKENIZER=${BIN}/separate-punctuation
> 
> Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer?  It probably depends on whether you need the clitics- and contraction-splitting in
> 
> 	SPLITTER=${CMD}/portuguese-splitter.perl
> 
> which the alternative parameter file doesn't seem to do.
> 
> Best,
> Stefan
> 
> 
> 
> 
>> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier at tu-berlin.de> wrote:
>> 
>> 
>> I could tokenize first and then run the tree-tagger with the parameter file manually, of course, but I wonder whether Portuguese has clitics that require special tokenization...
>> 
>> ---
>> 
>> #!/bin/sh
>> 
>> # Set these paths appropriately
>> 
>> BIN=/Users/Simon/TreeTagger/bin
>> CMD=/Users/Simon/TreeTagger/cmd
>> LIB=/Users/Simon/TreeTagger/lib
>> 
>> TOKENIZER=${BIN}/separate-punctuation
>> SPLITTER=${CMD}/portuguese-splitter.perl
>> TAGGER=${BIN}/tree-tagger
>> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
>> POST_TAGGING=${CMD}/portuguese-post-tagging
>> PARFILE=${LIB}/portuguese-utf8.par
>> 
>> # splitting 
>> $SPLITTER $* |
>> # pre-tokenization
>> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
>> # tokenizing
>> $TOKENIZER +1 +s +l $ABBR_LIST |
>> # remove empty lines
>> grep -v '^$' |
>> # tagging
>> $TAGGER $PARFILE -token -lemma -no-unknown -sgml | 
>> $POST_TAGGING -no
>> 