[CWB] Problems with TreeTagger Portuguese

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Aug 28 15:05:09 CEST 2018


Hi Simon,

The tagsets in your two examples are FreeLing-based and Universal Dependencies-based, respectively. Off the top of my head, I do not know which is more widely used for Portuguese. The SketchEngine PtTenTen and Mark Davies' corpusdoportugues.org are both POS-tagged ... but each of those uses yet another tagset!

In sum, you are probably safe with either tagset, unless your users have prior experience of one or the other, in which case obviously prefer that one.

Best,

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Meier-Vieracker, Simon
Sent: 28 August 2018 12:51
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Problems with TreeTagger Portuguese

Thanks, Stefan, for these hints! 

Both parameter files, portuguese-utf8.par and portuguese2-utf8.par, seem to produce good output after standard tokenization with utf8-tokenize.perl. However, the two tagsets are rather different (see below). Since I would like to provide the data for interested users in CQPweb, I wonder which tagset is more common and suitable. At first glance, the second one seems to have a lower error rate, e.g. with proper nouns.
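The tokenize-then-tag route described above can be sketched as a small shell function. This is only a sketch under assumptions: the install paths mirror the wrapper script quoted further down in this thread, `tag_pt` is a hypothetical helper name, and the empty-line filter is carried over from that script rather than something the Perl tokenizer is known to require.

```shell
#!/bin/sh
# Assumed install paths, copied from the wrapper script quoted later in this thread.
BIN=${BIN:-/Users/Simon/TreeTagger/bin}
CMD=${CMD:-/Users/Simon/TreeTagger/cmd}
LIB=${LIB:-/Users/Simon/TreeTagger/lib}

TOKENIZER=${TOKENIZER:-${CMD}/utf8-tokenize.perl}
TAGGER=${TAGGER:-${BIN}/tree-tagger}

# tag_pt PARFILE [FILE...]  (hypothetical helper)
# Tokenize the input, drop empty lines, then tag with the given parameter file.
tag_pt() {
    parfile=$1
    shift
    "$TOKENIZER" "$@" |
    grep -v '^$' |
    "$TAGGER" "$parfile" -token -lemma -sgml
}
```

With that in place, something like `tag_pt "${LIB}/portuguese2-utf8.par" corpus.txt > corpus.vrt` would tag a file with the second parameter file; `corpus.txt` and `corpus.vrt` are placeholder names.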

So if anyone here is familiar with Portuguese and the standards of Portuguese corpus linguistics, I would be grateful for any advice.

Best, Simon

> Boa	AQ0	bom
> tarde	NCFS	tarde
> .	Fp	.
> É	VMI	ser
> este	PD0	este
> o	DA0	o
> dia	NCMS	dia
> :	Fd	:
> Benfica	NCMS	Benfica
> e	CC	e
> FC	NCMS	FC
> Porto	VMI	portar
> defrontam-se	NCFS	defrontam-se
> a	SPS	a
> partir	VMN	partir
> das	SPS	de+as
> 18	Z	@card@
> horas	NCFP	hora
> no	SP+DA	em+o
> Estádio	NCMP	Estádio
> da	SPS	de+a
> Luz	NCMP	Luz
> ,	Fc	,
> no	SP+DA	em+o
> clássico	NCMS	clássico
> que	PR0	que
> vale	VMI	valer
> a	DA0	o
> liderança	NCFS	liderança
> .	Fp	.

and

> Boa	ADJ.Fem.Sing	bom
> tarde	NOUN.Fem.Sing	tarde
> .	PUNCT.Sent	.
> É	AUX.Fin.Sing	ser
> este	DET.Masc.Sing	este
> o	DET.Masc.Sing	o
> dia	NOUN.Masc.Sing	dia
> :	PUNCT.Colon	:
> Benfica	PROPN.Masc.Sing	Benfica
> e	CCONJ	e
> FC	PROPN.Masc.Sing	FC
> Porto	PROPN.Sing	Porto
> defrontam-se	VERB.Fin.Sing	defrontam-se
> a	ADP	a
> partir	NOUN	partir
> das	ADP_DET.Fem.Plur	de_o
> 18	NUM	18
> horas	NOUN.Fem.Plur	hora
> no	ADP_DET.Masc.Sing	em_o
> Estádio	PROPN.Masc.Sing	Estádio
> da	ADP_DET.Fem.Sing	de_o
> Luz	PROPN.Sing	Luz
> ,	PUNCT.Comma	,
> no	ADP_DET.Masc.Sing	em_o
> clássico	NOUN.Masc.Sing	clássico
> que	PRON.Rel.Masc.Sing	que
> vale	VERB.Fin.Sing	valer
> a	DET.Fem.Sing	o
> liderança	NOUN.Fem.Sing	liderança
> .	PUNCT.Sent	.
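To make differences such as the tagging of "Porto" easier to spot, the two outputs can be lined up token by token. The following is a minimal sketch, assuming both files are raw TreeTagger output (token, tag and lemma separated by tabs, without the "> " quote markers shown above) and were produced from the same tokenization; `side_by_side` is a hypothetical helper name.

```shell
#!/bin/sh
# side_by_side FILE1 FILE2  (hypothetical helper)
# Both files: one token per line, tab-separated as token, tag, lemma.
# Prints token, tag from FILE1, tag from FILE2 for every aligned line;
# lines whose tokens differ are reported on stderr and skipped.
side_by_side() {
    paste "$1" "$2" |
    awk -F '\t' '
        $1 == $4 { print $1 "\t" $2 "\t" $5; next }
        { print "line " NR ": tokens differ: " $1 " / " $4 > "/dev/stderr" }
    '
}
```

For example, `side_by_side freeling.txt ud.txt | awk -F '\t' '$3 ~ /^PROPN/'` would list only the tokens that the second parameter file tags as proper nouns, next to the tag the first one assigned; both file names are placeholders.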


---

> On 28 Aug 2018, at 12:58, Stefan Evert <stefanML at collocations.de> wrote:
> 
> This shell script uses a very old tokenizer that doesn't understand XML tags at all (and which I used to have to work around 20 years ago …).
> 
> Most other TreeTagger wrappers now use a Perl script
> 
> 	TOKENIZER=${CMD}/utf8-tokenize.perl 
> 
> instead of the flex tokenizer
> 
> 	TOKENIZER=${BIN}/separate-punctuation
> 
> Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer? It probably depends on whether you need the clitic and contraction splitting in
> 
> 	SPLITTER=${CMD}/portuguese-splitter.perl
> 
> which the alternative parameter file doesn't seem to do.
> 
> Best,
> Stefan
> 
> 
> 
> 
>> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier at tu-berlin.de> wrote:
>> 
>> 
>> I could tokenize first and then run the tree-tagger with the parameter file manually, of course, but I wonder whether Portuguese has clitics that demand special tokenization...
>> 
>> ---
>> 
>> #!/bin/sh
>> 
>> # Set these paths appropriately
>> 
>> BIN=/Users/Simon/TreeTagger/bin
>> CMD=/Users/Simon/TreeTagger/cmd
>> LIB=/Users/Simon/TreeTagger/lib
>> 
>> TOKENIZER=${BIN}/separate-punctuation
>> SPLITTER=${CMD}/portuguese-splitter.perl
>> TAGGER=${BIN}/tree-tagger
>> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
>> POST_TAGGING=${CMD}/portuguese-post-tagging
>> PARFILE=${LIB}/portuguese-utf8.par
>> 
>> # splitting 
>> $SPLITTER $* |
>> # pre-tokenization
>> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
>> # tokenizing
>> $TOKENIZER +1 +s +l $ABBR_LIST |
>> # remove empty lines
>> grep -v '^$' |
>> # tagging
>> $TAGGER $PARFILE -token -lemma -no-unknown -sgml | 
>> $POST_TAGGING -no
>> 
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb

