[CWB] Problems with TreeTagger Portuguese

Stefan Evert stefanML at collocations.de
Tue Aug 28 12:58:41 CEST 2018


This shell script uses a very old tokenizer that doesn't understand XML tags at all (something I already had to work around 20 years ago …).

Most other TreeTagger wrappers now use a Perl script

	TOKENIZER=${CMD}/utf8-tokenize.perl 

instead of the flex tokenizer

	TOKENIZER=${BIN}/separate-punctuation

Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer? That probably depends on whether you need the clitic and contraction splitting performed by

	SPLITTER=${CMD}/portuguese-splitter.perl

which the alternative parameter file doesn't seem to do.
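
For illustration, here is a minimal sketch of what a wrapper built around the Perl tokenizer might look like, modelled on the wrappers for other languages. Several things in it are assumptions rather than facts about the distribution: the parameter file name portuguese2.par is a guess, tag_portuguese is a made-up function name, and I am assuming that utf8-tokenize.perl accepts the abbreviation list via -a, as it does in the other wrappers I know. The paths are copied from your script.

```shell
#!/bin/sh
# Sketch only: a Portuguese wrapper using the Perl tokenizer instead of
# separate-punctuation, and without the portuguese-splitter.perl step.

# Installation directories (copied from the quoted script; adjust as needed)
BIN=/Users/Simon/TreeTagger/bin
CMD=/Users/Simon/TreeTagger/cmd
LIB=/Users/Simon/TreeTagger/lib

TOKENIZER=${CMD}/utf8-tokenize.perl
TAGGER=${BIN}/tree-tagger
ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
# ASSUMPTION: file name of the alternative parameter file; check your lib/ directory
PARFILE=${LIB}/portuguese2.par

# tag_portuguese is a hypothetical helper name, not part of TreeTagger.
# It processes the files given as arguments, or stdin if there are none.
tag_portuguese() {
    # tokenizing (ASSUMPTION: -a passes the abbreviation list)
    "$TOKENIZER" -a "$ABBR_LIST" "$@" |
    # tagging; -sgml tells the tagger to ignore tokens of the form <...>
    "$TAGGER" -token -lemma -sgml "$PARFILE"
}
```

It could then be called as `tag_portuguese corpus.vrt > corpus.tagged` (hypothetical file names). I haven't tested how well the tagging works without the splitter, so please check the output on a sample before running it on the full corpus.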

Best,
Stefan
 



> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier at tu-berlin.de> wrote:
> 
> 
> I could, of course, tokenize first and then run tree-tagger manually with the parameter file, but I wonder whether Portuguese has clitics that call for special tokenization...
> 
> ---
> 
> #!/bin/sh
> 
> # Set these paths appropriately
> 
> BIN=/Users/Simon/TreeTagger/bin
> CMD=/Users/Simon/TreeTagger/cmd
> LIB=/Users/Simon/TreeTagger/lib
> 
> TOKENIZER=${BIN}/separate-punctuation
> SPLITTER=${CMD}/portuguese-splitter.perl
> TAGGER=${BIN}/tree-tagger
> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
> POST_TAGGING=${CMD}/portuguese-post-tagging
> PARFILE=${LIB}/portuguese-utf8.par
> 
> # splitting 
> $SPLITTER $* |
> # pre-tokenization
> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
> # tokenizing
> $TOKENIZER +1 +s +l $ABBR_LIST |
> # remove empty lines
> grep -v '^$' |
> # tagging
> $TAGGER $PARFILE -token -lemma -no-unknown -sgml | 
> $POST_TAGGING -no
> 


