[CWB] Problems with TreeTagger Portuguese

Tue Aug 28 12:20:21 CEST 2018

Hi,

Does anyone have experience with the TreeTagger for Portuguese?

When I run the ready-made script shown below, XML tags are not recognized and get split during tokenization, although the -sgml option is set.

For example, the output is:

> <p>Boa	VMI	<p>Boa
> tarde	RG	tarde
> .	Fp	.

I could, of course, tokenize first and then run tree-tagger manually with the parameter file, but I wonder whether Portuguese has clitics that require special tokenization...
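
A minimal sketch of that manual route, only to illustrate the idea: the helper name split_tags and the input file name are hypothetical. It assumes that putting every XML tag on a line of its own is enough for the tagger to treat it as markup when -sgml is set. It does not separate punctuation and does nothing about clitics, so it is not a replacement for the Portuguese tokenizer.

```shell
#!/bin/sh

# Hypothetical helper: emit each XML tag on its own line and every
# whitespace-separated word on its own line. Punctuation stays attached
# to the preceding word; clitics are not handled.
split_tags() {
  # Surround every <...> tag with newlines (portable backslash-newline form).
  sed 's/<[^<>]*>/\
&\
/g' |
  # Print tag lines unchanged (attributes stay intact); split other lines
  # into one token per line. Empty lines have no fields and are dropped.
  awk '/^<.*>$/ { print; next } { for (i = 1; i <= NF; i++) print $i }'
}

# Possible use with the variables from the script (input.txt is a placeholder):
#   split_tags < input.txt | $TAGGER -token -lemma -sgml $PARFILE
```

For the line `<p>Boa tarde.</p>` this yields four lines: `<p>`, `Boa`, `tarde.` and `</p>`.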

---

#!/bin/sh

# Set these paths appropriately

BIN=/Users/Simon/TreeTagger/bin
CMD=/Users/Simon/TreeTagger/cmd
LIB=/Users/Simon/TreeTagger/lib

TOKENIZER=${BIN}/separate-punctuation
SPLITTER=${CMD}/portuguese-splitter.perl
TAGGER=${BIN}/tree-tagger
ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
POST_TAGGING=${CMD}/portuguese-post-tagging
PARFILE=${LIB}/portuguese-utf8.par

# splitting 
$SPLITTER $* |
# pre-tokenization
sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
# tokenizing
$TOKENIZER +1 +s +l $ABBR_LIST |
# remove empty lines
grep -v '^$' |
# tagging
$TAGGER $PARFILE -token -lemma -no-unknown -sgml | 
$POST_TAGGING -no

---

Thanks for any hints!

Best, Simon