[CWB] Problems with TreeTagger Portuguese
Meier-Vieracker, Simon
simon.meier at tu-berlin.de
Tue Aug 28 12:20:21 CEST 2018
Hi,
does anyone have experience with the TreeTagger for Portuguese?
With the ready-made script shown below, it does not recognize XML tags and splits them during tokenization, even though the -sgml option is set.
For example, the output is:
> <p>Boa VMI <p>Boa
> tarde RG tarde
> . Fp .
I could of course tokenize first and then run tree-tagger manually with the parameter file, but I wonder whether Portuguese has clitics that require special tokenization...
---
#!/bin/sh
# Set these paths appropriately
BIN=/Users/Simon/TreeTagger/bin
CMD=/Users/Simon/TreeTagger/cmd
LIB=/Users/Simon/TreeTagger/lib
TOKENIZER=${BIN}/separate-punctuation
SPLITTER=${CMD}/portuguese-splitter.perl
TAGGER=${BIN}/tree-tagger
ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
POST_TAGGING=${CMD}/portuguese-post-tagging
PARFILE=${LIB}/portuguese-utf8.par
# splitting
$SPLITTER $* |
# pre-tokenization
sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
# tokenizing
$TOKENIZER +1 +s +l $ABBR_LIST |
# remove empty lines
grep -v '^$' |
# tagging
$TAGGER $PARFILE -token -lemma -no-unknown -sgml |
$POST_TAGGING -no
---
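For reference, the manual route mentioned above could look roughly like the following. This is only a minimal sketch: the function name tokenize_keep_tags is made up for illustration, the punctuation handling is deliberately naive, and it does nothing about clitics or contractions, so it is not a substitute for the Portuguese splitter. It just puts every XML tag on a line of its own and every other token on its own line.

```shell
#!/bin/sh
# Hypothetical helper (not part of TreeTagger): one token per line,
# with XML/SGML tags kept intact on lines of their own.
tokenize_keep_tags() {
    # Step 1: surround every <...> tag with newlines
    # (backslash-newline form, so it works with both GNU and BSD sed).
    sed 's/<[^<>]*>/\
&\
/g' |
    awk '
        # Lines that consist of a single tag are passed through untouched.
        /^<[^<>]*>$/ { print; next }
        {
            # Naive assumption: pad common punctuation marks with spaces.
            gsub(/[.,;:!?()"]/, " & ")
            # Print each whitespace-separated token on its own line;
            # empty lines have no fields and produce no output.
            for (i = 1; i <= NF; i++) print $i
        }'
}

# Possible use with the variables from the script above (untested):
#   tokenize_keep_tags < input.xml | $TAGGER $PARFILE -token -lemma -no-unknown -sgml
```

Tags with attributes are also passed through as one line, since the first step only looks for a "<" followed by a ">" with no other angle brackets in between.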
Thanks for any hints!
Best, Simon