[CWB] Problems with TreeTagger Portuguese
Stefan Evert
stefanML at collocations.de
Tue Aug 28 12:58:41 CEST 2018
This shell script uses a very old tokenizer that doesn't understand XML tags at all (I already had to work around that 20 years ago …).
Most other TreeTagger wrappers now use a Perl script
TOKENIZER=${CMD}/utf8-tokenize.perl
instead of the flex tokenizer
TOKENIZER=${BIN}/separate-punctuation
Can you work with the alternative Portuguese parameter file (tree-tagger-portuguese2), which uses the standard TreeTagger tokenizer? It probably depends on whether you need the clitics- and contraction-splitting in
SPLITTER=${CMD}/portuguese-splitter.perl
which the alternative parameter file doesn't seem to do.
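
If tokenizing first and then calling the tagger by hand is an option, the pipeline could look roughly like the sketch below. It is untested against a real installation and rests on several assumptions: the TREETAGGER_HOME variable and the tag_pt function name are hypothetical, the paths mirror the ones in your script, and I'm assuming your abbreviation list can be passed to utf8-tokenize.perl with -a, as the other wrappers do. It leaves out portuguese-splitter.perl and portuguese-post-tagging, so clitics and contractions stay unsplit, which may not match what portuguese-utf8.par expects.

```shell
#!/bin/sh
# Sketch only: tokenize with the Perl tokenizer, then tag by hand.
# TREETAGGER_HOME and tag_pt are hypothetical names; adjust the paths.

TREETAGGER_HOME=${TREETAGGER_HOME:-$HOME/TreeTagger}
BIN=${BIN:-$TREETAGGER_HOME/bin}
CMD=${CMD:-$TREETAGGER_HOME/cmd}
LIB=${LIB:-$TREETAGGER_HOME/lib}

TOKENIZER=${TOKENIZER:-$CMD/utf8-tokenize.perl}
TAGGER=${TAGGER:-$BIN/tree-tagger}
# Assumption: the abbreviation list from the old script works with -a.
ABBR_LIST=${ABBR_LIST:-$LIB/portuguese-abbreviations-utf8}
PARFILE=${PARFILE:-$LIB/portuguese-utf8.par}

tag_pt() {
    # tokenizing (no clitic/contraction splitting in this sketch)
    "$TOKENIZER" -a "$ABBR_LIST" "$@" |
    # remove empty lines
    grep -v '^$' |
    # tagging, same options as in the original script
    "$TAGGER" "$PARFILE" -token -lemma -no-unknown -sgml
}
```

After sourcing this, something like `tag_pt corpus.vrt > corpus.tagged` would run the whole chain on one file; whether the tagging quality is acceptable without the splitter is something you would have to check on your data.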
Best,
Stefan
> On 28 Aug 2018, at 12:20, Meier-Vieracker, Simon <simon.meier at tu-berlin.de> wrote:
>
>
> I could tokenize first and then use the tree-tagger with the parameter file manually, of course, but I wonder if Portuguese has some clitics which require special tokenization...
>
> ---
>
> #!/bin/sh
>
> # Set these paths appropriately
>
> BIN=/Users/Simon/TreeTagger/bin
> CMD=/Users/Simon/TreeTagger/cmd
> LIB=/Users/Simon/TreeTagger/lib
>
> TOKENIZER=${BIN}/separate-punctuation
> SPLITTER=${CMD}/portuguese-splitter.perl
> TAGGER=${BIN}/tree-tagger
> ABBR_LIST=${LIB}/portuguese-abbreviations-utf8
> POST_TAGGING=${CMD}/portuguese-post-tagging
> PARFILE=${LIB}/portuguese-utf8.par
>
> # splitting
> $SPLITTER $* |
> # pre-tokenization
> sed "s/\([\)\"\'\?\!]\)\([\.\,\;\:]\)/ \1 \2/g" |
> # tokenizing
> $TOKENIZER +1 +s +l $ABBR_LIST |
> # remove empty lines
> grep -v '^$' |
> # tagging
> $TAGGER $PARFILE -token -lemma -no-unknown -sgml |
> $POST_TAGGING -no
>