[CWB] Multi-word units
Stefan Evert
stefanML at collocations.de
Thu Feb 14 23:02:16 CET 2013
> If I understand the problem correctly - the normal way that I would do this would be to encode the original orthography (with as-is token breaks) and a normalised orthography (with normalised token breaks) as two separate attributes (either 2 p-attributes, or one p-attribute with the normalised version and one s-attribute with the original version).
But this would break the alignment between the two attributes, if one has two tokens and the other only a single token, wouldn't it?
CWB, like most other search tools, doesn't support multiple tokenizations of the same text. If you want to mark multi-word units as single words, you can do this the BNC way with XML tags.
<mw pos="adverb" lemma="apressurada mientre">
apressurada ...
mientre ...
</mw>
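Once the corpus is encoded, the pos attribute of the <mw> tag is available as the s-attribute mw_pos. You'll have to teach your users always to search for single-token as well as multi-token words, e.g. like this: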
([pos="adverb"] | <mw_pos="adverb"> []+ </mw_pos>)
and it's a bit tricky to get frequency counts including both types of words. If you provide a Web interface, you might be able to hide the complexity behind a clever front-end.
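One way to get such combined counts outside of CQP is to walk over the verticalized input and treat every <mw> region as one unit. The following is only a rough sketch, not something CWB provides: it assumes one token per line with tab-separated word and POS columns, <mw> tags on lines of their own, and a made-up sample in which TAG_A, TAG_B and the tokens after the region are hypothetical placeholders. The function name count_by_pos is likewise invented for illustration.

```python
import re
from collections import Counter

# Made-up sample in vertical format (word<TAB>pos); TAG_A / TAG_B are
# hypothetical placeholders for the component annotations.
SAMPLE = """\
<mw pos="adverb" lemma="apressurada mientre">
apressurada\tTAG_A
mientre\tTAG_B
</mw>
luego\tadverb
vino\tverb
luego\tadverb
"""

MW_OPEN = re.compile(r'<mw\s+pos="([^"]*)"')


def count_by_pos(vertical_text, target_pos):
    """Count single tokens and <mw> regions whose POS equals target_pos."""
    counts = Counter()
    mw_pos = None   # POS of the currently open <mw> region, None outside
    mw_words = []   # word forms collected inside the open region
    for line in vertical_text.splitlines():
        line = line.strip()
        if not line:
            continue
        opened = MW_OPEN.match(line)
        if opened:
            mw_pos, mw_words = opened.group(1), []
        elif line == "</mw>":
            if mw_pos == target_pos:
                counts[" ".join(mw_words)] += 1
            mw_pos = None
        elif line.startswith("<"):
            continue  # ignore any other XML tags
        else:
            word, pos = line.split("\t")[:2]
            if mw_pos is not None:
                mw_words.append(word)   # component of a multi-word unit
            elif pos == target_pos:
                counts[word] += 1       # ordinary single-token word
    return counts


counts = count_by_pos(SAMPLE, "adverb")
```

Note that this sketch does not count the component tokens of a multi-word unit individually, which differs slightly from the CQP query above, where [pos="adverb"] can also match a token inside an <mw> region.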
On the other hand, if you feel that most users would treat this as a single word, then it might make sense to tokenize it as a single token:
apressurada mientre ...
and add one or more extra p-attributes that specify the internal structure of the word (e.g. POS tags of the component tokens).
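If the corpus is already marked up with <mw> tags, the conversion to single tokens could be done as a preprocessing step before encoding. This is a minimal sketch under several assumptions: the input is in vertical format with tab-separated word and POS columns, the merged token takes its POS from the <mw> tag, and the component tags are joined with "+" in a third column. The function name merge_mw, the "+" separator and the tags TAG_A and TAG_B are hypothetical, chosen only for illustration.

```python
import re

MW_OPEN = re.compile(r'<mw\s+pos="([^"]*)"')


def merge_mw(lines):
    """Collapse each <mw> region into one token line: word, pos, component tags."""
    out = []
    mw_pos = None          # POS of the open <mw> region, None outside
    words, tags = [], []   # components collected inside the open region
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            out.append(line)
            continue
        opened = MW_OPEN.match(line)
        if opened:
            mw_pos, words, tags = opened.group(1), [], []
        elif line == "</mw>":
            # One token containing a blank, plus the tags of its parts.
            out.append("\t".join([" ".join(words), mw_pos, "+".join(tags)]))
            mw_pos = None
        elif line.startswith("<"):
            out.append(line)   # pass other XML tags through unchanged
        else:
            word, tag = line.split("\t")[:2]
            if mw_pos is not None:
                words.append(word)
                tags.append(tag)
            else:
                # Ordinary token: the extra column just repeats its own tag.
                out.append("\t".join([word, tag, tag]))
    return out


merged = merge_mw([
    '<mw pos="adverb" lemma="apressurada mientre">',
    "apressurada\tTAG_A",   # hypothetical component tag
    "mientre\tTAG_B",       # hypothetical component tag
    "</mw>",
    "luego\tadverb",        # made-up single-token example
])
```

The third column could then be declared as an additional p-attribute when encoding, so that users who care about the internal structure can still query it.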
Best,
Stefan