[CWB] Multi-word units

Thu Feb 14 23:02:16 CET 2013

> If I understand the problem correctly - the normal way that I would do this would be to encode the original orthography (with as-is token breaks) and a normalised orthography (with normalised token breaks) as two separate attributes (either 2 p-attributes, or one p-attribute with the normalised version and one s-attribute with the original version).

But wouldn't this break the alignment between the two attributes if one has two tokens where the other has only a single token?

CWB, like most other search tools, doesn't support multiple tokenizations of the same text.  If you want to mark multi-word units as single words, you can do this the BNC way with XML tags.

  <mw pos="adverb" lemma="apressurada mientre">
  apressurada ...
  mientre ...
  </mw>

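You'll have to teach your users always to search for single-token as well as multi-token words (the pos attribute of <mw> shows up in CQP as the s-attribute mw_pos, provided it was declared at encoding time), e.g. like this: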

  ([pos="adverb"] | <mw_pos="adverb"> []+ </mw_pos>)

and it's a bit tricky to get frequency counts including both types of words.  If you provide a Web interface, you might be able to hide the complexity behind a clever front-end.
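
If you'd rather get such combined counts outside CQP, one option is to count directly on the vertical-format input.  The following is only a rough sketch: it assumes three tab-separated columns (word, pos, lemma) and <mw> tags written exactly as in the example above; the sample tokens, the tag values and the function name count_lemmas are made up for illustration.  Each <mw> region is counted once under its own pos and lemma, and its component tokens are skipped so that nothing is counted twice.

```python
import re
from collections import Counter

# Hypothetical sample in vertical format: word, pos, lemma (tab-separated).
# The column layout and all tag values are assumptions for illustration.
SAMPLE = """\
fablo\tverb\tfablar
<mw pos="adverb" lemma="apressurada mientre">
apressurada\tadjective\tapressurado
mientre\tnoun\tmientre
</mw>
e\tconjunction\te
bien\tadverb\tbien
"""

MW_OPEN = re.compile(r'<mw\s+pos="([^"]*)"\s+lemma="([^"]*)">')

def count_lemmas(vertical_text, pos):
    """Count lemmas tagged `pos`, treating each <mw> region as one word."""
    counts = Counter()
    inside_mw = False
    for line in vertical_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = MW_OPEN.fullmatch(line)
        if m:
            # Multi-word unit: count it once, using the attributes on the tag.
            inside_mw = True
            if m.group(1) == pos:
                counts[m.group(2)] += 1
        elif line == "</mw>":
            inside_mw = False
        elif not inside_mw and not line.startswith("<"):
            # Ordinary token outside any multi-word unit.
            _word, tok_pos, lemma = line.split("\t")
            if tok_pos == pos:
                counts[lemma] += 1
    return counts

adverbs = count_lemmas(SAMPLE, "adverb")
```

Whether the component tokens should also be counted under their own tags is a design decision; the sketch leaves them out.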

On the other hand, if you feel that most users would treat this as a single word, then it might make sense to tokenize it as a single token:

  apressurada mientre ...

and add one or more extra p-attributes that specify the internal structure of the word (e.g. POS tags of the component tokens).
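
As a rough sketch of that conversion, the script below rewrites each <mw> region as one token line and appends a fourth column holding the POS tags of the component tokens, joined with "+".  It assumes the same three-column input (word, pos, lemma) as the <mw> example above; the extra column, the "+" separator and the function name merge_multiword_units are my own choices for illustration, not anything CWB prescribes.

```python
import re

MW_OPEN = re.compile(r'<mw\s+pos="([^"]*)"\s+lemma="([^"]*)">')

def merge_multiword_units(vertical_text):
    """Rewrite each <mw> region as one token line with a component-POS column."""
    out = []
    pending = None  # (pos, lemma, words, component_pos) while inside <mw>
    for line in vertical_text.splitlines():
        if not line.strip():
            continue
        m = MW_OPEN.fullmatch(line.strip())
        if m:
            pending = (m.group(1), m.group(2), [], [])
        elif line.strip() == "</mw>" and pending is not None:
            # Close the unit: emit a single token whose word form contains spaces.
            pos, lemma, words, parts = pending
            out.append("\t".join([" ".join(words), pos, lemma, "+".join(parts)]))
            pending = None
        elif pending is not None:
            # Component token: remember its word form and its POS tag.
            word, tok_pos, _lemma = line.split("\t")
            pending[2].append(word)
            pending[3].append(tok_pos)
        elif line.startswith("<"):
            out.append(line)  # other XML tags pass through unchanged
        else:
            # Single-token word: its only component is the word itself.
            word, tok_pos, lemma = line.split("\t")
            out.append("\t".join([word, tok_pos, lemma, tok_pos]))
    return "\n".join(out) + "\n"

# Hypothetical input; tokens and tag values are made up for illustration.
SAMPLE = """\
<mw pos="adverb" lemma="apressurada mientre">
apressurada\tadjective\tapressurado
mientre\tnoun\tmientre
</mw>
bien\tadverb\tbien
"""

merged = merge_multiword_units(SAMPLE)
```

Every token then carries the extra attribute, so users can still search for, say, adverbs built from an adjective plus a noun without needing a second tokenization.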

Best,
Stefan