[CWB] Suggestion: user intervention in constructing an index

Maarten Janssen maartenpt at gmail.com
Thu Mar 29 20:27:25 CEST 2018


The problem with <g/>-like solutions is that they are only a partial solution: if there are only spacing problems, they work; but often much more is “lost” in the indexed CQP/Manatee corpus, such as split words (can’t, del, au), or your own demution cases. In those cases, removing the spaces via some mechanism still does not give you back the text; if you are serious about showing the original, there is no real solution other than keeping it. That is why ANNIS has a base-text layer, why many TEI-based corpora (and some CQP corpora) keep the original text on the <s/> nodes, and why TEITOK keeps byte offsets into the XML file. That way, you can display the original text for a search query (be it an aligned text, an attribute, or an XML fragment), with all its spaces and original orthography.
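To make the byte-offset idea concrete, here is a minimal sketch in Python. It is not TEITOK's or CWB's actual code: the token table, the example sentence, and the function names are all made up for illustration. The only point is that each indexed token carries a byte range into the untouched source, so a split word like "del" (indexed as "de" + "el") can still be shown as written.

```python
from typing import List, Tuple

# Original text as it might sit in the source file, as UTF-8 bytes.
# (Hypothetical example sentence, not taken from any real corpus.)
SOURCE = "Salió del café.".encode("utf-8")

# Hypothetical token table: (indexed form, start byte, end byte).
# "del" is split into "de" + "el" for the index; both rows point
# at the same byte range in the source.
TOKENS: List[Tuple[str, int, int]] = [
    ("Salió", 0, 6),   # "ó" takes two bytes in UTF-8
    ("de", 7, 10),
    ("el", 7, 10),
    ("café", 11, 16),  # "é" takes two bytes in UTF-8
    (".", 16, 17),
]


def indexed_text(first: int, last: int) -> str:
    """What you get by joining the indexed tokens first..last (inclusive)."""
    return " ".join(form for form, _, _ in TOKENS[first:last + 1])


def original_text(first: int, last: int) -> str:
    """The source text covered by tokens first..last (inclusive)."""
    span = TOKENS[first:last + 1]
    start = min(s for _, s, _ in span)
    end = max(e for _, _, e in span)
    return SOURCE[start:end].decode("utf-8")


if __name__ == "__main__":
    print(indexed_text(0, 4))   # token view of the whole sentence
    print(original_text(0, 4))  # the sentence as originally written
    print(original_text(1, 2))  # a match on "de el" maps back to "del"
```

No amount of space removal on the token view would turn "de el" back into "del"; the offsets do, because nothing is reconstructed, only looked up.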

