[CWB] Suggestion: user intervention in constructing an index

David Lukeš david.lukes at ff.cuni.cz
Fri Mar 23 18:42:51 CET 2018


 > I find it theoretically unsatisfactory. A case for Occam's razor, in fact.

That's the thing though -- you're applying scientific criteria to what is
essentially an engineering problem :) I completely agree that from the point
of view of linguistics, listing before and after pairs ("the plural of car is
cars, the plural of book is books...") is highly unsatisfactory compared to
working out general rules ("the plural in English is often created by
appending an -s").

But I hope nobody would consider a corpus index a linguistic description, it's
just a tool which helps us achieve very specific practical goals. And since
there's already a simple and reasonably efficient way of achieving the
behavior you want with the tool as it is, implementing another way of doing
the same thing could actually be seen as violating *the engineer's* Occam's
razor.

(To be clear, as a linguist working daily alongside engineers, I've also
occasionally felt the itch you're feeling, so I can sympathize.)

 > A word like "seanbhean" is actually more frequently spelled with a hyphen
 > between the parts

This sounds like people will be most likely to search for "sean-bhean"
(especially if they encounter it displayed as "sean-bhean" within the corpus
itself), which will yield no results if it's actually split into "sean" and
"bhean" under the hood.

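To make that concrete, here is a toy sketch in Python. It is not how CWB
actually stores or queries its index; the token lists and the helper names
build_index and lookup are made up for illustration. It only shows that an
exact-match lookup on a token-level index finds "sean-bhean" if, and only if,
the hyphenated form was kept as a single token:

```python
def build_index(tokens):
    """Map each token to the list of corpus positions where it occurs."""
    index = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok, []).append(pos)
    return index


def lookup(index, query):
    """Exact-match lookup of a single token; empty list if it is absent."""
    return index.get(query, [])


# Hypothetical tokenization that splits the compound at the hyphen.
split_index = build_index(["sean", "bhean"])

# Hypothetical tokenization that keeps the hyphenated form as one token.
whole_index = build_index(["sean-bhean"])

print(lookup(split_index, "sean-bhean"))  # no hits: the token does not exist
print(lookup(whole_index, "sean-bhean"))  # one hit, at position 0
```
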
Best,

David

