[CWB] CWB4 and Ziggurat

Fri Oct 9 16:59:33 CEST 2015

Dear Yannick,

thanks for your feedback and those pointers – I have looked at a number of existing indexing/database engines over the last 10 years in an effort to avoid re-inventing everything ourselves, but I hadn't been aware of these particular packages.

To kick off the discussion, let me briefly summarize the reasons why Andrew and I have decided not to build on an existing indexing engine:

1) Apache Parquet, Lucene, … are all written in Java.  We don't do Java.

2) Most of the recent developments (e.g. Apache Parquet) are aimed at massively distributed databases.  CWB4 should be useable on a standard desktop computer or laptop.

3) I'm not aware of any standard DB library that is optimized for static databases.  All of them support updates and transactions, which necessarily introduces additional overhead and complexity.  Of course, highly optimized code and complex file formats with special-case optimizations might make up for this, but we're not convinced that we'd gain much compared to our much simpler static file format.

4) Most importantly: all the engines we've found would require us to buy into a very large and complex infrastructure.  By the time we've learned all the details of such an infrastructure, we could probably have implemented CWB4 from scratch with much better control and understanding of its internals.
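
To illustrate point 3, here is a minimal sketch of what a static, write-once column could look like.  It is written in Python for brevity and is not the CWB4 file format: the names write_column and StaticColumn, the raw 32-bit little-endian layout and the idea of storing lexicon IDs are all assumptions made for this example.  The point is only that, without updates or transactions, random access reduces to an offset calculation on a memory-mapped file.

```python
import mmap
import os
import struct
import tempfile


def write_column(path, token_ids):
    """Write a column once as raw 32-bit little-endian integers.

    Hypothetical layout: no header, no index, no support for updates.
    """
    with open(path, "wb") as f:
        f.write(struct.pack("<%di" % len(token_ids), *token_ids))


class StaticColumn:
    """Read-only view of a column file via a memory map (illustrative only)."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self.size = len(self._map) // 4  # number of corpus positions

    def __getitem__(self, cpos):
        if not 0 <= cpos < self.size:
            raise IndexError(cpos)
        # Position cpos lives at byte offset cpos * 4; no locking needed.
        return struct.unpack_from("<i", self._map, cpos * 4)[0]

    def close(self):
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    lexicon = ["the", "cat", "sat", "on", "mat"]
    ids = [0, 1, 2, 3, 0, 4]  # "the cat sat on the mat"
    path = os.path.join(tempfile.mkdtemp(), "word.col")
    write_column(path, ids)
    col = StaticColumn(path)
    print(" ".join(lexicon[col[i]] for i in range(col.size)))
    col.close()
```

A real engine would add compression and index files on top, but none of that requires transaction support.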

We'd love to hear your arguments for why we should (or might) reconsider these decisions.

> 1) horizontal stability: I've seen time and again that this is necessary for any kind of fast access.

But that's exactly what all the other indexing engines _don't_ require …

>     See e.g. this overview on techniques on columnar databases 
>     http://www.cs.yale.edu/homes/dna/papers/abadi-column-stores.pdf

Doesn't "columnar database" just mean that the database community finally realized that the CWB3 data model wasn't such a bad idea after all?

Don't get me wrong – I have looked at MonetDB more than once before and would love to build on a mature and optimized engine, but I figured it would take me at least several weeks and a lot of source-code investigation to figure out whether MonetDB supports everything we need for sophisticated corpus queries.

>     There are also freely available implementations of query compilers, such as this one,
>     if you want to design the storage layer yourself.
>     https://github.com/uwescience/raco

I think we don't want to do this in relational algebra.

> 3) scriptability: I'm not sure whether everyone really wants to learn Lua.

Those are just some wild ideas that might never get implemented; but I would find some general scripting capability in CQP extremely useful.  Even if I have to learn Lua. :-)

I'd suggest doing it in Python if the Python interpreter weren't quite as bad as it is.  Slower than R, seriously?

>     What I would like to see, though, is to have some public API that both allows
>     access to the query compiler/execution engine and to raw data columns (or
>     lists/vectors of structures), and that can be wrapped using the FFI of your
>     favorite language.

The raw data API already exists in CWB3.  It will also be a C API in CWB4, but that's easy enough to wrap for most languages.

Implementing the query engine as a separate library with a documented API is also a clear requirement for CWB4.

Thanks again for your thoughts and let's continue the discussion!
Stefan