[CWB] CWB4 and Ziggurat

Yannick Versley yversley at gmail.com
Fri Oct 9 20:45:56 CEST 2015


Dear Stefan,

> 1) Apache Parquet, Lucene, … are all written in Java.  We don't do Java.
>
Parquet is a file format that lets you cut up JSON data structures
(including lists and other hierarchical data) into chunks of similar
data, i.e., like a column database, but more liberal about what the
structures are. This may be useful for the cases where you currently
have strings and hashes.
There's a C++ library at https://github.com/Parquet/parquet-cpp
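
To make "chunks of similar data" concrete, here is a minimal sketch in plain
Python. It is not Parquet's actual encoding (which records nesting with
repetition and definition levels) and it does not use any Parquet library; the
function names `shred` and `_walk` and the sample token records are made up
for illustration. It only shows how nested records can be regrouped into one
value list per field path.

```python
from collections import defaultdict


def shred(records):
    """Regroup a list of nested dicts into one (row_id, value) list per field path."""
    columns = defaultdict(list)
    for row_id, record in enumerate(records):
        _walk(record, "", row_id, columns)
    return dict(columns)


def _walk(value, path, row_id, columns):
    if isinstance(value, dict):
        # Descend into nested fields, extending the dotted path.
        for key, child in value.items():
            _walk(child, f"{path}.{key}" if path else key, row_id, columns)
    elif isinstance(value, list):
        # List elements share the path of the list itself.
        for child in value:
            _walk(child, path, row_id, columns)
    else:
        # Leaf value: store it in the column for this path.
        columns[path].append((row_id, value))


# Hypothetical token records with optional and repeated fields.
records = [
    {"word": "the", "pos": "DT", "feats": {"case": "nom"}},
    {"word": "dogs", "pos": "NNS", "morph": ["pl", "nom"]},
]
columns = shred(records)
```

Each column then holds values of one kind only, which is what makes
per-column compression and scanning attractive.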

> 2) Most of the recent developments (e.g. Apache Parquet) are aimed at
> massively distributed databases.  CWB4 should be useable on a standard
> desktop computer or laptop.
>
Some massively distributed databases are built for a use case similar to
that of CQP (i.e., large data with few updates) and, in the case of Parquet,
contain a storage layer that can be used independently.
I wasn't trying to convince anyone to build CQP4 around a distributed
architecture based on Zookeeper and HDFS. Maybe that's for CQP5 ;-)

> Doesn't "columnar database" just mean that the database community finally
> realized that the CWB3 data model wasn't such a bad idea after all?
>
Columnar databases have been around since the 70s, and they've been one of
the active research topics in databases in the last 10 years. And part of
what you write really reminds me of an old talk from Peter Boncz (of
MonetDB fame) about their sadly-not-open-source X100 project.

Here's an overview of Vectorwise (the product that X100 grew into), which
cites lots of DB research papers that are all concerned with
pulling compressed columns (or row/column chunks) into memory:
http://sites.computer.org/debull/A12mar/vectorwise.pdf
See especially PFOR/PFOR-delta ("patched frame of reference", a more
CPU-efficient way of doing what byte coding is meant to do):
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1617427
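
As a rough illustration of the patched-frame-of-reference idea, here is a
simplified sketch in Python. It makes several assumptions that are not in the
paper or in any existing library: the function names are invented, the block
is assumed to be non-empty, codes are kept as ordinary Python ints rather than
packed into a bit array, and exceptions are stored as a separate list of
(position, value) pairs instead of being chained through the code slots. What
it does show is the core trick: most values are stored as small offsets from a
base, the few that do not fit are patched in afterwards, and the delta variant
applies the same scheme to gaps between sorted positions.

```python
def pfor_encode(values, bits):
    """Store values as `bits`-wide offsets from the block minimum; the rest are exceptions."""
    base = min(values)
    limit = 1 << bits
    codes = []
    exceptions = []  # (position, original value) for offsets that do not fit
    for pos, value in enumerate(values):
        offset = value - base
        if offset < limit:
            codes.append(offset)
        else:
            codes.append(0)  # placeholder, overwritten when decoding
            exceptions.append((pos, value))
    return base, codes, exceptions


def pfor_decode(base, codes, exceptions):
    # Branch-free main pass over all codes, then patch the exceptions.
    values = [base + code for code in codes]
    for pos, value in exceptions:
        values[pos] = value
    return values


def delta_encode(sorted_values):
    """Replace each value by its gap to the previous one (first gap is from 0)."""
    deltas, previous = [], 0
    for value in sorted_values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical corpus positions for one lexicon entry.
positions = [3, 7, 8, 15, 16, 1000, 1002, 1003]
base, codes, exceptions = pfor_encode(delta_encode(positions), bits=3)
restored = delta_decode(pfor_decode(base, codes, exceptions))
```

In this example only the single large gap ends up as an exception; all other
gaps fit into three bits each.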

Since these have been around for some time, I would expect at least
some open-source libraries to contain implementations of them. But
maybe it's exotic enough that it's limited to high-tech DB startups.

On the other hand, the SETS dependency query engine from Filip Ginter's
group 'just' runs on bitsets (a bit like CQP's IDList is a sparse set
representation).
https://github.com/fginter/dep_search
http://anthology.aclweb.org/N/N15/N15-3011.pdf
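
The contrast between a bitset and a sparse (sorted ID list) representation can
be sketched as follows. This is neither the SETS implementation nor CQP's
IDList; the function names and the sample sentence IDs are invented, and
Python's arbitrary-precision ints stand in for a real bit array.

```python
def to_bitset(ids):
    """Dense representation: bit i is set if i is a member."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits


def from_bitset(bits):
    """List the members of a bitset in ascending order."""
    ids, i = [], 0
    while bits:
        if bits & 1:
            ids.append(i)
        bits >>= 1
        i += 1
    return ids


def intersect_sorted(a, b):
    """Sparse representation: merge-style intersection of two sorted ID lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical IDs of sentences matching two different conditions.
has_noun = [1, 4, 6, 9]
has_subject = [4, 5, 9]

dense_hits = from_bitset(to_bitset(has_noun) & to_bitset(has_subject))
sparse_hits = intersect_sorted(has_noun, has_subject)
```

With bitsets, combining conditions is a single bitwise operation whose cost
depends on the size of the universe; with sorted lists, the cost depends on
the number of members, which tends to pay off for rare items.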

> I think we don't want to do this in relational algebra.
>
At some point, if you have graph structures in your data, you will get
some kind of joins, and even with the current RE-over-annotated-tokens,
one could actually do some kind of query optimization if one were careful.
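
One possible reading of "query optimization" for token sequences is to start
from the rarest constraint instead of scanning left to right. The sketch below
is only an illustration under that assumption; it does not describe how CQP
evaluates queries, and `build_index`, `match_sequence` and the toy corpus are
made up.

```python
def build_index(tokens):
    """Inverted index from (attribute, value) to the sorted list of token positions."""
    index = {}
    for pos, token in enumerate(tokens):
        for attr, value in token.items():
            index.setdefault((attr, value), []).append(pos)
    return index


def match_sequence(tokens, index, pattern):
    """Find start positions where consecutive tokens satisfy `pattern`.

    `pattern` is a list of (attribute, value) constraints, one per token.
    """
    # Anchor on the constraint with the fewest hits in the index.
    anchor = min(range(len(pattern)), key=lambda k: len(index.get(pattern[k], [])))
    hits = []
    for pos in index.get(pattern[anchor], []):
        start = pos - anchor
        if start < 0 or start + len(pattern) > len(tokens):
            continue
        # Verify the full pattern around each anchor hit.
        if all(tokens[start + k].get(attr) == value
               for k, (attr, value) in enumerate(pattern)):
            hits.append(start)
    return hits


tokens = [
    {"word": "the", "pos": "DT"},
    {"word": "old", "pos": "JJ"},
    {"word": "ziggurat", "pos": "NN"},
    {"word": "the", "pos": "DT"},
    {"word": "new", "pos": "JJ"},
    {"word": "temple", "pos": "NN"},
]
index = build_index(tokens)
pattern = [("pos", "DT"), ("pos", "JJ"), ("word", "ziggurat")]
starts = match_sequence(tokens, index, pattern)
```

Here only the single occurrence of "ziggurat" is inspected, rather than every
determiner in the corpus.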

> I'd suggest to do it in Python if the Python interpreter weren't quite as
> bad as it is.  Slower than R, seriously?
>
Slower than R would be weird. See here for a semi-scientific comparison
http://www.johndcook.com/blog/2014/06/20/benchmarking-c-python-r-etc/

For Python specifically, there is Cython (a kind of C/Python hybrid language
that allows you to get 90% of the speed of C with 90% of the convenience
of Python), as well as JIT compilers such as PyPy and Numba.

For my own work, I usually write some Cython code when Python and
the CQP query language take too long for some simple thing.
I have mostly come to the point where I can build PyPI-installable
packages using Cython code now; a friendlier approach would be
to use the conda packaging system that you get with the Anaconda
Python distribution for Windows/Mac/Linux (which also has Numba).

(Another alternative that I thought of would be JavaScript. However
tempting that is, I think that both the steep learning curve for the
V8/NodeJS FFI and the fact that JavaScript isn't that widely taught make
it less attractive.)

> Implementing the query engine as a separate library with documented API is
> also a clear requirement for CWB4.
>
yay.

Another thing that came to my mind: currently it's hard/impossible to ship
a single CQP corpus in a (zipped/tarred) directory and have it run out of
the box. It would be quite neat if there was a solution for that.

Best wishes,
Yannick


>
>
> Thanks again for your thoughts and let's continue the discussion!
> Stefan
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>