[CWB] cwb testing

Marco Baroni baroni at sslmit.unibo.it
Sat Jul 8 16:45:21 CEST 2006


Dear CWB-devvers,

I am told that, during the CWB meeting in Forli' in late May, I declared
that I would be in charge of preparing a testing suite to test new versions
of CWB against the current one (I was very tired during the whole meeting,
and I probably promised this while talking in my sleep...)

Something I do remember of this foggy event is that I said that I was going
to take over this job if and only if we could not find somebody else
interested in doing it, and that, if I ended up being the one in charge, I
would prepare something very very basic, just to make sure that newer
versions of CWB do not explode causing people to be injured and things like
that.

So, first of all: is there anybody interested in taking this up? If not, is
there somebody willing to collaborate with me on this? (If you are not a
hardcore C programmer, this is a very good way to make yourself useful to the
development of CWB, taking some work off the hands of people like Stefan, so
they can focus on the hardcore programming with one less aspect of the
project to worry about; and it might be an occasion to learn something about
sourceforging, software engineering, code versioning, testing, etc.)

Second of all, some quick notes on what a minimal testing suite should look
like.

Given that I can't spend too much time on this, I would like to stick to
the principle that evaluation is only in terms of identity of results, not
performance nor similarity of results (of course, before we can test for
identity, we will have to apply some regular expressions to normalize things
that are obviously bound to change, such as version numbers and dates; the
test will then just be whether new results are identical to old results or
not).

I.e., the testing suite only checks that results are the same as in the
current version, but does not issue a warning if, e.g., it now takes 3 hours
to do something that took 3 seconds before (although perhaps data on query
times could be stored in a log, if somebody wants to look at them).
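
To make the idea concrete, the normalization-plus-comparison step could be
as simple as something along these lines (a perl sketch; the two patterns
are just examples of "things bound to change", not the real list):

sub normalize {
    my $text = shift;
    $text =~ s/\bversion \d+(\.\d+)*\b/VERSION/gi;    # version numbers
    $text =~ s/\b\d{4}-\d{2}-\d{2}\b/DATE/g;          # dates (ISO-style)
    return $text;
}

sub same_output {
    my ($new, $gold) = @_;
    return normalize($new) eq normalize($gold);
}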

Moreover, we will not make an attempt to identify problems or assess degrees
of similarity: the program only detects differences in the output of an
operation. Outputs that differ by 1 character in a few megabytes and outputs
that differ because the new version just says "segmentation fault" are
treated in the same way: the program will return some non-illuminating
message such as "query results differ" in both cases; it is up to the person
who does the debugging to interpret the difference.

There will be three data components:

1) the test corpora/corpus
2) the queries to be performed on the test corpora
3) the results obtained with the current (= previous wrt the version being
tested) version of cwb (the "gold standard")

I like to think that 2) and 3)  can be easy. 2) could just be an
intelligent sample of queries from the tutorial.

The results in 3) will have to be stored somehow, but as long as I don't
care about space efficiency (and why should I? ;-), results could be kept
in a plain text file or in a db, and matching, after some pre-processing of
the new results, could be reduced to testing for equality of strings.
Actually, now that I think about it, I could just take fingerprints of the
original results, store those, and compare the new fingerprints against the
old ones. This would kill all hope of finding exactly where any error is,
but as long as we keep track of the query where the failure happened,
debuggers would know where to look. Moreover, in the distribution for
regular users, we could include the fingerprints only, making the
distribution lighter.
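
Concretely, and reusing the normalize() sub from the sketch above, a
fingerprint could just be one line per query, built with Digest::MD5 (which
ships with perl):

use Digest::MD5 qw(md5_hex);

# one gold-standard line per query: "query-id TAB md5-of-normalized-output"
sub fingerprint_line {
    my ($query_id, $output) = @_;
    return $query_id . "\t" . md5_hex(normalize($output)) . "\n";
}

# comparing a new run against the gold standard is then just a line-by-line
# comparison of two small fingerprint files (e.g. with diff).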

We will have to constantly update our database of "current" queries, so
that we can test on increasingly large test corpora (e.g., I imagine a
cycle like this: we test version 2 by comparing it to version 1 on a corpus
of, say, 350M words, which is near the maximum that version 1 can deal with,
but not the maximum that version 2 can deal with; if v2 passes the test, we
create a new gold standard based on queries made with version 2 to a corpus
of, say, 1 billion words, which is the maximum v2 can deal with, and that's
what we'll use to test v3, and so on).

This leaves the question of 1) open.

1a) One possibility would be to keep a few full corpora as part of the
testing suite. For example, we could use Dickens or German Law, as they are
richly annotated, albeit small, and then a big chunk of one or more of our
Web corpora, to test CWB with larger inputs.

1b) Another possibility, suggested by Stefan, would be to generate one or
more random corpora (from fixed seeds, for replicability), and run tests on
them.

Advantages of 1a): less work; we test the system on realistic data (which
might have some weird asymmetry that turns out to be a problem for CWB and
would not come out in randomly generated data)

Problems of 1a): test suite becomes hard to distribute, since it will be
huge; only features that are present in the target corpora are tested

Advantages of 1b): we can distribute scripts that generate the
corpus/corpora, which will be considerably lighter than the corpora
themselves, but could generate corpora as large as needed; we could test a
larger range of corpus typologies, albeit artificially generated

Problems of 1b): more work (although generating artificial corpora sounds
like a fun task); there might be some weird distribution that occurs in
real data and is problematic for CWB, but that we would completely miss in a
random corpus (but, then again, there is no guarantee that the specific
real corpora we could be using instead would include the interesting
problem cases)

So, I'm fairly convinced that 1b) is the way to go.

How should the random corpora be generated?

I'm thinking about something along the following lines.

The only free parameter is the number of documents to be generated, which
indirectly determines the approximate size of the resulting corpus.

For each of the specified number of docs to be generated, the program first
generates a document element:

<document id="1" att1="xlfklghjngothg" att2="a" att3="yes">
...
</document>

id is an increasing numerical id, att1 is a random string of arbitrary
length (in a range of, say, up to 1000 characters), att2 has 10 equiprobable
possible values, and att3 has 2 values (with probabilities of, say, 80% and
20%, respectively).
This should allow us to test for a plausible typology of attributes.
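
A sketch of how the opening tag could be generated (all figures as per the
made-up numbers above):

my @att2_values = ('a' .. 'j');                     # 10 equiprobable values

sub random_string {
    my $len = shift;
    my @alphabet = ('a' .. 'z');
    return join '', map { $alphabet[int rand @alphabet] } 1 .. $len;
}

sub document_tag {
    my $id   = shift;                               # increasing numerical id
    my $att1 = random_string(1 + int rand 1000);    # up to 1000 characters
    my $att2 = $att2_values[int rand @att2_values];
    my $att3 = rand() < .8 ? 'yes' : 'no';          # 80% / 20%
    return qq(<document id="$id" att1="$att1" att2="$att2" att3="$att3">\n);
}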

Next, for each document, there is a probability of, say, 30% that a
<header> element will be generated (so that we account for the presence of
optional structure inside documents). If it is generated, it will contain a
number of sentences between 1 and 10, each possibility equiprobable (for
the sentences, see below).

Next, a <body> element is generated. The number of sentences for the body
is obtained from a binomial distribution with n = 400 and p = .50 (so that
most texts would have about 200 sentences).

For each sentence <s>, it is decided whether it is superlong (p 0.00005) or
standard. If it's superlong, it will have a number of units sampled from a
distribution centered around 1000 tokens or so; if it's standard, its length
will follow a binomial distribution (or similar) centered around 10
(allowing for -- rare -- occurrences of empty sentences).
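
In code, these decisions boil down to a couple of cheap samplers, e.g. (a
naive binomial -- just summing coin flips -- is good enough for the n's
involved; the superlong distribution is a placeholder):

sub binomial {
    my ($n, $p) = @_;
    my $k = 0;
    for (1 .. $n) { $k++ if rand() < $p }
    return $k;
}

my $has_header       = rand() < .3;             # optional <header>
my $header_sentences = 1 + int rand 10;         # 1-10, equiprobable
my $body_sentences   = binomial(400, .5);       # centered around 200

# per sentence: superlong or standard
my $sentence_units = rand() < .00005
    ? 800 + int rand 400                        # "around 1000" units, roughly
    : binomial(40, .25);                        # centered around 10, 0 possible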

For each unit, it is decided whether it is a token (prob .7), or a
multi-word structure (prob. .3), simulating phrases in a partial parse or
multi-word units.

If it is a multi-word structure, one of four tags (say, <ap>, <bp>, <cp> or
<dp>) is generated. Then, a number of tokens between 1 and 5 (equiprobable)
is generated inside the mw structure.

A token, whether inside the header, inside a mw structure, or directly
under <body>, has 4 positional attributes:

word lemma pos green

For the word attribute, a length is determined as follows. First, it is
decided whether the word is superlong (p 0.0001) or standard.

If the word is superlong, its length is determined from some distribution
centered around, say, 1000 characters or something like that.

If the word is standard, its length will be based on, say, a binomial
distribution with n=15 and p=.25, or something like that (avoiding 0s, I
guess).

A word is then generated by picking as many characters from an alphabet as
the length sampled at the previous stage requires (characters could be
equiprobable or have different probabilities -- in any case, I don't plan
to implement phonotactic constraints! ;-) Although this is not exactly like
Wentian Li's monkey language experiment, I would expect it to generate a
distribution of word frequencies that is reasonably Zipfian, as words like
"a" and "yl" will be very frequent, and there will be a multitude of
different long types such as "fkljeskjg".
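
The word generator itself could then look like this (reusing the binomial()
sub from the sketch above; the alphabet and the superlong distribution are,
again, placeholders):

my @alphabet = ('a' .. 'z', 'A' .. 'Z');

sub word_length {
    return 800 + int rand 400 if rand() < .0001;    # superlong words
    my $len = binomial(15, .25);                    # standard words
    return $len > 0 ? $len : 1;                     # avoid 0-length words
}

sub make_word {
    my $len = word_length();
    return join '', map { $alphabet[int rand @alphabet] } 1 .. $len;
}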

(You might wonder why we don't directly use a unigram language model trained
on a real corpus: because that would be relatively big, whereas the current
proposal means that a relatively small program that does not need any
external resource can generate a very large corpus; plus, in this way, if
we want to do extensive testing, we can generate a lot of different
corpora, perhaps playing with various parameters; plus, this sounds like fun!)

The lemma is constructed by truncating any word longer than 5 characters to
the first 5 characters and by lower-casing (assuming we are sampling from
an alphabet where it makes sense to distinguish between upper and lower
case), so that, as with real-life lemmas, the lemma distribution should
have more high-frequency items and fewer low-frequency items, while still
being a very type-rich distribution.
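
In code this would be nothing more than something like:

sub make_lemma {
    my $word = shift;
    return lc substr($word, 0, 5);    # first 5 characters, lower-cased
}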

POS's would be picked from a list of, say, 50 elements, with probability
(am I becoming too baroque?) conditioned on the length of the word, so that
tags more likely to occur with short words will be token-rich, like
function word tags, whereas tags more likely to occur with longer words
will be type-rich, like content word tags. At the same time, tags are not
associated with particular words, so there will be a good amount of
ambiguity, for short frequent words at least. (As I understand it, nothing
in the current version of CWB hinges on the joint distribution of
positional attributes, so all this probably does not matter, but just in
case...)
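
One cheap way to get this kind of length-conditioned pick (tag names and the
.7/.1 figures are of course made up):

my @function_tags = map { "F$_" } 1 .. 5;     # few tags, meant for short words
my @content_tags  = map { "C$_" } 1 .. 45;    # many tags, for longer words

sub make_pos {
    my $word = shift;
    my $p_function = length($word) <= 4 ? .7 : .1;
    return rand() < $p_function
        ? $function_tags[int rand @function_tags]
        : $content_tags[int rand @content_tags];
}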

Finally, green is a binary attribute, with distribution P(no)=.8,
P(yes)=.2, just to try yet another way people might decide to encode
information, one that leads to an attribute with a different distribution
from those of words, lemmas and pos's.

When all the units belonging to a <s> have been generated, end-of-sentence
punctuation might be introduced (with p=.9; if not, the sentence ends
without punctuation); eos punctuation marks are special tokens, such as:

.	.	EOS	no

There are very few of them, so that they will be highly frequent (they can
be sampled from a distribution such as p(.)=.7; p(?)=.2; p(!)=.1, or
something like that).
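
Both green and the eos marks then reduce to weighted picks, e.g.:

sub make_green {
    return rand() < .8 ? 'no' : 'yes';    # P(no)=.8, P(yes)=.2
}

sub end_of_sentence {
    return '' if rand() >= .9;            # 10% of sentences: no punctuation
    my $r = rand();
    my $mark = $r < .7 ? '.' : $r < .9 ? '?' : '!';   # .7 / .2 / .1
    return join("\t", $mark, $mark, 'EOS', 'no') . "\n";
}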

In this way, we will be able to generate richly annotated corpora,
illustrating various frequency distributions, of arbitrary size.

I've been vague about the issue of the alphabet we are sampling from, but I
would be inclined to say that it should be a latin1-like alphabet (as Asian
scripts will have very different character frequency distributions, and I
don't want to think about that right now). Re the encoding... dunno...
would it make sense/be worth it to generate 2 corpora, one in, say, latin-1
and one in utf8, or, since we already know that CWB supports all encodings
known to mankind, should we not worry about this?

We could have 2 testing modes: one that is run as part of the standard
installation procedure, where a random corpus of, say, 10M tokens is
generated, indexed and used for testing; and one that is run by the official
testers before a release, and uses a random corpus that is near the upper
limit of tokens that the previous version of CWB could handle (and of
course testing could then take hours or days). We could also have a third
mode, where we generate a corpus as large as the tested version of CWB is
supposed to handle, and we just check whether CWB survives indexing and a
few queries, without comparing it to any previous version.

I thought of implementing all this as one or more perl and bash scripts
that do both indexing and querying by using Stefan's modules, although
Stefan seemed to have doubts about whether that's a good idea... I don't
really see any low effort alternative... or?

Does this sound reasonable? Comments? Alternative ideas?

Please feed back.

Regards,

Marco




-- 
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni



Leadership is a form of evil. No one needs to lead you to do something
that is obviously good for you.

(Scott Adams)




