[CWB] cwb testing

lars nygaard lars.nygaard at iln.uio.no
Wed Jul 26 17:26:52 CEST 2006


Hi,

I'll be happy to write a perl script that does the actual testing; I 
guess each query should be run both directly through cqp (through a 
shell command) and through the perl modules (since different things 
might go wrong in each case).
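
Something along these lines, maybe (an untested sketch; the module name 
and its new()/exec() interface are my assumption about what Stefan's 
perl interface looks like, not the definitive API):

  #!/usr/bin/perl -w
  use strict;
  use CQP;   # Stefan's perl interface -- name/methods are an assumption

  my $corpus = "DICKENS";
  my $query  = '[pos = "JJ"] [word = "man"];';

  # (1) run the query through the cqp binary via the shell
  my $shell_out = `echo '$corpus; $query cat;' | cqp -c`;

  # (2) run the same query through the perl module
  my $cqp = new CQP;
  $cqp->exec($corpus);
  my @lines = $cqp->exec("$query cat;");
  my $module_out = join("", map { "$_\n" } @lines);

  # the two paths can fail in different ways, so compare both
  print "MISMATCH on query: $query\n" if $shell_out ne $module_out;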

As for the testing corpus, I suggest that we use Dickens or German Law. 
If we want to automatically generate larger corpora, we could just 
duplicate the text in the smaller corpus. We could also use a realistic, 
very large corpus for testing, but make that test optional (for those 
who do not want to download it).

Anyway, what we use for testing purposes is not really important. Please 
send me queries you would like to have tested, and I'll write the test 
program.

-lars

Marco Baroni wrote:
> Dear CWB-devvers,
>
> I am told that, during the CWB meeting in Forli' in late May, I declared
> that I would be in charge of preparing a testing suite to test new
> versions of CWB against the current one (I was very tired during the
> whole meeting, and I probably promised this while talking in my sleep...)
>
> Something I do remember of this foggy event is that I said that I was
> going to take over this job if and only if we could not find somebody
> else interested in doing it, and that, if I ended up being the one in
> charge, I would prepare something very very basic, just to make sure
> that newer versions of CWB do not explode causing people to be injured
> and things like that.
>
> So, first of all: is there anybody interested in taking this up? If
> not, is there somebody willing to collaborate with me on this? (If you
> are not a hardcore C programmer, this is a very good way to make
> yourself useful to the development of CWB, taking some work off people
> like Stefan, so they can focus on the hardcore programming with one
> less aspect of the project to worry about; and it might be an occasion
> to learn something about sourceforging, software engineering, code
> versioning, testing, etc.)
>
> Second of all, some quick notes on what a minimal testing suite should
> look like.
>
> Given that I can't spend too much time on this, I would like to stick
> to the principle that evaluation is only in terms of identity of
> results, not performance nor similarity of results. Of course, before
> we can test for identity, we will have to apply some regular
> expressions for things that are obviously bound to change, such as
> version numbers and dates; the test will then just be whether the new
> results are identical to the old results or not.
>
> I.e., the testing suite only checks that results are the same as in the
> current version, but does not send a warning if, e.g., it now takes 3
> hours to do something that took 3 seconds before (although perhaps data
> on query times could be stored in a log, if somebody wants to look at
> them).
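>
> In code, the normalisation step could be as simple as something like
> this (a rough sketch; the list of substitutions will have to grow as we
> discover more volatile bits of output):
>
>   sub normalize {
>       my $output = shift;
>       # mask things that legitimately change between versions/runs
>       $output =~ s/\d+\.\d+(\.\d+)?/VERSION/g;   # (crude) version numbers
>       $output =~ s/\w{3} \w{3} ?\d+ \d\d:\d\d:\d\d \d{4}/DATE/g;  # dates
>       return $output;
>   }
>
>   # the test itself is then just string identity
>   my ($gold_output, $new_output) = @ARGV;  # raw outputs of both versions
>   print "query results differ\n"
>       if normalize($new_output) ne normalize($gold_output);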
>
> Moreover, we will not make an attempt to identify problems or assess
> degrees of similarity: the program only detects differences in the
> output of operations. Outputs that differ by 1 character in a few
> megabytes and outputs that differ because the new output just says
> "segmentation fault" are treated in the same way: the program will
> return some non-illuminating message such as "query results differ" in
> both cases; it is up to the person who does the debugging to interpret
> the difference.
>
> There will be three data components:
>
> 1) the test corpus/corpora
> 2) the queries to be performed on the test corpora
> 3) the results obtained with the current version of cwb (i.e., the
> version previous to the one being tested): the "gold standard"
>
> I like to think that 2) and 3) can be easy. 2) could just be an
> intelligent sample of queries from the tutorial.
>
> The results in 3) will have to be stored somehow, but as long as I
> don't care about space efficiency (and why should I? ;-), results could
> be kept in a plain text file or in a db, and matching, after some
> pre-processing of the new results, could be reduced to testing for
> equality of strings. Actually, now that I think about it, I could just
> take fingerprints of the original results, store those, and compare the
> new fingerprints against the old ones. This would kill all hope of
> finding where exactly any error is, but as long as we keep track of the
> query where the failure happened, debuggers would know where to look.
> Moreover, in the distribution for regular users, we could include the
> fingerprints only, making the distribution lighter.
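>
> With the standard Digest::MD5 module this is trivial; a minimal sketch
> of the comparison (the %gold / %new hashes stand in for the outputs,
> keyed by query id):
>
>   #!/usr/bin/perl -w
>   use strict;
>   use Digest::MD5 qw(md5_hex);
>
>   # pretend these came back from cqp, already normalised as above
>   my %gold = ( q01 => "old output of query 1" );
>   my %new  = ( q01 => "new output of query 1" );
>
>   foreach my $id (sort keys %gold) {
>       my $old_fp = md5_hex($gold{$id});  # this is all we need to store
>       print "$id: query results differ\n"
>           if md5_hex($new{$id}) ne $old_fp;
>   }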
>
> We will have to constantly update our database of "current" queries,
> so that we can test on increasingly large test corpora (e.g., I imagine
> a cycle like: we test version 2 by comparing it to version 1 on a
> corpus of, say, 350M words, which is near the maximum that version 1
> can deal with, but not the maximum that version 2 can deal with; if v2
> passes the test, we create a new gold standard based on queries made
> with version 2 against a corpus of, say, 1 billion words, which is the
> maximum v2 can deal with, and that's what we'll use to test v3, and so
> on).
>
> This leaves the question of 1) open.
>
> 1a) One possibility would be to keep a few full corpora as part of the
> testing suite. For example, we could use Dickens or German Law, as they
> are richly annotated, albeit small, and then a big chunk of one or more
> of our Web corpora, to test CWB with larger inputs.
>
> 1b) Another possibility, suggested by Stefan, would be to generate one
> or more random corpora (from fixed seeds, for replicability), and run
> tests on them.
>
> Advantages of 1a): less work; we test the system on realistic data
> (which might have some weird asymmetry that turns out to be a problem
> for CWB and would not come out in randomly generated data)
>
> Problems of 1a): the test suite becomes hard to distribute, since it
> will be huge; only features that are present in the target corpora are
> tested
>
> Advantages of 1b): we can distribute scripts that generate the
> corpus/corpora; these will be considerably lighter than the corpora
> themselves, but could generate corpora as large as needed; we could
> also test a larger range of corpus typologies, albeit artificially
> generated
>
> Problems of 1b): more work (although generating artificial corpora
> sounds like a fun task); there might be some weird distribution that
> happens in real data and is problematic for CWB, but that we would
> completely miss in a random corpus (but, then again, there is no
> guarantee that the specific real corpora we could be using instead
> would include the interesting problem cases)
>
> So, I'm fairly convinced that 1b) is the way to go.
>
> How should the random corpora be generated?
>
> I'm thinking about something along the following lines.
>
> The only free parameter is the number of documents to be generated, which
> indirectly determines the approximate size of the resulting corpus.
>
> For each document to be generated, the program first produces a
> <document> element:
>
> <document id="1" att1="xlfklghjngothg" att2="a" att3="yes">
> ...
> </document>
>
> id is an increasing numerical id; att1 is a random string of arbitrary
> length (in a range of, say, up to 1000 characters); att2 has 10
> equiprobable possible values; att3 has 2 values (with probability, say,
> 80% and 20%, respectively). This should allow us to test for a
> plausible typology of attributes.
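>
> As a sketch (the att2/att3 values are placeholders):
>
>   #!/usr/bin/perl -w
>   use strict;
>
>   srand(42);   # fixed seed, for replicability
>   my @alpha = ('a' .. 'z');
>
>   sub random_string {   # random length up to 1000 characters
>       my $len = 1 + int(rand(1000));
>       return join "", map { $alpha[int rand @alpha] } 1 .. $len;
>   }
>
>   my $n_docs = shift || 10;   # the only free parameter
>
>   foreach my $id (1 .. $n_docs) {
>       my $att1 = random_string();
>       my $att2 = ('a' .. 'j')[int rand 10];    # 10 equiprobable values
>       my $att3 = rand() < .8 ? "yes" : "no";   # 80% / 20%
>       print qq(<document id="$id" att1="$att1" att2="$att2" att3="$att3">\n);
>       # ... header and body generation go here (see below) ...
>       print "</document>\n";
>   }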
>
> Next, for each document, there is a probability of, say, 30% that a
> <header> element will be generated (so that we account for the presence
> of optional structure inside documents). If it is generated, it will
> contain a number of sentences between 1 and 10, each possibility
> equiprobable (for the sentences, see below).
>
> Next, a <body> element is generated. The number of sentences for the
> body is obtained from a binomial distribution with n = 400 and p = .50
> (so that most texts will have about 200 sentences).
>
> For each sentence <s>, it is decided whether it is superlong (p =
> 0.00005) or standard. If it's superlong, it will have a number of units
> sampled from a distribution centered around 1000 tokens or so; if it's
> standard, the number of units will follow a binomial distribution (or
> similar) centered around 10 (allowing for -- rare -- occurrences of
> empty sentences).
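>
> The binomial sampling needs nothing fancier than summing coin flips; a
> sketch (the superlong distribution is a placeholder):
>
>   # sample from a binomial distribution: n Bernoulli(p) trials
>   sub binomial {
>       my ($n, $p) = @_;
>       my $k = 0;
>       for (1 .. $n) { $k++ if rand() < $p }
>       return $k;
>   }
>
>   my $n_sents = binomial(400, .5);   # body length, centred on 200
>
>   my $sent_len = rand() < 0.00005
>       ? 800 + int(rand(401))         # superlong: roughly 800-1200 units
>       : binomial(20, .5);            # standard: centred on 10, 0 possible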
>
> For each unit, it is decided whether it is a token (prob. .7) or a
> multi-word structure (prob. .3), simulating phrases in a partial parse
> or multi-word units.
>
> If it is a multi-word structure, one of four tags (say, <ap>, <bp>,
> <cp> or <dp>) is generated. Then, a number of tokens between 1 and 5
> (equiprobable) is generated inside the mw structure.
>
> A token, whether inside the header, inside a mw structure or directly
> under <body>, has 4 positional attributes:
>
> word lemma pos green
>
> For the word attribute, a length is determined as follows. First, it
> is decided whether the word is superlong (p = 0.0001) or standard.
>
> If the word is superlong, its length is sampled from some distribution
> centered around, say, 1000 characters.
>
> If the word is standard, its length will be based on, say, a binomial
> distribution with n=15 and p=.25, or something like that (avoiding 0s,
> I guess).
>
> A word is then generated by picking as many characters from an
> alphabet as the length sampled at the previous stage requires
> (characters could be equiprobable or have different probabilities --
> in any case, I don't plan to implement phonotactic constraints! ;-)
> Although this is not quite Wentian Li's monkey language experiment, I
> would expect it to generate a distribution of word frequencies that is
> reasonably Zipfian, as words like "a" and "yl" will be very frequent,
> and there will be a multitude of different long types such as
> "fkljeskjg".
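>
> As a sketch (binomial() as above; shifting by 1 is one cheap way of
> avoiding zero-length words):
>
>   my @alpha = ('a' .. 'z', 'A' .. 'Z');   # latin1-ish, equiprobable
>
>   sub binomial {
>       my ($n, $p) = @_;
>       my $k = 0;
>       for (1 .. $n) { $k++ if rand() < $p }
>       return $k;
>   }
>
>   sub generate_word {
>       my $len = rand() < 0.0001
>           ? 800 + int(rand(401))     # superlong: around 1000 characters
>           : 1 + binomial(14, .25);   # standard: mean length about 4.5
>       return join "", map { $alpha[int rand @alpha] } 1 .. $len;
>   }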
>
> (You might wonder why not directly use a unigram language model
> trained on a real corpus: because that would be relatively big, whereas
> the current proposal means that a relatively small program that does
> not need any external resource can generate a very large corpus; plus,
> in this way, if we want to do extensive testing, we can generate a lot
> of different corpora, perhaps playing with various parameters; plus,
> this sounds like fun!)
>
> The lemma is constructed by truncating any word longer than 5
> characters to its first 5 characters and by lower-casing (assuming we
> are sampling from an alphabet where it makes sense to distinguish
> between upper and lower case), so that, as with real-life lemmas, the
> lemma distribution should have more high-frequency items and fewer
> low-frequency items, while still being a very type-rich distribution.
>
> POS's would be picked from a list of, say, 50 elements, with
> probability (am I becoming too baroque?) conditioned on the length of
> the word, so that tags more likely to occur with short words will be
> token-rich, like function word tags, whereas tags more likely to occur
> with longer words will be type-rich, like content word tags. At the
> same time, tags are not associated with particular words, so there will
> be a good amount of ambiguity, for short frequent words at least. (As I
> understand it, nothing in the current version of CWB hinges on the
> joint distribution of positional attributes, so all this probably does
> not matter, but just in case...)
>
> Finally, green is a binary attribute, with distribution P(no)=.8,
> P(yes)=.2, just to try yet another way people might decide to encode
> information, one that leads to an attribute with a distribution
> different from those of words, lemmas and pos's.
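>
> Deriving the remaining attributes is then a few lines per token; a
> sketch (the tag lists and the length threshold are placeholders):
>
>   my $word = "Fkljeskjg";   # in the real script: generate_word() above
>
>   # lemma: lower-cased, truncated to at most 5 characters
>   my $lemma = lc(substr($word, 0, 5));
>
>   # pos: conditioned on word length -- few token-rich tags for short
>   # words, many type-rich tags for longer ones
>   my @short_tags = qw(DET PREP CONJ PRON AUX);
>   my @long_tags  = map { "TAG$_" } 1 .. 45;
>   my $pos = length($word) <= 3
>       ? $short_tags[int rand @short_tags]
>       : $long_tags[int rand @long_tags];
>
>   # green: binary, P(no)=.8, P(yes)=.2
>   my $green = rand() < .8 ? "no" : "yes";
>
>   print join("\t", $word, $lemma, $pos, $green), "\n";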
>
> When all the units belonging to a <s> have been generated,
> end-of-sentence punctuation might be introduced (with p=.9; if not, the
> sentence ends without punctuation); eos punctuation marks are special
> tokens, such as:
> .    .    EOS    no
>
> There are very few of them, so they will be highly frequent (they can
> be sampled from a distribution such as p(.)=.7, p(?)=.2, p(!)=.1, or
> something like that).
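>
> The weighted choice (here and for the attributes above) is just a
> cumulative-probability lookup; a sketch:
>
>   sub weighted_pick {
>       my %dist = @_;   # e.g. ('.' => .7, '?' => .2, '!' => .1)
>       my $r = rand();
>       foreach my $item (keys %dist) {
>           return $item if ($r -= $dist{$item}) < 0;
>       }
>       return (keys %dist)[0];   # guard against rounding errors
>   }
>
>   if (rand() < .9) {   # p=.9 that the sentence gets eos punctuation
>       my $eos = weighted_pick('.' => .7, '?' => .2, '!' => .1);
>       print "$eos\t$eos\tEOS\tno\n";
>   }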
>
> In this way, we will be able to generate richly annotated corpora of
> arbitrary size, illustrating various frequency distributions.
>
> I've been vague about the issue of the alphabet we are sampling from,
> but I would be inclined to say that it should be a latin1-like alphabet
> (as Asian scripts will have very different character frequency
> distributions, and I don't want to think about that right now). Re the
> encoding... dunno... would it make sense/be worth it to generate 2
> corpora, one in, say, latin-1 and one in utf8, or, since we already
> know that CWB supports all encodings known to mankind, should we not
> worry about this?
>
> We could have 2 testing modes: one that is run as part of the standard
> installation procedure, where a random corpus of, say, 10M tokens is
> generated, indexed and used for testing; and one that is run by the
> official testers before a release, and uses a random corpus that is
> near the upper limit of tokens that the previous version of CWB could
> handle (and of course testing could then take hours or days). We could
> also have a third mode, where we generate a corpus as large as the
> tested version of CWB is supposed to handle, and we just check if CWB
> survives indexing and a few queries, without comparing the results to
> any previous version.
>
> I thought of implementing all this as one or more perl and bash
> scripts that do both indexing and querying by using Stefan's modules,
> although Stefan seemed to have doubts about whether that's a good
> idea... I don't really see any low-effort alternative... or?
>
> Does this sound reasonable? Comments? Alternative ideas?
>
> Please feed back.
>
> Regards,
>
> Marco


