[CWB] "Segmentation fault (core dumped)" on various versions

Stefan Evert stefanML at collocations.de
Wed Jul 24 08:43:48 CEST 2013


On 24 Jul 2013, at 04:30, Scott Sadowsky <ssadowsky at gmail.com> wrote:

> Something very strange is going on. I've replaced my index for this corpus with a third backup copy, and the following happened:
> 
> PERS-DIVER-USENET> "jai"
> 0 matches.
> PERS-DIVER-USENET> ".+ai"
> Segmentation fault (core dumped)
> 
> Here the search for "jai", which previously caused a segfault, worked. So all seemed good. But the search returned 0 hits, instead of the 1 which is returned by the command cwb-lexdecode -f -p '.ai' PERS-DIVER-USENET. So something isn't adding up here.

If this is indeed a buffer overflow or something similar, triggered by a faulty index file, then somewhat erratic behaviour is not surprising.

> I suspect the next step is to rebuild the index from scratch, but that involves decompressing a ZIP file with 1.2 million files inside it, which I'd rather avoid if at all possible.

If you have "cwb-make" from the CWB/Perl modules, you can simply trash the ".crc" and ".crx" files (which contain the actual lookup index that appears to be damaged) and rebuild them with

	cwb-make [...] PERS-DIVER-USENET

Of course, make sure you have a backup copy of the corpus beforehand.
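
One cautious way to "trash" the index files is to move them aside rather than delete them, so that they can be restored if the rebuild goes wrong. The following is only a minimal sketch for a POSIX shell: the function name stash_index_files and both directory arguments are hypothetical, and the real data directory has to be taken from the registry entry of the corpus.

```shell
# Hypothetical helper: move the compressed index files (*.crc, *.crx)
# from a corpus data directory into a backup directory.
# Usage (paths are placeholders, not real locations):
#   stash_index_files /path/to/corpus/data /path/to/backup
stash_index_files() {
    data_dir=$1
    backup_dir=$2

    # Create the backup directory if it does not exist yet.
    mkdir -p "$backup_dir" || return 1

    for f in "$data_dir"/*.crc "$data_dir"/*.crx; do
        # An unmatched glob stays as a literal pattern; skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "$backup_dir"/ || return 1
    done
}
```

All other files in the data directory are left untouched, so only the index that appears to be damaged has to be rebuilt.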

You should also be able to rebuild the index files manually with "cwb-makeall" and "cwb-compress-rdx", but those tools sometimes get confused about which files need to be rebuilt in which order.
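
For the manual route, a rough sketch of the order in which the two tools would be called is shown here. It assumes that the corpus is found in the default registry, that "cwb-makeall -V" recreates and validates the missing uncompressed index, and that "cwb-compress-rdx -A" then compresses it for all positional attributes. The function name rebuild_index is hypothetical, and the sketch does not guard against the tools getting confused about what needs rebuilding.

```shell
# Hypothetical helper: rebuild and recompress the index of one corpus.
# Usage: rebuild_index PERS-DIVER-USENET
rebuild_index() {
    corpus=$1

    # Step 1: recreate missing index components and validate them (-V).
    cwb-makeall -V "$corpus" || return 1

    # Step 2: compress the index of all positional attributes (-A),
    # which writes fresh .crc and .crx files.
    cwb-compress-rdx -A "$corpus" || return 1
}
```

The function stops at the first failing step, so a validation error from cwb-makeall is not masked by the compression step.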


If you do need to re-encode from scratch, an easier solution than unpacking the original ZIP file is

	cwb-decode -Cx PERS-DIVER-USENET -ALL | cwb-encode -x [...] <appropriate declarations>

Note that the attribute declarations in the cwb-encode command will be different from the ones you used for the original encoding, because attributes on XML regions are not decoded in proper XML notation.


Hope that one of these steps helps!
Stefan


