[CWB] Trying to deal with tag problems when encoding

Fri May 19 08:27:04 CEST 2017

On Fri, May 19, 2017 at 2:01 AM, Stefan Evert <stefanML at collocations.de>
wrote:

Hi Stefan,

>> 1. Can a single missing closing tag really have this massive cascading
>> effect on all texts processed after it?
>
>
> Yes, because you're encoding with -S text:0, telling cwb-encode that
> <text> elements can be nested (but any nested ones should be ignored).  So
> if there's a single missing </text> in one of your files, the current
> <text> region will stay open until the end of the corpus and all following
> .vrt files will be nested in it (and hence dropped).
>

Quite right. I've never been able to stop assuming that files are a
meaningful unit to CWB. As a result, I was thinking that whatever the
problem, it should not extend beyond a single file, which isn't the case,
as you point out.
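
Since the input is effectively one continuous stream, a per-file check before encoding could catch a missing </text> earlier. Below is a minimal sketch in Python, assuming every <text ...> and </text> tag sits on its own line and that the input files end in .vrt; the function names are illustrative and not part of CWB:

```python
import re
from pathlib import Path

# A start tag such as <text id="..."> alone on its line (but not <text_id ...>).
OPEN_TAG = re.compile(r"^<text(\s[^>]*)?>\s*$")
# An end tag </text> alone on its line.
CLOSE_TAG = re.compile(r"^</text>\s*$")

def unbalanced_text_tags(lines):
    """Return (opens, closes): counts of <text> start and end tags on their own line."""
    opens = closes = 0
    for line in lines:
        if OPEN_TAG.match(line):
            opens += 1
        elif CLOSE_TAG.match(line):
            closes += 1
    return opens, closes

def check_directory(vrt_dir):
    """Yield (filename, opens, closes) for each .vrt file whose counts differ."""
    for path in sorted(Path(vrt_dir).glob("*.vrt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            opens, closes = unbalanced_text_tags(handle)
        if opens != closes:
            yield path.name, opens, closes
```

Calling list(check_directory("some/dir")) on a directory of .vrt files (the path here is hypothetical) would list only the files where the two counts disagree.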

> This is also how you can find the culprit quite easily.  Use cwb-s-decode
> to look for a huge <text> region (very probably the last one in the
> corpus), then check its id to locate the problematic file.
>
>         cwb-s-decode CC-C -S text_id | tail -1
>
> should do the trick (but do check whether that's really a very large
> region).
>

Excellent! That nailed one of the problematic cases right off the bat, and
gave me the name of the file I needed to fix. Somehow, the final line in
the file ended up being this:

el     el     <     </text>

I assume the source file was truncated before tagging, as it ends not just
in the middle of a phrase, but with an incomplete line -- there should be
five columns in all, and there are only three plus the tag.
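
Lines like this could probably be caught before encoding. Here is a rough Python sketch that flags token lines containing something tag-like or having the wrong number of columns. It assumes tab-separated columns and five columns per token, as in my files, and it simply treats every line starting with "<" as a structural tag, which is a simplification; the function name is made up for illustration:

```python
import re

# Something that looks like an XML tag: "<" directly followed by a
# non-space character, then anything up to the next ">".
TAG_INSIDE = re.compile(r"<[^<>\s][^<>]*>")

def suspicious_lines(lines, expected_columns=5):
    """Yield (line_number, reason, text) for token lines that look broken."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("<"):
            continue  # skip blank lines and (assumed) structural tag lines
        if TAG_INSIDE.search(line):
            yield number, "embedded tag", line
        elif len(line.split("\t")) != expected_columns:
            yield number, "column count", line
```

Run over an open .vrt file, it reports the line number within that file, so the offending file is known from the start.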

By the way, what's the logic at work in this command? Are corpus contents
ordered according to size?
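
In case the order is not by size, the region lengths can be compared directly instead of relying on the last line. A small Python sketch, assuming the cwb-s-decode output has one region per line in the form start, end, value, separated by tabs (that format is my assumption here, and the function names are illustrative):

```python
def parse_regions(lines):
    """Parse 'start<TAB>end<TAB>value' lines into (start, end, value) tuples."""
    regions = []
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) < 3:
            continue  # ignore lines that do not fit the assumed format
        regions.append((int(fields[0]), int(fields[1]), fields[2]))
    return regions

def largest_regions(regions, top=3):
    """Return the `top` regions covering the most tokens, largest first."""
    return sorted(regions, key=lambda r: r[1] - r[0] + 1, reverse=True)[:top]
```

With the decoder output saved to a file, largest_regions(parse_regions(open(path))) would show whether one <text> region is far larger than the rest.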

> Since the last line doesn't start with "<", it isn't recognized as an XML
> tag and inserted as a literal token (without annotations, so lemma, syn, …
> will be __UNDEF__ at this point).  If you have indexed the corpus
> completely, you can also use CWB to look for
>
>         ".*<.+>.*";
>

Wonderful -- this found another two cases of bad tags! Unfortunately, this
method doesn't show the filename, but the corpus index number. Any
suggestion on getting the name of the file from this? It's encoded as the
unique text identifier, so the info should be there, but show +text_id
doesn't change what is shown.
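
One workaround outside CQP might be to map the corpus position back to its text_id using the cwb-s-decode output from above. A minimal Python sketch, assuming that output lists regions in corpus order as start, end, value separated by tabs (an assumption on my part; the function names are hypothetical):

```python
import bisect

def load_regions(lines):
    """Read 'start<TAB>end<TAB>value' lines into a list of (start, end, value)."""
    regions = []
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) == 3:
            regions.append((int(fields[0]), int(fields[1]), fields[2]))
    return regions

def region_for_position(regions, cpos):
    """Return the value of the region containing corpus position cpos, or None.

    Assumes regions are sorted by start position and do not overlap.
    """
    starts = [start for start, _, _ in regions]
    index = bisect.bisect_right(starts, cpos) - 1
    if index >= 0 and regions[index][0] <= cpos <= regions[index][1]:
        return regions[index][2]
    return None
```

Given the corpus position of a match, region_for_position would return the text_id, which in my case is the filename.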

> As a hotfix, just encode without the :0 nesting specifier, i.e.
>
>         … -S text+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source
>
> cwb-encode then assumes that <text> regions cannot be nested and will
> automatically insert a </text> end tag whenever it encounters a new region.
>

I'll give this a shot if I can't manage to identify the two files I need to
fix.

Thanks a ton!
Scott