[CWB] INVALID_CTRL marking \n wrongly? (schtepf)

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Jan 5 16:53:57 CET 2011


Yes, known issue. Stefan and I were actually talking about precisely this at the start of October, when term hit and we suddenly had no more time for programming.

The current situation is clearly wrong BUT there are certain implications regarding parity of treatment of C0 control chars in Latin1 vs utf8 so it's not obvious what the Right Thing is. 

To solve the immediate issue I have changed the INVALID_CTRL macro and bumped the level-3 version number. Please get the latest commit and recompile.
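
For illustration only, here is a minimal sketch of the kind of check involved. It is not the actual CWB macro: the function names are hypothetical, and exempting tab, LF and CR from the C0 range is an assumption, since what exactly should be exempted is the open question mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a control-character check (NOT the real
   INVALID_CTRL macro). Assumption: tab, LF and CR are allowed, every
   other C0 control character (0x00-0x1F) is rejected. */
int is_invalid_ctrl(unsigned char c)
{
    if (c == '\t' || c == '\n' || c == '\r')
        return 0;
    return c < 0x20;
}

/* Hypothetical helper: strip the trailing "\n" or "\r\n" that fgets()
   leaves in the buffer. Returns the new string length. */
size_t chomp_line(char *s)
{
    size_t n = strlen(s);
    while (n > 0 && (s[n - 1] == '\n' || s[n - 1] == '\r'))
        s[--n] = '\0';
    return n;
}

/* Returns 1 if no byte of s is a rejected control character, else 0. */
int line_is_clean(const char *s)
{
    for (; *s != '\0'; s++)
        if (is_invalid_ctrl((unsigned char)*s))
            return 0;
    return 1;
}
```

With a check like this, a line read by fgets() such as "word\tNN\tlemma\n" is accepted whether or not chomp_line() has been applied first, while a line containing some other control byte is still rejected.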

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Alberto Simões
Sent: 05 January 2011 15:36
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] INVALID_CTRL marking \n wrongly? (schtepf)



On 05/01/2011 15:26, Alberto Simões wrote:
>
> From what I can see, the buffer is validated directly after fgets, without
> any kind of pre-processing. Therefore, the newline stays there until
> INVALID_CTRL marks it as invalid.
>
> Stefan or Andrew?

svn blame blames schtepf for that code.
Thanks
>
> Thanks
> Hug
> Alberto
>
> On 05/01/2011 15:12, Alberto Simões wrote:
>> Hello,
>>
>> As far as I can tell, cl_string_validate_encoding is being called with a
>> string that ends with a new line (in fact, the first line of the file
>> being processed), and INVALID_CTRL marks the newline as an invalid
>> character.
>>
>> Wondering if this is a recent change or, if not, why this is happening
>> on this machine.
>>
>> Thanks
>>
>>
>>
>> On 05/01/2011 14:22, Alberto Simões wrote:
>>>
>>> Found out that encode is failing:
>>>
>>> [ambs at search CWB]$ /share/apps/amalandro/bin/cwb-encode -s -x -U '' -R tmp/registry/vss -d tmp/vss -f data/vrt/VeryShortStories.vrt -p - -P word -P pos -P lemma -0 collection -S 'story:0+num+title+author+year' -S 'chapter:0+num' -S 'p:0' -S 's:0'
>>> Encoding error: an invalid byte or byte sequence for charset "latin1" was encountered.
>>>
>>> And VeryShortStories.vrt does not include any characters outside latin1.
>>>
>>> So, probably CWB is not compiling correctly?
>>>
>>> Thanks
>>>
>>>
>>> On 04/01/2011 22:22, Alberto Simões wrote:
>>>> Hello
>>>>
>>>> I am trying to install CWB on a cluster, and when running make check, I
>>>> get a lot of errors (below). This is CWB and Perl CWB from svn head.
>>>> Let me know if you have any idea of what is going on.
>>>>
>>>> Thanks
>>>>
>>>> [ambs at search CWB]$ make test
>>>> PERL_DL_NONLAZY=1 /share/apps/amalandro/perls/perl-5.12.2/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
>>>> t/00_load.t ............ ok
>>>> t/10_cwb_tools.t ....... ok
>>>> t/11_cwb_file.t ........ ok
>>>> t/12_cwb_tempfile.t .... ok
>>>> t/13_cwb_shell.t ....... ok
>>>> t/14_cwb_registry.t .... ok
>>>> t/20_encode_vss.t ...... 1/6
>>>> # Failed test 'corpus encoding and indexing'
>>>> # at t/20_encode_vss.t line 42.
>>>> # VSS corpus encoded in 0.1 seconds
>>>> # data file 'story_num.avs' is corrput
>>>> # failed to create data file 'word.huf.syn'
>>>> # failed to create data file 'word.hcd'
>>>> # data file 'lemma.lexicon' is corrput
>>>> # data file 'story_author.avs' is corrput
>>>> # failed to create data file 'word.crc'
>>>> # data file 'story_year.avx' is corrput
>>>> # data file 'story_num.rng' is corrput
>>>> # failed to create data file 'pos.huf.syn'
>>>> # data file 'story.rng' is corrput
>>>> # data file 'story_year.avs' is corrput
>>>> # data file 'chapter_num.avx' is corrput
>>>> # data file 'pos.lexicon.idx' is corrput
>>>> # data file 'story_num.avx' is corrput
>>>> # data file 'chapter.rng' is corrput
>>>> # data file 'story_year.rng' is corrput
>>>> # failed to create data file 'lemma.corpus.cnt'
>>>> # data file 'chapter_num.rng' is corrput
>>>> # data file 'lemma.lexicon.idx' is corrput
>>>> # data file 'story_title.rng' is corrput
>>>> # failed to create data file 'word.huf'
>>>> # failed to create data file 'lemma.crx'
>>>> # failed to create data file 'pos.corpus.cnt'
>>>> # failed to create data file 'pos.lexicon.srt'
>>>>
>>>> # Failed test 'validation of created data files'
>>>> # at t/20_encode_vss.t line 68.
>>>>
>>>> # Failed test 'validation of generated registry entry'
>>>> # at t/20_encode_vss.t line 80.
>>>> Use of uninitialized value $mode in bitwise and (&) at t/20_encode_vss.t line 85.
>>>>
>>>> # Failed test 'correct file access permissions (word.huf)'
>>>> # at t/20_encode_vss.t line 85.
>>>> # got: '0000'
>>>> # expected: '0640'
>>>> CWB::OpenFile: Can't open file/pipe 'tmp/vss/.info' in mode '<': No such file or directory at t/20_encode_vss.t line 87
>>>> # Looks like you planned 6 tests but ran 5.
>>>> # Looks like you failed 4 tests of 5 run.
>>>> # Looks like your test exited with 2 just after 5.
>>>> t/20_encode_vss.t ...... Dubious, test returned 2 (wstat 512, 0x200)
>>>> Failed 5/6 subtests
>>>> t/30_cqp_basic.t ....... 1/17 # TODO: write many, many more tests for CWB::CQP
>>>> t/30_cqp_basic.t ....... ok
>>>>
>>>
>>
>

-- 
Alberto Simões