[CWB] Line returns, etc.

Hardie, Andrew a.hardie at lancaster.ac.uk
Sat Sep 30 04:03:30 CEST 2017


Hi Graham,

I am afraid I am not sure what you mean by "formatting a plain text corpus for viewing in cqpweb" with regard to line breaks.

A corpus must in any case be tokenised before it can be indexed in CQPweb, so all the original line breaks will be dropped anyway (and replaced by line breaks representing token boundaries).
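To illustrate the point, here is a minimal sketch in Python. It assumes a deliberately naive regex tokeniser (the function name to_vertical and the sample sentences are made up for the example; this is not what CWB or CQPweb runs internally). It writes one token per line and shows that two inputs differing only in their line breaks come out identical:

```python
import re


def to_vertical(text: str) -> str:
    """Naive tokeniser (an assumption for illustration only):
    runs of word characters and single punctuation marks become tokens,
    and all whitespace, including line breaks, is discarded."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # One token per line: the only line breaks left mark token boundaries.
    return "\n".join(tokens)


# Hypothetical sample data: the same sentence with and without OCR line breaks.
ocr_text = "The quick brown\nfox jumps over\nthe lazy dog."
clean_text = "The quick brown fox jumps over the lazy dog."

# Both versions yield exactly the same token sequence.
assert to_vertical(ocr_text) == to_vertical(clean_text)

print(to_vertical(ocr_text))
```

Whatever tokeniser is actually used, the same reasoning applies as long as it treats a line break like any other whitespace.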

The very fact you are asking this question, then, makes me wonder: how have you been processing your input data so far (to tokenise it, that is)?

That aside:

>> Will search results be affected depending on the presence / absence of line breaks
No.

>> is the removal a waste of energy?
Almost certainly yes. 

>> I imagine it is probably a question of whether the processing is line- or stream-based...
Neither. It's based on an index, which is compiled from a tokenised file (in vertical format, i.e. one token per line) when the corpus is set up for use in CWB / CQPweb.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Graham Ranger -- UAPV
Sent: 25 September 2017 14:35
To: cwb at sslmit.unibo.it
Subject: [CWB] Line returns, etc.

Hello to all,
When formatting a plain text corpus for viewing in cqpweb, I like to remove unwanted line breaks that result from the use of OCR software. 
This can be slightly awkward to implement and I was wondering whether it is strictly necessary. Will search results be affected depending on the presence / absence of line breaks or is the removal a waste of energy? I imagine it is probably a question of whether the processing is line- or stream-based... but I'd appreciate some expert opinion!
Thanks in advance.
Best,
Graham.
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

