[CWB] CQPWeb Context Line Breaks

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Aug 7 01:10:38 CEST 2016


Hi Stevie,

I am afraid I don't understand. If your input file contains "a list of tokens (one token per line)", then how is it possible that "new lines within the text file are ignored (so, for example, lines of verse are displayed on a single line)"?

I.e., if you are using newline characters (correctly) to indicate token breaks, then how can there *also* be newlines representing actual line breaks, as in verse etc.?

On the broader issue: CWB simply does not have the concept of a line break in its data model. Any multi-word structure such as a line or paragraph must be explicitly encoded using XML tags.
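As a minimal sketch of what such explicit encoding might look like for verse, the Python snippet below turns raw text into one-token-per-line input in which every verse line is wrapped in an XML region. The element name <l>, the function name verse_to_vrt and the naive regex tokeniser are all assumptions made for illustration; they are not part of CWB or CQPweb, and text_id is assumed to be plain letters and digits.

```python
import re
from xml.sax.saxutils import escape


def verse_to_vrt(text_id, raw_text):
    """Return one-token-per-line input with each verse line wrapped in <l>...</l>.

    Hypothetical helper: the <l> tag name and the tokeniser are illustrative only.
    """
    out = [f'<text id="{text_id}">']
    for line in raw_text.splitlines():
        # Naive tokenisation: runs of word characters, or single punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", line)
        if not tokens:
            continue  # blank lines carry no tokens, so no region is emitted
        out.append("<l>")
        # Escape &, < and > so that tokens cannot be mistaken for XML tags.
        out.extend(escape(tok) for tok in tokens)
        out.append("</l>")
    out.append("</text>")
    return "\n".join(out)


if __name__ == "__main__":
    print(verse_to_vrt("poem1", "Roses are red,\nViolets are blue."))
```

With input like this, the verse line becomes a region in its own right (here an s-attribute named l, provided it is declared when the corpus is indexed), rather than something implied by newlines in the original file.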

Because of this, CQPweb's default behaviour in extended context mode is to insert a line-break after sentence-final punctuation, simply to make the output more readable (as the raw token-stream would print as an undifferentiated block of text).

However, you *can* switch off the default behaviour, and set up a corpus so that a particular s-attribute (XML region type) is rendered as a paragraph break.

This is explained in chapter 9 of the manual.

https://cqpweb.lancs.ac.uk/doc/CQPwebAdminManual.pdf

best

Andrew.


-----Original Message-----
From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Stephen Barrett
Sent: 05 August 2016 11:42
To: Open source development of the Corpus WorkBench
Subject: [CWB] CQPWeb Context Line Breaks

Dear All, 

Apologies for the rather basic question:

We have imported a corpus into CQPWeb using the standard "Install a new corpus" interface. The corpus has not been indexed in CWB beforehand and is very simple indeed, comprising a text file with a number of <text> elements each just containing a list of tokens (one token per line) and no other information.

When the context for a search item is displayed in CQPWeb, new lines within the text file are ignored (so, for example, lines of verse are displayed on a single line). Conversely, prose paragraphs are split with line breaks after each end-of-sentence punctuation mark (full stop, question mark, etc.).

Looking at the DICKENS example corpus, we can see that the text is formatted as one would expect, with paragraphs displaying properly and so on, so we are sure that we are doing something wrong.

The question is, then, is there something we should be doing when we generate our input file to ensure that line breaks within the corpus are preserved and that no extra line breaks (after each sentence) are added? We have searched through the documentation as I'm sure we're missing something obvious and basic but have come up short.

Many thanks in advance.

Stevie

