[CWB] Unable to index a corpus

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Aug 4 00:25:02 CEST 2017


Hi Jorge,

No, it's rather simpler than that:

Step 1 - index the corpus using command-line CWB (wherever you like on the system, as long as the files/directories you create are in a location on the file system where the web server's user account has permission to read them)

Step 2 - go to the "Install new corpus" page in CQPweb, and click on the link at the top that says "Click here to install a corpus you have already indexed in CWB."

Step 3 - specify the location of the registry file. (This will be copied into CQPweb's own registry if not already there; the index files themselves will not be copied or moved.)

Step 4 - once you've installed the corpus this way, proceed to the other installation steps (generate your text metadata from the XML attributes on <text>, set up frequency lists, etc.)

At this point the corpus ought to behave identically to one set up in CQPweb from the start.
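
For Step 1, here is a rough sketch of how the two command lines might be assembled. It only builds the argument lists and prints them; it does not run anything. The corpus name, the .vrt file name and the /corpora/... paths are hypothetical placeholders, and the text:0+id+title+domain declaration simply combines the :0 option discussed earlier in this thread with the <text> attributes from Jorge's example. Please check the flags against cwb-encode -h and cwb-makeall -h for your installed version before relying on them.

```python
import shlex
from pathlib import PurePosixPath


def build_index_commands(corpus, vrt_file, data_dir, registry_dir,
                         text_attrs=("id", "title", "domain")):
    """Return (encode_cmd, makeall_cmd) as argument lists; nothing is executed."""
    name = corpus.lower()  # registry file names are lowercase
    # Declare <text> as a non-nesting region (:0) with its XML attributes.
    s_attr = "text:0" + "".join("+" + a for a in text_attrs)
    encode_cmd = [
        "cwb-encode",
        "-c", "utf8",                                   # assumed character set
        "-xsB",                                         # XML-aware, skip blank lines, strip whitespace
        "-d", str(PurePosixPath(data_dir) / name),      # index (data) directory
        "-f", str(vrt_file),                            # verticalised input file
        "-R", str(PurePosixPath(registry_dir) / name),  # registry file to create
        "-S", s_attr,
    ]
    # cwb-makeall takes the registry directory and the corpus ID in uppercase.
    makeall_cmd = ["cwb-makeall", "-r", str(registry_dir), "-V", name.upper()]
    return encode_cmd, makeall_cmd


# Hypothetical names and paths, for illustration only.
encode_cmd, makeall_cmd = build_index_commands(
    "MyCorpus", "mycorpus.vrt", "/corpora/data", "/corpora/registry")
print(shlex.join(encode_cmd))
print(shlex.join(makeall_cmd))
```

The registry file path passed to -R is then what you would enter in Step 3, provided the web server's user account can read it and the data directory.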

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of VIVALDI PALATRESI, JORGE
Sent: 03 August 2017 10:05
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Unable to index a corpus

Andrew, Stefan,

As I have full access to the web data directories, I will try to manually
index my corpora, copy the index files and registry entries to the right
places, and adjust the paths accordingly. According to this suggestion, the
full procedure would be as follows:
- use CQPweb to register the corpus
- it will fail to index, so I will do it manually and copy the files into the
data directories
At this point, will CQPweb see the newly indexed corpus?

Regarding the metadata, each corpus file must have its own metadata.
Therefore the corpus cqp file should have the following format:
  <text id="m00105" title="title of document m00105" domain="medicine"> ... </text>
  <text id="d00016" title="title of document d00016" domain="law"> ... </text>
  ...
Assuming this is correct, may I run the same queries on this
corpus as on any other corpus indexed with the regular CQPweb procedure?
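
To make the format above concrete, here is a small sketch that writes one <text> region per document with its metadata as XML attributes. It assumes one token per line inside each region (the usual verticalised input for cwb-encode); the function name and the sample tokens are made up for illustration, and the attribute names are simply the ones from my example.

```python
from xml.sax.saxutils import escape


def _attr(value):
    """Escape a metadata value for use inside a double-quoted XML attribute."""
    return escape(value, {'"': "&quot;"})


def text_element(text_id, title, domain, tokens):
    """Render one document as a <text> region with one token per line."""
    open_tag = '<text id="{}" title="{}" domain="{}">'.format(
        _attr(text_id), _attr(title), _attr(domain))
    # Tokens are escaped too, so that "<" and "&" are not mistaken for markup.
    lines = [open_tag] + [escape(tok) for tok in tokens] + ["</text>"]
    return "\n".join(lines) + "\n"


# Hypothetical sample documents, reusing the IDs from the example above.
docs = [
    ("m00105", "title of document m00105", "medicine", ["The", "patient", "."]),
    ("d00016", "title of document d00016", "law", ["The", "court", "."]),
]
print("".join(text_element(*doc) for doc in docs), end="")
```

Escaping the attribute values should keep titles that contain quotes or ampersands from breaking the <text> tag.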

Thank you very much for your help

Best,
Jorge


2017-08-02 9:00 GMT+02:00, Stefan Evert <stefanML at collocations.de>:
>
>> On 2 Aug 2017, at 02:54, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
>>
>> At present you have a choice of 3 bodges available in command-line
>> cwb-encode: (a) with +N, to automatically rename nested elements so you
>> get tag1, tag2, tag3 as your attributes; (b) with no +N, to treat every
>> new <tag> as the beginning of a new non-nested region even if the previous
>> one is unclosed; (c) with +0, to totally ignore nested regions.
>
> I think Jorge wants to go with the :0 solution (not "+0", by the way), which
> he found in the Corpus Encoding Tutorial.  The main question was how to tell
> CQPweb to use this option when indexing the corpus.
>
> Jorge, if you have admin access to the Web server('s data directories), it's
> usually better to index the CWB corpus yourself (perhaps even on your local
> computer), and then simply copy the index files and registry entry to the
> Web server, put them into the right directories and adjust paths
> accordingly.
>
> I always use this approach, even for small corpora, and try to put all the
> metadata into <text> tags so that the entire CQPweb installation procedure
> runs from the pre-indexed corpus and I don't have to upload any additional
> files via the Web interface.  Works very well for me.
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>


-- 
Jorge Vivaldi Palatresi
Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
C/ Roc Boronat, 138
08018 Barcelona
Espanya

+34 93 542 2332

