[CWB] Unable to index a corpus

VIVALDI PALATRESI, JORGE jorge.vivaldi at upf.edu
Fri Aug 4 16:32:44 CEST 2017


Hi Andrew,
I understand the suggested procedure.
Thank you for your valuable help.
Best,
Jorge


On Friday, 4 August 2017, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> Hi Jorge,
>
> No, it's rather simpler than that:
>
> Step 1 - index the corpus using command-line CWB (wherever you like on the
> system, as long as the files/directories you create are in a location on
> the file system where the web server's user account has permission to read
> them)
>
> Step 2 - go to the "Install new corpus" page in CQPweb, and click on the
> link at the top that says "Click here to install a corpus you have already
> indexed in CWB."
>
> Step 3 - specify the location of the registry file. (This will be copied
> into CQPweb's own registry if not already there; the index files themselves
> will not be copied or moved.)
>
> Step 4 - once you've installed the corpus in this way, proceed to the other
> installation steps (generate your text metadata from the XML attributes on
> <text>, set up frequency lists, etc.)
>
> At this point the corpus ought to behave identically to one set up in
> CQPweb from the start.
>
> best
>
> Andrew.
>
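Step 1 above can be sketched as a small Python helper that only assembles the two command lines, without running them. This is a minimal sketch under assumptions: the options used (-d, -f, -R, -c, -xsB, -S for cwb-encode; -r and -V for cwb-makeall) should be checked against `cwb-encode -h` and `cwb-makeall -h` on your installation, and the corpus name, paths and the helper `build_index_commands` are hypothetical. The `text:0+id+title+domain` declaration follows the ":0" solution and the <text> attributes discussed further down the thread.

```python
import shlex
from pathlib import PurePosixPath


def build_index_commands(corpus_id, vrt_file, data_dir, registry_dir,
                         text_attrs=("id", "title", "domain")):
    """Return (encode_cmd, makeall_cmd) as argument lists; nothing is executed."""
    corpus_id = corpus_id.lower()  # registry file names are lowercase
    registry_file = PurePosixPath(registry_dir) / corpus_id

    # <text> regions declared without nesting (":0"), plus their XML attributes.
    s_attr = "text:0" + "".join("+" + name for name in text_attrs)

    encode_cmd = [
        "cwb-encode",
        "-d", str(data_dir),       # directory that will hold the index files
        "-f", str(vrt_file),       # input file in vertical (one token per line) format
        "-R", str(registry_file),  # registry entry to create
        "-c", "utf8",              # assumption: the input is UTF-8
        "-xsB",                    # XML-aware input, skip blank lines, trim whitespace
        "-S", s_attr,
    ]
    # cwb-makeall takes the registry directory and the corpus name in uppercase.
    makeall_cmd = ["cwb-makeall", "-r", str(registry_dir), "-V", corpus_id.upper()]
    return encode_cmd, makeall_cmd


if __name__ == "__main__":
    # Hypothetical paths; they must be readable by the web server's user account.
    for cmd in build_index_commands("mycorpus", "/corpora/src/mycorpus.vrt",
                                    "/corpora/data/mycorpus", "/corpora/registry"):
        print(shlex.join(cmd))
```

The printed commands can be reviewed and then run by hand; the registry file path is what Step 3 asks for.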
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
> On Behalf Of VIVALDI PALATRESI, JORGE
> Sent: 03 August 2017 10:05
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Unable to index a corpus
>
> Andrew, Stefan,
>
> As I have full access to the web data directories, I will try to index
> my corpora manually, copy the index files and registry entries to the
> right places, and adjust the paths accordingly. According to this
> suggestion, the full procedure would be as follows:
> - use CQPweb to register the corpus
> - it will fail to index, so I will do it manually and copy the files into
> the data directories
> At this point, will CQPweb see the newly indexed corpus?
>
> Regarding the metadata, each corpus file must have its own metadata.
> Therefore the corpus CQP file should have the following format:
>   <text id="m00105" title="title of document m00105" domain="medicine"> ... </text>
>   <text id="d00016" title="title of document d00016" domain="law"> ... </text>
>   ...
> Assuming this is correct, may I perform the same queries on this
> corpus as on any other corpus indexed with the regular CQPweb procedure?
>
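The <text> format described above could be generated along these lines. This is a minimal sketch, assuming one token per line between the tags and attribute values escaped as XML; the helper names `text_region` and `_attr` are hypothetical, and tokens are assumed to be already tokenised and free of markup characters.

```python
from xml.sax.saxutils import escape


def _attr(value):
    """Escape a metadata value for use inside a double-quoted XML attribute."""
    return escape(value, {'"': "&quot;"})


def text_region(text_id, title, domain, tokens):
    """Wrap one document's tokens in a <text> region carrying its metadata."""
    lines = [
        f'<text id="{_attr(text_id)}" title="{_attr(title)}" domain="{_attr(domain)}">'
    ]
    lines.extend(tokens)  # one token per line
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(text_region("m00105", "title of document m00105", "medicine",
                      ["An", "example", "sentence", "."]), end="")
```

Concatenating one such region per document gives a single input file whose <text> attributes can later serve as text metadata in CQPweb.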
> Thank you very much for your help
>
> Best,
> Jorge
>
>
> 2017-08-02 9:00 GMT+02:00, Stefan Evert <stefanML at collocations.de>:
> >
> >> On 2 Aug 2017, at 02:54, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> >>
> >> At present you have a choice of 3 bodges available in command-line
> >> cwb-encode: (a) with +N, to automatically rename nested elements so you
> >> get tag1, tag2, tag3 as your attributes; (b) with no +N, to treat every
> >> new <tag> as the beginning of a new non-nested region even if the previous
> >> one is unclosed; (c) with +0, to totally ignore nested regions.
> >
> > I think Jorge wants to go with the :0 solution (not "+0", by the way), which
> > he found in the Corpus Encoding Tutorial.  The main question was how to tell
> > CQPweb to use this option when indexing the corpus.
> >
> > Jorge, if you have admin access to the Web server('s data directories), it's
> > usually better to index the CWB corpus yourself (perhaps even on your local
> > computer), and then simply copy the index files and registry entry to the
> > Web server, put them into the right directories and adjust paths
> > accordingly.
> >
> > I always use this approach, even for small corpora, and try to put all the
> > metadata into <text> tags so that the entire CQPweb installation procedure
> > runs from the pre-indexed corpus and I don't have to upload any additional
> > files via the Web interface.  Works very well for me.
> >
> > Best,
> > Stefan
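The "adjust paths accordingly" step can be sketched as a rewrite of the registry entry after the index files have been copied. This is a minimal sketch, assuming the registry file stores the data directory on a `HOME` line and the info file on an `INFO` line, as in registry files written by cwb-encode; the function `relocate_registry` and all paths are hypothetical, and quoted paths are not handled.

```python
import posixpath


def relocate_registry(registry_text, new_data_dir):
    """Point the HOME and INFO lines of a registry entry at a new data directory."""
    out = []
    for line in registry_text.splitlines():
        if line.startswith("HOME "):
            line = f"HOME {new_data_dir}"
        elif line.startswith("INFO "):
            # Keep the info file's own name, change only its directory.
            old_info = line.split(None, 1)[1].strip()
            line = f"INFO {posixpath.join(new_data_dir, posixpath.basename(old_info))}"
        out.append(line)  # all other lines are passed through unchanged
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        'NAME "Demo corpus"\n'
        "ID   demo\n"
        "HOME /home/me/cwb/data/demo\n"
        "INFO /home/me/cwb/data/demo/.info\n"
        "ATTRIBUTE word\n"
    )
    print(relocate_registry(sample, "/var/corpora/data/demo"), end="")
```

Only the two path lines change; attribute declarations in the registry entry are left as they are.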
> > _______________________________________________
> > CWB mailing list
> > CWB at sslmit.unibo.it
> > http://liste.sslmit.unibo.it/mailman/listinfo/cwb
> >
>
>
> --
> Jorge Vivaldi Palatresi
> Institut Universitari de Lingüística Aplicada
> Universitat Pompeu Fabra
> C/ Roc Boronat, 138
> 08018 Barcelona
> Espanya
>
> +34 93 542 2332
>


-- 
Jorge Vivaldi Palatresi
Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
C/ Roc Boronat, 138
08018 Barcelona
Espanya

+34 93 542 2332