<div dir="ltr">Dear Andrew,<div><br></div><div>I have a question regarding your comment </div><div><br></div><div>> - ensure that MySQL is using a location for temporary files which is on a *separate physical disk* from the location where the actual tables for the CQPweb database are stored.</div><div><br></div><div>Do you mean the location indicated by the config file variable $cqpweb_tempdir? Or would it be the value that could be given to other MySQL configuration variables like tmpdir?</div><div><br></div><div>Best,</div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div>--</div><div><div>José Manuel Martínez Martínez</div><div><a href="https://chozelinek.github.io" target="_blank">https://chozelinek.github.io</a></div></div></div></div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 25, 2020 at 1:47 PM José Manuel Martínez Martínez <<a href="mailto:chozelinek@gmail.com">chozelinek@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Stefan and Andrew,<div><br><div>Thank you for the quick feedback. Now I understand much better where I can improve the performance of these processes.</div><div>Regarding the issue of MySQL writing to and reading from disk: I'm using Amazon's cloud solutions, in particular EFS. This is like a virtual network disk that can be mounted on any virtual instance. It is in general pretty fast, but its throughput is limited over time, so I think that after sustained processing on the same disk (in the end a lot of data is transferred to and from the disk) it becomes quite slow. It is convenient because different computers can access the same indices, so I avoid data redundancy. 
But it has a performance cost.</div><div><br></div><div>Yes, I'm using CQPweb 3.2.6 because I wanted to work with a very stable version. I'm happy to test and move to a more recent one as long as it is not broken. I need CQPweb to be in production.</div><div><br></div><div>I will try to optimize the process. At some point, I'll share my experience with the community if time permits.</div><div><div><br clear="all"><div><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div>--</div><div><div>José Manuel Martínez Martínez</div><div><a href="https://chozelinek.github.io" target="_blank">https://chozelinek.github.io</a></div></div></div></div></div></div></div></div><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 16, 2020 at 4:20 AM Hardie, Andrew <<a href="mailto:a.hardie@lancaster.ac.uk" target="_blank">a.hardie@lancaster.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Just a couple of additions to Stefan's answers. <br>
<br>
I've added the capacity to specify that a particular annotation (p-attribute) should not have freq tables built, but only in 3.3 (trunk). (I guess you are on the 3.2 branch, José.) In 3.3, under "Manage annotation" there is a control called "Needs FT", set to "Y" by default. The only effect of switching it to "N" is that the attribute is absent from "Frequency lists", "Keywords" and "Collocations". <br>
<br>
As Stefan points out, the bottleneck is in MySQL, but specifically it's an issue of disk *access*. Creating the freq table requires the creation of large temporary tables to store the results of intermediate "select..." queries. For large corpora, these tables are too big to be held in RAM, so they are stored to temporary disk space. If your MySQL daemon uses the same physical disk for temp space and storage of actual tables, then its read-accesses and write-accesses will be constantly interrupting one another to read one table and write to another. This can cause MAJOR slowdown.<br>
<br>
Possible remedies - not tested by me, sorry, but theoretically useful!<br>
<br>
- ensure that MySQL is using a location for temporary files which is on a *separate physical disk* from the location where the actual tables for the CQPweb database are stored.<br>
<br>
- or, get a faster disk (RAID?) for the single location<br>
<br>
- or, get enough RAM to do it all without writing temp tables to disk<br>
<br>
- or, block all use of the server by other users during freq table setup (again, to give the MySQL server connection doing the freq table all available disk read/write bandwidth) <br>
<br>
- ALSO: creating freq tables is faster for annotations that are set to case-sensitive/accent-sensitive. So, consider setting annotations to CS/AS if you don't need case/accent insensitivity. Again, this is not available in 3.2.<br>
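Purely as an illustrative sketch of the first and third remedies above (all paths and sizes below are assumptions, not values from this thread), the relevant settings on the MySQL side would go in the server's option file:

```ini
# my.cnf -- illustrative sketch only; paths and sizes are assumptions
[mysqld]
# Actual CQPweb database tables live here ...
datadir = /var/lib/mysql
# ... while intermediate temp tables go to a separate physical disk,
# so reads of one table and writes of another don't contend
tmpdir = /mnt/scratch/mysql-tmp

# Raise the ceiling for in-memory temporary tables so that smaller
# intermediate results never spill to disk at all
tmp_table_size      = 1G
max_heap_table_size = 1G
```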
<br>
The creation of the per-text frequency data which Stefan mentions is actually normally pretty quick, compared to the freq table building, because unlike the SQL freq tables, no intermediate data is involved: it's just a filter on a pipeline from cwb-decode to cwb-encode. <br>
<br>
best<br>
<br>
Andrew.<br>
<br>
<br>
-----Original Message-----<br>
From: <a href="mailto:cwb-bounces@sslmit.unibo.it" target="_blank">cwb-bounces@sslmit.unibo.it</a> <<a href="mailto:cwb-bounces@sslmit.unibo.it" target="_blank">cwb-bounces@sslmit.unibo.it</a>> On Behalf Of Stefan Evert<br>
Sent: 14 November 2020 10:38<br>
To: CWBdev Mailing List <<a href="mailto:cwb@sslmit.unibo.it" target="_blank">cwb@sslmit.unibo.it</a>><br>
Subject: Re: [CWB] Best practices to manage big corpora in CQPweb<br>
<br>
<br>
Hi José,<br>
<br>
building frequency lists is the most time-consuming step of corpus installation in CQPweb and can be tedious, but your corpora are still in a reasonable size range (both wrt. token count and number of texts).<br>
<br>
I definitely wouldn't expect a 140M corpus to take 10 hours. One possible cause is that you're indexing 20 p-attributes, even though CQPweb won't be able to work with them anyway (except to do a keyword or collocation analysis). IIRC, CQPweb indexes unique combinations across all p-attributes, so this is going to be a huge and very expensive database.<br>
<br>
If you only need them for CQP queries, a work-around could be to remove them from the registry file while installing the corpus in CQPweb (so CQPweb won't know about them) and then put them back in later (so they're available for CQP queries).<br>
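For illustration only (the corpus and attribute names below are hypothetical), the work-around amounts to commenting out the unwanted ATTRIBUTE lines in the CWB registry file before installation, and restoring them once the corpus is installed:

```
## CWB registry file -- hypothetical corpus and attribute names
ID   mycorpus
HOME /corpora/data/mycorpus

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE pos
# ATTRIBUTE is_title      <- commented out while installing in CQPweb;
# ATTRIBUTE is_stop          uncomment afterwards for CQP queries
```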
<br>
There are two bottlenecks in building frequency lists:<br>
<br>
a) Creating per-text frequency lists is done in PHP and uses only a single thread. This is something you can't get around.<br>
<br>
b) Indexing frequency tables in MySQL can take a very long time (I always feel that MySQL could do better there …). If this is your key bottleneck, you should try to optimise the configuration of your MySQL server, e.g. making it use more threads. Are you sure that the MySQL data store is on a fast hard disk?<br>
<br>
Can you watch "top" during the indexing and check which programs are taking up so much time?<br>
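A minimal sketch of that kind of monitoring on Linux (assuming the standard procps tools are installed):

```shell
# One non-interactive snapshot of the busiest processes, sorted by CPU;
# look for php, mysqld, cwb-decode or cwb-encode near the top
top -b -n 1 | head -n 15

# Two one-second samples of system-wide stats: a high "wa" (I/O wait)
# column suggests the disk, not the CPU, is the bottleneck
vmstat 1 2
```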
<br>
Best,<br>
Stefan<br>
<br>
<br>
> On 11 Nov 2020, at 10:06, José Manuel Martínez Martínez <<a href="mailto:chozelinek@gmail.com" target="_blank">chozelinek@gmail.com</a>> wrote:<br>
><br>
> I'm currently working with several corpora in CQPweb which are fairly big (they will be below the 2.1 billion limit, though). The corpora will contain between 7,000 and 30,000 texts, and their typical size will range from 500M to 1,500M tokens.<br>
><br>
> My server (4 cores, 16GB RAM) is only serving CQPweb (no users for now), indexing a corpus from the command line and running a python script.<br>
><br>
> I've seen that the process of creating the frequency lists with offline-freqlists.php is my current bottleneck. I think the process uses at most 2 cores? With a test corpus made up of 2,300 texts and 140M tokens, it took my server around 10 hours. My next one will be a corpus of around 8,000 texts and 500M tokens. Could this take up to 40 hours to be ready to be used in CQPweb?<br>
><br>
> How can I optimize the process? How do you usually do it?<br>
> Any tips and tricks on how to handle these very big corpora will be much appreciated.<br>
><br>
> I think that the part that took the longest was when it started generating the frequency lists for every positional attribute. If this assumption is right, I could skip some of the positional attributes (I have twenty of them; eleven are boolean, with True/False values only; the interesting ones are word, lemma, norm, pos, lower, shape, tag, dep, ent_type...).<br>
><br>
<br>
_______________________________________________<br>
CWB mailing list<br>
<a href="mailto:CWB@sslmit.unibo.it" target="_blank">CWB@sslmit.unibo.it</a><br>
<a href="http://liste.sslmit.unibo.it/mailman/listinfo/cwb" rel="noreferrer" target="_blank">http://liste.sslmit.unibo.it/mailman/listinfo/cwb</a><br>
</blockquote></div>
</blockquote></div>