[Sigwac] CLASSLA web corpora of Croatian, Serbian and Slovenian

Taja Kuzman taja.kuzman at ijs.si
Fri Jun 23 14:10:00 CEST 2023


Dear all,

The CLASSLA Knowledge centre for South Slavic languages
(https://www.clarin.si/info/k-centre/) is delighted to announce the
release of the pilot versions (v0.1) of the CLASSLA web corpora for
Croatian (2.3 billion words), Serbian (2.4 billion words) and Slovenian
(1.9 billion words). They are available for querying via the CLARIN.SI
concordancers (https://www.clarin.si/ske/#open). The main features of
the newly released corpora, aside from their large size and recency
(crawled in 2022), are their automatic enrichment with genre information
(https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier)
and their linguistic processing with the improved CLASSLA-Stanza
annotation pipeline (https://pypi.org/project/classla/). The pilot
versions of these
corpora are intended to gather valuable user feedback, while the 
official release (v1.0) of the three existing corpora, along with web 
corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is 
scheduled for later this year.


We warmly invite you to explore our corpora and to reach out to us at
helpdesk.classla at clarin.si with any ideas for improvements. You are
also invited to read our blog post on the use of CLASSLA web corpora
via the open CLARIN.SI concordancers:
https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/


If you are interested in South Slavic resources and technologies, we
also invite you to join the CLASSLA mailing list
(https://mailman.ijs.si/mailman/listinfo/classla) and to follow the
CLARIN.SI infrastructure on Twitter (https://twitter.com/ClarinSlovenia).

Best regards,

Taja Kuzman, Nikola Ljubešić and many other CLASSLAers
