[Sigwac] CLASSLA web corpora of Croatian, Serbian and Slovenian
Taja Kuzman
taja.kuzman at ijs.si
Fri Jun 23 14:10:00 CEST 2023
Dear all,

The CLASSLA Knowledge centre for South Slavic languages
(https://www.clarin.si/info/k-centre/) is delighted to announce the
release of the pilot versions (v0.1) of the CLASSLA web corpora for
Croatian (2.3 billion words), Serbian (2.4 billion words) and Slovenian
(1.9 billion words). They are available for querying via the CLARIN.SI
concordancers (https://www.clarin.si/ske/#open). The main features of
the newly released corpora, aside from their large size and recency
(crawled in 2022), are their automatic enrichment with genre information
(https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier)
and their linguistic processing with the improved CLASSLA-Stanza
annotation pipeline (https://pypi.org/project/classla/). The pilot versions of these
corpora are intended to gather valuable user feedback, while the
official release (v1.0) of the three existing corpora, along with web
corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is
scheduled for later this year.

We warmly invite you to explore our corpora and to reach out to us at
helpdesk.classla at clarin.si with any ideas for improvements. You
are also invited to read our blog post on the use of CLASSLA web corpora
via the open CLARIN.SI concordancers:
https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/

If you are interested in South Slavic resources and technologies, we
also invite you to join the CLASSLA mailing list
(https://mailman.ijs.si/mailman/listinfo/classla) and to follow the
CLARIN.SI infrastructure on Twitter (https://twitter.com/ClarinSlovenia).

Best regards,
Taja Kuzman, Nikola Ljubešić and many other CLASSLAers