[CWB] unicode problems with Greek and OCS

Serge Heiden slh at ens-lyon.fr
Wed Mar 11 08:15:30 CET 2015


Dear Gabriele,

Bastien Kindt is known to lemmatize, and probably PoS-tag, Ancient Greek
texts (see his contact details below).
He may be willing to lemmatize texts you send him and return them free
of charge.
We have referred at least three TXM users to him, but I don't know how
that turned out.
Please get in touch if you would like to discuss this; we are looking
for a community solution to this need.

All the best,
Serge

---
Bastien KINDT
Projet de Recherche en Lexicologie Grecque
Institut orientaliste
Collège Érasme
Place Blaise Pascal, 1
B-1348 Louvain-la-Neuve
BELGIQUE
tel.: 00 32 10 47 44 16
bastien.kindt at student.uclouvain.be
bastien.kindt at brepols.net
http://tpg.fltr.ucl.ac.be
----


On 11/03/2015 05:58, Gabriele Brandolini wrote:
>
 > Dear Ruprecht, Andrew and Stefan
 >
 > I have been following your thread about encoding Ancient Greek texts.
 >
 > I would also like to encode texts in this language with CWB, especially
 > old texts of the Fathers of the Church. But I do not yet have a PoS
 > tagger for this language. We have only just planned to train TreeTagger
 > for it, and as far as I know it is not ready yet.
 >
 > Do you, Ruprecht, know if one is available?
 >
 > About the list of Greek words in your email of 14:31: I noticed that
 > most of them are incorrect, because the initial letter (alpha, eta or
 > epsilon) has been dropped together with its accent and breathing. I
 > don't know whether this has anything to do with the encoding error
 > messages you get; I just wanted to point it out in case it helps.
 >
 > Good work and good luck!
 >
 > Gabriele
 >
 > On 10 Mar 2015 14:31, "Ruprecht von Waldenfels"
 > <ruprecht.waldenfels at gmx.net> wrote:
 >
 > Dear List,
 >
 > so my second problem, this time with Ancient Greek. I cannot easily
 > reproduce this with a 2-line corpus, because I don't know where the
 > culprit is. I am posting the CWB output instead; maybe this is already
 > enough.
 >
 > What I am trying to do: I am trying to align three documents, one
 > Greek and two Slavic texts, using the aligVerse structural element.
 > The two Slavic ones align fine; the Greek gives me the following
 > error:
 >
 > rvw@rvw-Latitude-E6410:/data/PROIEL$ /opt/CWBUTF8/cwb/utils/cwb-align -r /data/PROIEL/Registry -S aligVerse -o out.align NTESTAMENT_GR NTESTAMENT_MN aligVerse
 > OPENING NTESTAMENT_GR [147613 tokens, 7497 <aligVerse> regions]
 > OPENING NTESTAMENT_MN [71935 tokens, 7497 <aligVerse> regions]
 > OPENING prealignment [NTESTAMENT_GR.aligVerse: 7497 regions, NTESTAMENT_MN.aligVerse: 7497 regions]
 > LEXICON SIZE: 18085 / 10132
 > FEATURE: character count, weight=1 ... [1]
 > FEATURE: Shared words, threshold=40.0%, weight=50 ... [0]
 > FEATURE: 3-grams, weight=3 ...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > [21952]
 > FEATURE: 4-grams, weight=4 ...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > [614656]
 > [636609 features allocated]
 > [520402 entries in source text feature map]
 > [246622 entries in target text feature map]
 > PASS 2: Setting character count weight.
 > PASS 2: Processing shared words (th=40.0%).
 > PASS 2: Processing 3-grams.
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > PASS 2: Processing 4-grams.
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > CL: major error, invalid UTF8 string passed to cl_string_canonical...
 > PASS 2: Creating character counts.
 > [checking pointers]
 > ERROR: fcount1[1387]=24 r->w2f1[1388]-r->w2f1[1387]=22 w=``ἥξουσιν''
 > ERROR: fcount1[1388]=50 r->w2f1[1389]-r->w2f1[1388]=52 w=``ἀνακλιθήσονται''
 > ERROR: fcount1[1783]=24 r->w2f1[1784]-r->w2f1[1783]=22 w=``θάνατον''
 > ERROR: fcount1[1784]=50 r->w2f1[1785]-r->w2f1[1784]=52 w=``ἐπαναστήσονται''
 > ERROR: fcount1[3037]=20 r->w2f1[3038]-r->w2f1[3037]=16 w=``δυνατά''
 > ERROR: fcount1[3039]=48 r->w2f1[3040]-r->w2f1[3039]=52 w=``ἀκολουθήσαντές''
 > ERROR: fcount1[3784]=20 r->w2f1[3785]-r->w2f1[3784]=18 w=``ἤλθατε''
 > ERROR: fcount1[3785]=50 r->w2f1[3786]-r->w2f1[3785]=52 w=``ἀποκριθήσονται''
 > ERROR: fcount1[4459]=32 r->w2f1[4460]-r->w2f1[4459]=30 w=``ἐπιθυμίαι''
 > ERROR: fcount1[4460]=50 r->w2f1[4461]-r->w2f1[4460]=52 w=``εἰσπορευόμεναι''
 > ERROR: fcount1[4998]=20 r->w2f1[4999]-r->w2f1[4998]=18 w=``Ἤρξατο''
 > ERROR: fcount1[4999]=46 r->w2f1[5000]-r->w2f1[4999]=48 w=``ἠκολουθήκαμέν''
 > ERROR: fcount1[5038]=36 r->w2f1[5039]-r->w2f1[5038]=34 w=``ἐγγίζουσιν''
 > ERROR: fcount1[5039]=50 r->w2f1[5040]-r->w2f1[5039]=52 w=``εἰσπορευόμενοι''
 > ERROR: fcount1[7009]=32 r->w2f1[7010]-r->w2f1[7009]=30 w=``πλουσίους''
 > ERROR: fcount1[7010]=46 r->w2f1[7011]-r->w2f1[7010]=48 w=``ἀντικαλέσωσίν''
 > ERROR: fcount1[8582]=20 r->w2f1[8583]-r->w2f1[8582]=18 w=``ἐξάγει''
 > ERROR: fcount1[8583]=50 r->w2f1[8584]-r->w2f1[8583]=52 w=``ἀκολουθήσουσιν''
 > ERROR: fcount1[9942]=20 r->w2f1[9943]-r->w2f1[9942]=24 w=``ἅρματι''
 > ERROR: fcount1[9943]=56 r->w2f1[9944]-r->w2f1[9943]=52 w=``ἀναγινώσκοντος''
 > ERROR: fcount1[10119]=48 r->w2f1[10120]-r->w2f1[10119]=44 w=``μεταπέμψασθαί''
 > ERROR: fcount1[10120]=48 r->w2f1[10121]-r->w2f1[10120]=52 w=``εἰσκαλεσάμενος''
 > ERROR: fcount1[10553]=28 r->w2f1[10554]-r->w2f1[10553]=24 w=``ἐτάραξαν''
 > ERROR: fcount1[10554]=48 r->w2f1[10555]-r->w2f1[10554]=52 w=``ἀνασκευάζοντες''
 > ERROR: fcount1[10622]=24 r->w2f1[10623]-r->w2f1[10622]=20 w=``Τρῳάδος''
 > ERROR: fcount1[10623]=48 r->w2f1[10624]-r->w2f1[10623]=52 w=``εὐθυδρομήσαμεν''
 > ERROR: fcount1[11159]=48 r->w2f1[11160]-r->w2f1[11159]=44 w=``ἀποσπασθέντας''
 > ERROR: fcount1[11160]=52 r->w2f1[11161]-r->w2f1[11160]=56 w=``εὐθυδρομήσαντες''
 > ERROR: fcount1[12054]=20 r->w2f1[12055]-r->w2f1[12054]=18 w=``πλάνης''
 > ERROR: fcount1[12055]=50 r->w2f1[12056]-r->w2f1[12055]=52 w=``ἀπολαμβάνοντες''
 > ERROR: fcount1[12422]=12 r->w2f1[12423]-r->w2f1[12422]=10 w=``νοός''
 > ERROR: fcount1[12423]=50 r->w2f1[12424]-r->w2f1[12423]=52 w=``αἰχμαλωτίζοντά''
 > ERROR: fcount1[14334]=40 r->w2f1[14335]-r->w2f1[14334]=38 w=``ἐπαιρόμενον''
 > ERROR: fcount1[14335]=54 r->w2f1[14336]-r->w2f1[14335]=56 w=``αἰχμαλωτίζοντες''
 > ERROR: fcount1[14641]=40 r->w2f1[14642]-r->w2f1[14641]=38 w=``κεκυρωμένην''
 > ERROR: fcount1[14642]=50 r->w2f1[14643]-r->w2f1[14642]=52 w=``ἐπιδιατάσσεται''
 > ERROR: fcount1[14878]=32 r->w2f1[14879]-r->w2f1[14878]=34 w=``προέγραψα''
 > ERROR: fcount1[14879]=54 r->w2f1[14880]-r->w2f1[14879]=52 w=``ἀναγινώσκοντες''
 > ERROR: fcount1[15698]=36 r->w2f1[15699]-r->w2f1[15698]=34 w=``ἐπιστεύθην''
 > ERROR: fcount1[15699]=46 r->w2f1[15700]-r->w2f1[15699]=48 w=``ἐνδυναμώσαντί''
 > ERROR: fcount1[16170]=32 r->w2f1[16171]-r->w2f1[16170]=30 w=``ἀνέξονται''
 > ERROR: fcount1[16171]=50 r->w2f1[16172]-r->w2f1[16171]=52 w=``ἐπισωρεύσουσιν''
 > ERROR: fcount1[16815]=32 r->w2f1[16816]-r->w2f1[16815]=30 w=``ἐνυβρίσας''
 > ERROR: fcount1[16816]=50 r->w2f1[16817]-r->w2f1[16816]=52 w=``Ἀναμιμνῄσκεσθε''
 > ERROR: fcount1[17621]=40 r->w2f1[17622]-r->w2f1[17621]=42 w=``ἀπεσταλμένα''
 > ERROR: fcount1[17622]=56 r->w2f1[17623]-r->w2f1[17622]=54 w=``εἴκοσι τέσσαρες''
 > ERROR: fcount1[17793]=28 r->w2f1[17794]-r->w2f1[17793]=29 w=``μάρτυσίν''
 > ERROR: fcount1[17794]=93 r->w2f1[17795]-r->w2f1[17794]=92 w=``χιλίας διακοσίας ἑξήκοντα''
 > ERROR: fcount1[17937]=24 r->w2f1[17938]-r->w2f1[17937]=26 w=``χαλινῶν''
 > ERROR: fcount1[17938]=60 r->w2f1[17939]-r->w2f1[17938]=58 w=``χιλίων ἑξακοσίων''
 > ERROR: fcount1[17967]=36 r->w2f1[17968]-r->w2f1[17967]=34 w=``καυματίσαι''
 > ERROR: fcount1[17968]=50 r->w2f1[17969]-r->w2f1[17968]=52 w=``ἐκαυματίσθησαν''
 >
 >
 > Again, I would be very thankful for help.
 >
 > Best! Ruprecht
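
In the log above, the `invalid UTF8 string` messages appear only in the 3-gram and 4-gram passes. One possible explanation, which nothing in this thread confirms, is that the n-grams are cut at byte boundaries rather than character boundaries. The sketch below is plain Python, not CWB code, and the helper names `byte_ngrams` and `is_valid_utf8` are invented for the illustration. It only shows that polytonic Greek mixes 2-byte and 3-byte UTF-8 sequences, so fixed-width byte windows routinely split a character.

```python
def byte_ngrams(word: str, n: int) -> list[bytes]:
    """Return every window of n consecutive bytes of word's UTF-8 encoding."""
    data = word.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def is_valid_utf8(chunk: bytes) -> bool:
    """True if chunk decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The first word reported in the log, written with escapes so that the
# code points are unambiguous (precomposed eta with dasia and oxia first).
word = "\u1f25\u03be\u03bf\u03c5\u03c3\u03b9\u03bd"

# Bytes per character: Greek Extended letters (with breathing/accent)
# take 3 bytes, basic Greek letters take 2.
widths = [(ch, len(ch.encode("utf-8"))) for ch in word]

grams = byte_ngrams(word, 3)
broken = [g for g in grams if not is_valid_utf8(g)]

print(widths)
print(f"{len(broken)} of {len(grams)} byte 3-grams are not valid UTF-8")
```

If the n-gram features were built this way, most windows would be rejected by a strict UTF-8 check, which would be consistent with the messages above; whether cwb-align actually works like this would need to be checked in its source.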
 >
 >
 >
 >
 >
 > On 10.03.2015 at 12:07, Ruprecht von Waldenfels wrote:
 >> Hi Andrew,
 >>
 >> YES! This does solve the problem. I was thinking this setting would
 >> only concern tokens, not the lemma attribute, but now I understand
 >> that this was a wrong assumption. Thank you! I will now look at the
 >> other problem, because that, as it turns out, is unrelated.
 >>
 >> Thanks A LOT!
 >> Ruprecht
 >>
 >> On 10.03.2015 at 12:02, Hardie, Andrew wrote:
 >>>
 >>> Is the context size measured in characters? If so, that would
 >>> explain the problem, since “characters” = bytes still.
 >>>
 >>>
 >>>
 >>> If changing the context width to a given number of words fixes
 >>> the issue, then that is the solution.
 >>>
 >>>
 >>>
 >>> I have been working on a patch to fix this, but have not
 >>> completed it yet.
 >>>
 >>>
 >>>
 >>> Andrew.
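
Andrew's point about a width given in "characters" but counted in bytes can be illustrated outside CWB. The sketch below is plain Python and assumes a context that is cut after a fixed number of bytes; the function names, the sample string and the width of 5 are all made up for the example and say nothing about CQP internals. It contrasts the naive cut with one that backs up to the previous character boundary.

```python
def cut_bytes(text: str, width: int) -> bytes:
    """Naive cut: keep the first `width` bytes of the UTF-8 encoding."""
    return text.encode("utf-8")[:width]


def cut_bytes_safely(text: str, width: int) -> bytes:
    """Keep at most `width` bytes, backing up so no character is split."""
    data = text.encode("utf-8")
    if len(data) <= width:
        return data
    end = width
    # UTF-8 continuation bytes look like 10xxxxxx. If the first byte after
    # the cut is one, the cut falls inside a character, so move it left.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# Arbitrary five-letter Cyrillic string; every letter is 2 bytes in UTF-8.
sample = "\u0441\u0432\u0463\u0442\u044a"

naive = cut_bytes(sample, 5)        # ends in the middle of the third letter
safe = cut_bytes_safely(sample, 5)  # stops after the second letter

try:
    naive.decode("utf-8")
    print("naive cut decodes cleanly")
except UnicodeDecodeError as err:
    print("naive cut is not valid UTF-8:", err.reason)

print("safe cut:", safe.decode("utf-8"))
```

Measuring the context in words, as suggested above, avoids the question entirely because token boundaries never fall inside a character.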
 >>>
 >>>
 >>>
 >>> From: cwb-bounces at sslmit.unibo.it On Behalf Of Ruprecht von Waldenfels
 >>> Sent: 10 March 2015 09:54
 >>> To: cwb at sslmit.unibo.it
 >>> Subject: [CWB] unicode problems with Greek and OCS
 >>>
 >>>
 >>>
 >>> Dear List,
 >>>
 >>> I am using CWB 3.4.8 on 64 bit Ubuntu 14.10. After encoding a
 >>> text in Old Church Slavonic, I get invalid UTF-8 character
 >>> errors; I seem to get them only in sgml mode (I also get them
 >>> during alignment with the Ancient Greek translation source, which
 >>> might be a related problem, but I am not sure.)
 >>>
 >>> In order to pinpoint the problem with the Old Church Slavonic
 >>> text, I have reduced the text in question to two bible verses.
 >>> The text can be found here: www.parasolcorpus.org/test.txt
 >>> <http://www.parasolcorpus.org/test.txt>
 >>>
 >>> I encode the corpus with the following commands:
 >>>
 >>>   /opt/CWBUTF8/cwb/utils/cwb-encode -d Data/ntestament_tt -f test.txt \
 >>>     -R /data/PROIEL/Registry/ntestament_tt -c utf8 -xsB \
 >>>     -P lemma -P id -P alig -P pos -P tag -S aligVerse:0
 >>>   /opt/CWBUTF8/cwb/utils/cwb-makeall -r /data/PROIEL/Registry NTESTAMENT_TT
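
When errors like these show up, one quick way to rule out the input file is to check it for invalid UTF-8 before encoding. The following is a small Python sketch, not part of CWB; `find_invalid_utf8_lines` is a hypothetical helper, and the demo runs on in-memory bytes rather than on the actual `test.txt`.

```python
import io
from typing import BinaryIO, Iterator


def find_invalid_utf8_lines(stream: BinaryIO) -> Iterator[tuple[int, bytes]]:
    """Yield (line_number, raw_bytes) for every line that is not valid UTF-8."""
    for number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield number, raw.rstrip(b"\n")


# Demo data: the second line ends in a lone lead byte (0xD1), which is
# what a character cut in half looks like.
demo = io.BytesIO(
    "\u0441\u043b\u043e\u0432\u043e\tN\n".encode("utf-8") + b"bad\xd1\n" + b"ok\n"
)
bad_lines = list(find_invalid_utf8_lines(demo))
print(bad_lines)
```

For a real file, the same function can be called on `open("test.txt", "rb")`. An empty result would suggest that the input is fine and that the damage happens later, in processing or display.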
 >>>
 >>> There is no problem in text mode:
 >>>
 >>>
 >>>
 >>> However, in sgml mode, some lemmas get truncated and do not
 >>> contain valid utf8 anymore. For example, the lemma of "с҃вщаѩи"
 >>> is such a token. This problem does NOT appear if I search for
 >>> this token itself, it ONLY and consistently appears if I search
 >>> for a different token and the problematic token is in the result
 >>> set:
 >>>
 >>>
 >>> To sum up: I get the problem only if I search for a neighboring
 >>> token in sgml mode. I don't get it if I search for the token
 >>> itself, and I don't get it in text mode. I have reduced the
 >>> problem to a 50-token text, and the problem persists.
 >>>
 >>> Any help would be greatly appreciated! Best, Ruprecht
 >>>
 >>>
 >>>
 >>>
 >>>
 >>> _______________________________________________ CWB mailing list
 >>> CWB at sslmit.unibo.it <mailto:CWB at sslmit.unibo.it>
 >>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
