[CWB] unicode problems with Greek and OCS

Wed Mar 11 12:40:43 CET 2015

Dear Gabriele,

there is a web service that will do morphological analysis and 
lemmatization for Greek:
http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html

However, it does not disambiguate homonyms. One way around this is to
encode all possibilities in the corpus; that is what I did (for a
different project). This web service is the only resource I know of.
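
To illustrate what encoding all possibilities could look like: one option
is CWB's feature-set notation, where the alternatives are written between
bars (|a|b|) and, as far as I know, the attribute is declared with a
trailing slash (-P lemma/). The Python sketch below builds such a value
from a list of candidate lemmas. The function name and the placeholder
lemmas are hypothetical, and this is not necessarily the format used in
the project mentioned above.

```python
def to_feature_set(candidates):
    """Format candidate lemmas as a CWB-style feature set, e.g. |a|b|.

    Duplicates and empty strings are dropped; values are sorted so that the
    same set of candidates always yields the same string.
    """
    values = sorted({c for c in candidates if c})
    if not values:
        return "|"  # a single bar denotes the empty set
    return "|" + "|".join(values) + "|"


# Hypothetical analyser output for one ambiguous token.
print(to_feature_set(["lemma_b", "lemma_a", "lemma_a"]))
```

A query can then match any one of the stored candidates instead of
relying on a single, possibly wrong, disambiguation.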

I think the problem you saw with the accented characters might be a
rendering issue on your system: here, things looked fine, and Stefan
and Andrew checked that. But thanks for pointing that out!
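
One way to tell a rendering problem from a data problem is to look at the
code points rather than at the glyphs. The Python sketch below is only an
illustration: the helper names are made up, and the word is written with
escapes on the assumption that it is stored in precomposed (NFC) form.

```python
import unicodedata


def describe_code_points(word):
    """Return (code point, Unicode name) for every character in `word`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in word]


def base_of_first_letter(word):
    """Return the first letter without accent or breathing mark (via NFD)."""
    return unicodedata.normalize("NFD", word)[0]


# First word from the cwb-align error list, assumed to be precomposed.
word = "\u1f25\u03be\u03bf\u03c5\u03c3\u03b9\u03bd"

for code_point, name in describe_code_points(word):
    print(code_point, name)
print("base letter:", base_of_first_letter(word))
```

If the first code point is a Greek letter with its diacritics, the initial
letter is present in the data and only the display has dropped it.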

Best,
Ruprecht

On 11.03.2015 at 05:58, Gabriele Brandolini wrote:
>
> Dear Ruprecht, Andrew and Stefan
>
> I followed your thread about encoding Ancient Greek texts.
>
> I would also like to encode texts in this language with CWB, especially
> old texts of the Church Fathers. But I do not yet have a PoS tagger
> for this language. We have only just planned to train TreeTagger for
> it, and as far as I know it isn't ready yet.
>
> Do you, Ruprecht, know if there is any available?
>
> About the list of Greek words in your email of 14:31: I noticed that
> most of them are incorrect, because the initial letter (alpha, eta or
> epsilon) has been dropped together with its accent and breathing mark.
> I don't know whether this has anything to do with the encoding error
> messages you get.
> I just wanted to point it out; maybe it can be of some help.
>
> Good work and good luck!
>
> Gabriele
>
> On 10 Mar 2015 at 14:31, "Ruprecht von Waldenfels"
> <ruprecht.waldenfels at gmx.net> wrote:
>
>     Dear List,
>     here is my second problem, this time with Ancient Greek. I cannot
>     easily reproduce this with a 2-line corpus, because I don't know
>     where the culprit is. I am posting the CWB output instead; maybe
>     this is already enough.
>
>     What I am trying to do: I am trying to align three documents, one
>     Greek and two Slavic texts, using the aligVerse structural
>     element. The two Slavic ones align fine, the Greek gives me the
>     following error:
>     rvw at rvw-Latitude-E6410:/data/PROIEL$
>     /opt/CWBUTF8/cwb/utils/cwb-align -r /data/PROIEL/Registry -S
>     aligVerse -o out.align NTESTAMENT_GR NTESTAMENT_MN aligVerse
>     OPENING NTESTAMENT_GR [147613 tokens, 7497 <aligVerse> regions]
>     OPENING NTESTAMENT_MN [71935 tokens, 7497 <aligVerse> regions]
>     OPENING prealignment [NTESTAMENT_GR.aligVerse: 7497 regions,
>     NTESTAMENT_MN.aligVerse: 7497 regions]
>     LEXICON SIZE: 18085 / 10132
>     FEATURE: character count, weight=1 ... [1]
>     FEATURE: Shared words, threshold=40.0%, weight=50 ... [0]
>     FEATURE: 3-grams, weight=3 ... CL: major error, invalid UTF8
>     string passed to cl_string_canonical...
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     [21952]
>     FEATURE: 4-grams, weight=4 ... CL: major error, invalid UTF8
>     string passed to cl_string_canonical...
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     [614656]
>     [636609 features allocated]
>     [520402 entries in source text feature map]
>     [246622 entries in target text feature map]
>     PASS 2: Setting character count weight.
>     PASS 2: Processing shared words (th=40.0%).
>     PASS 2: Processing 3-grams.
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     PASS 2: Processing 4-grams.
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     CL: major error, invalid UTF8 string passed to cl_string_canonical...
>     PASS 2: Creating character counts.
>     [checking pointers]
>     ERROR: fcount1[1387]=24 r->w2f1[1388]-r->w2f1[1387]=22 w=``ἥξουσιν''
>     ERROR: fcount1[1388]=50 r->w2f1[1389]-r->w2f1[1388]=52
>     w=``ἀνακλιθήσονται''
>     ERROR: fcount1[1783]=24 r->w2f1[1784]-r->w2f1[1783]=22 w=``θάνατον''
>     ERROR: fcount1[1784]=50 r->w2f1[1785]-r->w2f1[1784]=52
>     w=``ἐπαναστήσονται''
>     ERROR: fcount1[3037]=20 r->w2f1[3038]-r->w2f1[3037]=16 w=``δυνατά''
>     ERROR: fcount1[3039]=48 r->w2f1[3040]-r->w2f1[3039]=52
>     w=``ἀκολουθήσαντές''
>     ERROR: fcount1[3784]=20 r->w2f1[3785]-r->w2f1[3784]=18 w=``ἤλθατε''
>     ERROR: fcount1[3785]=50 r->w2f1[3786]-r->w2f1[3785]=52
>     w=``ἀποκριθήσονται''
>     ERROR: fcount1[4459]=32 r->w2f1[4460]-r->w2f1[4459]=30 w=``ἐπιθυμίαι''
>     ERROR: fcount1[4460]=50 r->w2f1[4461]-r->w2f1[4460]=52
>     w=``εἰσπορευόμεναι''
>     ERROR: fcount1[4998]=20 r->w2f1[4999]-r->w2f1[4998]=18 w=``Ἤρξατο''
>     ERROR: fcount1[4999]=46 r->w2f1[5000]-r->w2f1[4999]=48
>     w=``ἠκολουθήκαμέν''
>     ERROR: fcount1[5038]=36 r->w2f1[5039]-r->w2f1[5038]=34
>     w=``ἐγγίζουσιν''
>     ERROR: fcount1[5039]=50 r->w2f1[5040]-r->w2f1[5039]=52
>     w=``εἰσπορευόμενοι''
>     ERROR: fcount1[7009]=32 r->w2f1[7010]-r->w2f1[7009]=30 w=``πλουσίους''
>     ERROR: fcount1[7010]=46 r->w2f1[7011]-r->w2f1[7010]=48
>     w=``ἀντικαλέσωσίν''
>     ERROR: fcount1[8582]=20 r->w2f1[8583]-r->w2f1[8582]=18 w=``ἐξάγει''
>     ERROR: fcount1[8583]=50 r->w2f1[8584]-r->w2f1[8583]=52
>     w=``ἀκολουθήσουσιν''
>     ERROR: fcount1[9942]=20 r->w2f1[9943]-r->w2f1[9942]=24 w=``ἅρματι''
>     ERROR: fcount1[9943]=56 r->w2f1[9944]-r->w2f1[9943]=52
>     w=``ἀναγινώσκοντος''
>     ERROR: fcount1[10119]=48 r->w2f1[10120]-r->w2f1[10119]=44
>     w=``μεταπέμψασθαί''
>     ERROR: fcount1[10120]=48 r->w2f1[10121]-r->w2f1[10120]=52
>     w=``εἰσκαλεσάμενος''
>     ERROR: fcount1[10553]=28 r->w2f1[10554]-r->w2f1[10553]=24
>     w=``ἐτάραξαν''
>     ERROR: fcount1[10554]=48 r->w2f1[10555]-r->w2f1[10554]=52
>     w=``ἀνασκευάζοντες''
>     ERROR: fcount1[10622]=24 r->w2f1[10623]-r->w2f1[10622]=20
>     w=``Τρῳάδος''
>     ERROR: fcount1[10623]=48 r->w2f1[10624]-r->w2f1[10623]=52
>     w=``εὐθυδρομήσαμεν''
>     ERROR: fcount1[11159]=48 r->w2f1[11160]-r->w2f1[11159]=44
>     w=``ἀποσπασθέντας''
>     ERROR: fcount1[11160]=52 r->w2f1[11161]-r->w2f1[11160]=56
>     w=``εὐθυδρομήσαντες''
>     ERROR: fcount1[12054]=20 r->w2f1[12055]-r->w2f1[12054]=18 w=``πλάνης''
>     ERROR: fcount1[12055]=50 r->w2f1[12056]-r->w2f1[12055]=52
>     w=``ἀπολαμβάνοντες''
>     ERROR: fcount1[12422]=12 r->w2f1[12423]-r->w2f1[12422]=10 w=``νοός''
>     ERROR: fcount1[12423]=50 r->w2f1[12424]-r->w2f1[12423]=52
>     w=``αἰχμαλωτίζοντά''
>     ERROR: fcount1[14334]=40 r->w2f1[14335]-r->w2f1[14334]=38
>     w=``ἐπαιρόμενον''
>     ERROR: fcount1[14335]=54 r->w2f1[14336]-r->w2f1[14335]=56
>     w=``αἰχμαλωτίζοντες''
>     ERROR: fcount1[14641]=40 r->w2f1[14642]-r->w2f1[14641]=38
>     w=``κεκυρωμένην''
>     ERROR: fcount1[14642]=50 r->w2f1[14643]-r->w2f1[14642]=52
>     w=``ἐπιδιατάσσεται''
>     ERROR: fcount1[14878]=32 r->w2f1[14879]-r->w2f1[14878]=34
>     w=``προέγραψα''
>     ERROR: fcount1[14879]=54 r->w2f1[14880]-r->w2f1[14879]=52
>     w=``ἀναγινώσκοντες''
>     ERROR: fcount1[15698]=36 r->w2f1[15699]-r->w2f1[15698]=34
>     w=``ἐπιστεύθην''
>     ERROR: fcount1[15699]=46 r->w2f1[15700]-r->w2f1[15699]=48
>     w=``ἐνδυναμώσαντί''
>     ERROR: fcount1[16170]=32 r->w2f1[16171]-r->w2f1[16170]=30
>     w=``ἀνέξονται''
>     ERROR: fcount1[16171]=50 r->w2f1[16172]-r->w2f1[16171]=52
>     w=``ἐπισωρεύσουσιν''
>     ERROR: fcount1[16815]=32 r->w2f1[16816]-r->w2f1[16815]=30
>     w=``ἐνυβρίσας''
>     ERROR: fcount1[16816]=50 r->w2f1[16817]-r->w2f1[16816]=52
>     w=``Ἀναμιμνῄσκεσθε''
>     ERROR: fcount1[17621]=40 r->w2f1[17622]-r->w2f1[17621]=42
>     w=``ἀπεσταλμένα''
>     ERROR: fcount1[17622]=56 r->w2f1[17623]-r->w2f1[17622]=54
>     w=``εἴκοσι τέσσαρες''
>     ERROR: fcount1[17793]=28 r->w2f1[17794]-r->w2f1[17793]=29
>     w=``μάρτυσίν''
>     ERROR: fcount1[17794]=93 r->w2f1[17795]-r->w2f1[17794]=92
>     w=``χιλίας διακοσίας ἑξήκοντα''
>     ERROR: fcount1[17937]=24 r->w2f1[17938]-r->w2f1[17937]=26
>     w=``χαλινῶν''
>     ERROR: fcount1[17938]=60 r->w2f1[17939]-r->w2f1[17938]=58
>     w=``χιλίων ἑξακοσίων''
>     ERROR: fcount1[17967]=36 r->w2f1[17968]-r->w2f1[17967]=34
>     w=``καυματίσαι''
>     ERROR: fcount1[17968]=50 r->w2f1[17969]-r->w2f1[17968]=52
>     w=``ἐκαυματίσθησαν''
>
>
>     Again, I would be very thankful for help.
>
>     Best!
>     Ruprecht
>
>
>
>
>
>     On 10.03.2015 at 12:07, Ruprecht von Waldenfels wrote:
>>     Hi Andrew,
>>     YES! This does solve the problem. I was thinking this setting
>>     would only concern tokens, not the lemma attribute, but now I
>>     understand that this was a wrong assumption. Thank you!
>>     I will now look at the other problem - because that, as it turns
>>     out, is unrelated.
>>     Thanks A LOT!
>>     Ruprecht
>>     On 10.03.2015 at 12:02, Hardie, Andrew wrote:
>>>
>>>     Is the context size measured in characters? If so, that would
>>>     explain the problem, since “characters” = bytes still.
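
The byte-versus-character point can be illustrated outside CWB. The Python
sketch below is not CWB code, and the function names are made up: it cuts
a word from the error list once by bytes and once by characters, and
checks whether the result is still valid UTF-8.

```python
def truncate_bytes(text, width):
    """Naive cut: keep the first `width` bytes of the UTF-8 encoding."""
    return text.encode("utf-8")[:width]


def truncate_chars(text, width):
    """Character-aware cut: keep the first `width` code points, then encode."""
    return text[:width].encode("utf-8")


def is_valid_utf8(data):
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 7 characters, but 15 bytes in UTF-8 (assuming the precomposed form).
word = "\u1f25\u03be\u03bf\u03c5\u03c3\u03b9\u03bd"

print(len(word), len(word.encode("utf-8")))
print(is_valid_utf8(truncate_bytes(word, 4)))  # cut falls inside a 2-byte letter
print(is_valid_utf8(truncate_chars(word, 4)))
```

A width counted in bytes can end in the middle of a multi-byte character,
which would produce the kind of truncated, invalid string described in the
original report.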
>>>
>>>     If changing the context width to a given number of words fixes
>>>     the issue, then that is the solution.
>>>
>>>     I have been working on a patch to fix this, but have not
>>>     completed it yet.
>>>
>>>     Andrew.
>>>
>>>     *From:* cwb-bounces at sslmit.unibo.it *On Behalf Of* Ruprecht von
>>>     Waldenfels
>>>     *Sent:* 10 March 2015 09:54
>>>     *To:* cwb at sslmit.unibo.it
>>>     *Subject:* [CWB] unicode problems with Greek and OCS
>>>
>>>     Dear List,
>>>
>>>     I am using CWB 3.4.8 on 64 bit Ubuntu 14.10.
>>>     After encoding a text in Old Church Slavonic, I get invalid
>>>     UTF-8 character errors; I seem to get them only in sgml mode (I
>>>     also get them during alignment with the Ancient Greek
>>>     translation source, which might be a related problem, but I am
>>>     not sure.)
>>>
>>>     In order to pinpoint the problem with the Old Church Slavonic
>>>     text, I have reduced the text in question to two bible verses.
>>>     The text can be found here: http://www.parasolcorpus.org/test.txt
>>>
>>>     I encode the corpus with the following commands:
>>>     /opt/CWBUTF8/cwb/utils/cwb-encode -d Data/ntestament_tt -f
>>>     test.txt -R /data/PROIEL/Registry/ntestament_tt -c utf8 -xsB -P
>>>     lemma -P id -P alig -P pos -P tag -S aligVerse:0
>>>     /opt/CWBUTF8/cwb/utils/cwb-makeall -r /data/PROIEL/Registry
>>>     NTESTAMENT_TT
>>>
>>>     There is no problem in text mode:
>>>
>>>
>>>
>>>     However, in sgml mode, some lemmas get truncated and do not
>>>     contain valid utf8 anymore. For example, the lemma of "с҃вщаѩи"
>>>     is such a token. This problem does NOT appear if I search for
>>>     this token itself, it ONLY and consistently appears if I search
>>>     for a different token and the problematic token is in the result
>>>     set:
>>>
>>>
>>>     To sum up: I get the problem only if I search for a neighboring
>>>     token in sgml mode. I don't get it if I search for the token
>>>     itself, and I don't get it in text mode. I have reduced the
>>>     problem to a 50-token text, and the problem persists.
>>>
>>>     Any help would be greatly appreciated!
>>>     Best,
>>>     Ruprecht
>>>
>>>
>>>
>>>
>>>
>>>     _______________________________________________
>>>     CWB mailing list
>>>     CWB at sslmit.unibo.it  <mailto:CWB at sslmit.unibo.it>
>>>     http://devel.sslmit.unibo.it/mailman/listinfo/cwb
