<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body dir="auto"><div dir="auto">Brilliant! Thanks as ever for your full and precise answer, Andrew. I think that, given the material, I'm probably looking at a parallel-corpus setup. I'm going to have fun transforming my colleague's fairly anarchic Word files into something palatable, but that's another story! </div><div dir="auto">Best, </div><div dir="auto">Graham.</div><div dir="auto"><br></div><div align="left" dir="auto" style="font-size:100%;color:#000000"><div>-------- Original message --------</div><div>From: "Hardie, Andrew via CWB" <cwb@sslmit.unibo.it> </div><div>Date: 21/12/2025 15:41 (GMT+01:00) </div><div>To: Open source development of the Corpus WorkBench <cwb@sslmit.unibo.it> </div><div>Cc: "Hardie, Andrew" <a.hardie@lancaster.ac.uk> </div><div>Subject: Re: [CWB] cqpweb and phonetic transcription </div><div><br></div></div>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">I’ve indexed various corpora whose primary token stream was an IPA transcription (because the language was one without a written form).
It works just as normal. Remember CQPweb as software is totally agnostic as to the script that the data uses, so IPA is just as good as Latin, Greek, Cyrillic, Japanese, or whatever.</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">But that means that, just like data in any other script, you need it to be tokenised, and any word-level annotation needs to be presented
alongside the tokens as extra columns in the VRT file. </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">So for instance you can have IPA as an annotation, possibly alongside others, e.g. a POS tag, as here:</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">my maɪ POSSPRO</span></b></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">name ne:m NOUN</span></b></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">is ɪz VERB</span></b></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">Andrew andɹu: NOUN</span></b><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">Or you can have the primary data be in IPA, and then optionally add the orthographic form as an annotation:</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">maɪ my
</span></b></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">ne:m name
</span></b></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">ɪz is
</span></b></p>
<p style="margin-left:36.0pt" class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Courier New";color:#156082;mso-fareast-language:EN-US">andɹu: Andrew
</span></b><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"></span></p>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></b></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">In sum, if your standard French and your IPA transcriptions line up word by word, you can use one of them as an annotation on the other.
Then you can search on either in the usual way, using either CQL or simple query. This is the best and most flexible approach.</span></p>
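<p class="MsoNormal">As a minimal sketch of the word-by-word approach, the Python below pairs two token lists into tab-separated token lines of the kind shown in the examples above. The function name to_vrt_lines and the variable names are hypothetical, and the sketch assumes tokenisation has already been done; it only checks that the two streams really do line up.</p>

```python
def to_vrt_lines(orthographic, ipa):
    """Pair word-aligned tokens into tab-separated token lines.

    Hypothetical helper: the first column is the primary token,
    the second column is the annotation on it.
    """
    if len(orthographic) != len(ipa):
        # Without a one-to-one lineup, one stream cannot annotate the other.
        raise ValueError(
            "token counts differ: %d orthographic vs %d IPA"
            % (len(orthographic), len(ipa))
        )
    lines = []
    for word, phon in zip(orthographic, ipa):
        for form in (word, phon):
            # An empty token, or a tab or newline inside a token,
            # would break the column layout.
            if not form or any(ch in form for ch in "\t\n\r"):
                raise ValueError("invalid token: %r" % form)
        lines.append(word + "\t" + phon)
    return lines


# Tokens taken from the examples in this message.
words = ["my", "name", "is", "Andrew"]
phones = ["maɪ", "ne:m", "ɪz", "andɹu:"]
print("\n".join(to_vrt_lines(words, phones)))
```

<p class="MsoNormal">Swapping the two arguments gives the IPA-first layout instead. A mismatch in token counts raises an error rather than silently misaligning the columns.</p>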
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">If the word lineup
<i>doesn’t</i> match, so you can’t do it as per above, then either of the techniques you mention, i.e. giving the standard French as a sentence-level translation, or using two “parallel” corpora, would work. Neither is the ideal way to handle this kind of data, but
if you don’t have tokenisation lineup, then you might have to go with one of these.</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">>></span> Would the first type allow for searches that start with the IPA transcription?<span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">So long as your IPA data is either the “word” (first column of the input) or an annotation (any later column, e.g. the second), you can search it.
</span></p>
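<p class="MsoNormal">To illustrate, here is a small Python sketch that builds a CQP-style token-sequence query against a named attribute. The attribute name ipa is an assumption (the actual handle is whatever the annotation is called when the corpus is indexed), and cqp_sequence is a hypothetical helper. It deliberately refuses forms containing regular-expression metacharacters or double quotes rather than guessing at escaping rules.</p>

```python
# Characters that have special meaning inside a quoted CQP pattern.
REGEX_SPECIALS = set('\\.^$*+?()[]{}|"')


def cqp_sequence(attribute, forms):
    """Build a query matching the given forms as consecutive tokens.

    Hypothetical helper: 'attribute' is the name of the column to
    search, e.g. "word" for the primary token stream.
    """
    tokens = []
    for form in forms:
        if any(ch in REGEX_SPECIALS for ch in form):
            # Escaping is out of scope for this sketch.
            raise ValueError("form needs escaping first: %r" % form)
        tokens.append('[%s="%s"]' % (attribute, form))
    return " ".join(tokens)


# Search on the IPA annotation (assumed to be named "ipa") ...
print(cqp_sequence("ipa", ["maɪ", "ne:m"]))
# ... or on the primary token stream.
print(cqp_sequence("word", ["my", "name"]))
```

<p class="MsoNormal">Either string can then be pasted into a CQL query box; which attribute holds the IPA depends on which of the two layouts above was used.</p>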
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">(Your users would need an IPA soft keyboard of course. I am working on adding soft keyboards, but it’s not complete yet.)</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">>></span> One last question: I think that the audio could be linked to the files as metadata. Is this right?<span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">Yes. See admin manual section 7.5.1. Provide the address of the files with the
<b>audio:</b> prefix.</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">best</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US"> </span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#156082;mso-fareast-language:EN-US">Andrew.</span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm;font-size:pt">
<p class="MsoNormal"><b><span style="font-family:"Calibri",sans-serif">From:</span></b><span style="font-family:"Calibri",sans-serif"> CWB <cwb-bounces@sslmit.unibo.it>
<b>On Behalf Of </b>Graham Ranger -- UAPV via CWB<br>
<b>Sent:</b> 21 December 2025 13:25<br>
<b>To:</b> cwb@sslmit.unibo.it<br>
<b>Cc:</b> Graham Ranger -- UAPV <graham.ranger@univ-avignon.fr><br>
<b>Subject:</b> [CWB] cqpweb and phonetic transcription</span></p>
</div>
</div>
<p class="MsoNormal"> </p>
<div>
<p class="MsoNormal">Hello again,<br>
A second question, on a different thread for clarity: does anybody have experience with text and phonetic transcription? Specifically, I have transcriptions of interviews made 30-40 years ago in a form of regional French that only had 40 speakers at the time.
I have 1) IPA transcriptions, with one or two local conventions for pauses, etc. and 2) reformulations in standard French. The variety being exclusively oral, this is all I have. Now, I would imagine that I could do this either as a corpus and its "translation"
or as a single corpus with the transcriptions as sentence-level attributes <s trans="..."> or something like that. Would the first type allow for searches that start with the IPA transcription? The second type appears of rather limited interest, since searches
would need to start with the reformulation. One last question: I think that the audio could be linked to the files as metadata. Is this right?<br>
In short, any accounts of user experiences with similar corpora would be very helpful!<br>
Best,<br>
Graham. </p>
</div>
</div>
</body></html>