Review "The Unicode Cookbook for Linguistics"

This book gives an overview of the intersection of the IPA and Unicode, preceded by a discussion of foundational concepts in writing systems, and followed by the suggestion to establish "Orthography profiles" as a theoretical concept, which are in turn implemented in Python and R.

Global evaluation:
The book is concise and can serve as a useful manual for Linguists Who Have Always Wondered About Unicode But Never Dared To Ask. The exposition of the background is in general well organised and easy to follow (for a technical paper). The writing is good in general, but sometimes repetitive. Minor suggestions for reorganisation of passages are given below. More illustrations and charts would make that part more accessible.

The orthography profiles section is less well written and at times a bit confused. It does not seem completely finished either, since there are dangling references and a literal TODO in the text. For this section, it will be necessary to have a very clear picture of the intended audience. For instance, do the authors assume the readers know bash, Python and R? Most passages do, but some passages explain things that a tech-savvy audience will be well aware of and find tedious. The language of the implementation part is of lower quality than that of the introductory parts.

The authors should check the use of single and double quotes across the paper, which is inconsistent. As far as I can tell, all instances of material currently in quotes should be in double quotes. Also, the authors should check Special Capitalization, which is used in some places. Finally, the bibliography has recurrent issues relating to naming of organizations, capitalization, and missing places of publication and needs to be thoroughly revised.

A section which is missing in something called a "Cookbook" would be practical recommendations on how to input Unicode characters.
There are various character selection tools, keyboard shortcuts, the shapecatcher website referenced in several places, or the Wikipedia lists of glyphs and fileformat.info. Having all this in one section would be handy for the user. It is of course unrelated to the orthography profiles, but I imagine that many people will use this book as a primer on IPA+Unicode and actually disregard the last two chapters. For this group, such a summary would be useful.

Overall recommendation: publish with minor revisions.

=============================================================================

Detailed comments

Preface
- "our research": who is "we"? At first I took this to mean the Unicode research, but apparently the day-to-day research of the authors in other fields is intended here.
- "We welcome comments ...": this might be a good place to add information about the book being available on the Paperhive platform, where readers can leave questions and comments.

Chapter 1
- 1.1.: two instances of "normally" in close succession
- automatic processing --> automated processing
- Chapter4 --> Chapter 4
- are expressed as U + n where --> are expressed as U + n, where
- Chapter?? --> resolve reference

Section "Telegraphy"
- it is slightly odd that whistling and drumming are treated as instances of teleGRAPHY here. Neither whistling nor drumming uses writing, so telePHONY or teleSONY might be more appropriate (but have obvious other issues).
- Footnote 6 "In effect ..." must be rephrased (or deleted, not sure if this information is actually relevant)

1.3 Linguistic terminology
- the definition of writing system is odd. It says "system that uses visible or tactile signs". That would be true for sign languages as well (and also for communication systems used by blind signers), but we would not want to call these signers' systems a "writing system".
A better definition could be "shapes on a support", where the shapes could be perceived with the eyes or, as with Braille, with the fingers. The support could be paper, clay, or a screen. You might find yet another definition which suits your purposes, but the current definition should be revised in any case.
- in the same section: do not italicize quoted strings.
- "as used among the world's languages to represent the language": delete
- "All orthographies are language-specific": suggest dropping "All" unless this is an empirical claim and not a general truth.

Section script systems
- "Graphemes consist of characters": rewrite this whole passage. Some problems are given below.
- It is unclear to me what a character is. Maybe "a" and "b" are characters used for the graphemes <a> and <b>, and "c" and "h" are characters used in the grapheme <ch>?
- "In practice, characters often consist of multiple building blocks, each of which could be considered a character in its own right". So characters consist of characters? Are there some "terminal characters", which do not consist of other characters? Or do "bdpq" all consist of a stroke and an eye, which are then characters??? This has to become clearer. In this overview section, it would be helpful to set off the definitions, e.g.
  (1) A GRAPHEME is ...
  (2) A CHARACTER is ...
  (3) A GLYPH is ...
- Footnote 7 refers to characters in Chinese (maybe radicals are meant here???), and then states that this will not be further explored. But about 10 lines down the page, the authors say "although a Chinese character ...", which does look like a further exploration indeed. Suggest dropping footnote 7, or alternatively making the argument with an example other than Chinese.
- "A diacritic mark ... may be above, below, or through": add left and right diacritics, e.g. Hindi long and short <i> attach to either side.
- "A GRAPHEME is the basic, minimally distinctive symbol ..., alike to the phoneme, which is an abstract representation ...": Does this entail, then, that the grapheme is an abstract representation as well? Possibly instantiated by an (allo)graph? The abstract vs. concrete nature of the grapheme might merit some discussion here.
- "For example, in English orthography <x> represents a combination of the phonemes /k/ and /s/.": give an example, e.g. /IndEks/. Note that "Xerox" or "example" use <x> for other phonemes or phoneme combinations, and that /sOks/ uses the phonemes /ks/ which are not represented by <x>.
- It might be interesting to briefly mention the non-trivial mapping of phonemes on graphemes, e.g. give/gin/jingle, where the graphemes <g>/<j> and the phonemes /g/dZ/ have a complex mapping.

1.4. The Unicode [Ss]tandard
- "distinction between universal (ASCII) versus language-particular (font).": The meaning of this escapes me.
- "However, ... ": the discussion of locales should not be found this early on in the introduction of Unicode. Suggest moving it down.

Character encoding system
- "the character encoding is represented ... , which is used to encode a set of characters": rephrase or delete
- "non-negative": delete. You are talking to a human audience. No human will try negative integers as a code point.
- Footnote 10: the equivalent of hex 0070 would be binary 0000000001110000. Since you use leading zeros in hex, you should do the same for binary. But this footnote can probably be deleted. People who know about different base encodings will find this trivial, and people who don't will find this frightening and will not profit from the description.
- "A font is a set of glyphs linked to code points": --> "A font is a set of connections between glyphs and code points"
- "(though note ...)": put in footnote
- aptly-names --> aptly-named
- "PRECOMPOSED code point" --> "PRECOMPOSED CODE POINT" (or "PRECOMPOSED WHATEVER", but not bare "PRECOMPOSED")
- "the Unicode Standard offers different kinds of normalization ...": normalization is a tough concept to grasp. Find exactly one place to discuss it and keep discussions of it elsewhere to a minimum. Either don't discuss it there at all, or mention it and cross-refer to the in-depth discussion.
- "For example, the sequence ...": this sentence no verb.
- instead of Slovak and Sisaala, readers might find English or French more useful. In the context of , one could also discuss whether <oe> and <œ> are the same grapheme, and what the status of its composing characters is, if any. A little chart or illustration exemplifying graphemes, characters, glyphs, tailored graphemes and the relations between them (meronymic and mapping) might be useful here.

Chapter 2
- Next to the notion of "pitfall", it may or may not be useful to introduce the notion of "gotcha". This is synonymous, as far as I can tell, but it emphasizes that there is a deliberate design decision which users are often not aware of and which leads to unexpected (but completely legit) behaviour.
- "differently as expected" --> than
- "practical use cases-that" --> than

Section 2.2
- It appears that before "First", there would be "Zeroth: a 1:1 mapping, e.g. :'t'"
- Subheadings would be useful here.
- I do not get the Tamil part. Are you talking about U+0B94 ஔ, which visually looks like a combination of U+0B93 ஓ and U+0BA9 ன? If so, give these glyphs. But for the case in point, Sinhala might actually be a more pertinent example. Sinhala U+0DD9 ෙ is one code point and is made up of one glyph, which looks like a spiral. U+0DDB ෛ is one code point and is made of twice that spirally glyph.
From what I understand from the passage, this would be a suitable example. If not, I have probably misunderstood the passage, and the meaning should be clarified.
- in the discussion of the storeys of <a> and <g>, the elements currently put between <> are not graphemes but glyphs. Do not use angle brackets here, but find a way to designate glyphs, e.g. by \fbox{}ing them
- "tweaking baseline and/or kerning". These concepts cannot be assumed familiar. Either delete this passage, or add explanations in a footnote, or put the complete baseline/kerning passage in a footnote and expand.

Chapter 3
- general comment: replace single quotes by double quotes everywhere. Occasionally, italics might be an option, too.
- the IPA Association --> the International Phonetic Association
- when the twain shall meet --> when the twain met OR when the twain should meet
- Footnote 2 "IPA(" --> "IPA ("

Section Principles
- "the difference in English between an aspirated /p/ in [] or an unreleased /p/ in []": I believe that if aspiration of /p/ is marked, then aspiration for the t's should be marked as well (and unreleased final /t/, if applicable). Maybe find another pair of examples.
- German, Dutch, English and French /t/: The mapping of dental/aspirated to the languages seems odd. Standard German /t/ is not dental, but Dutch /t/ is. European French /t/ is not dental (but Canadian French /t/ is). If this is taken from some other resource, a reference should be given. In any case, that list cannot be taken to be self-evident.
- "Similarly... Similarly" in close succession

Section IPA numbers
- "assigning ... assigned ... assigned". More lexical variation, please.
- "three-digit number numerical directory of digit triples". This seems like a pleonastic tautology to me.
- Footnotemark ".⁷.": delete one period
- "as an IPA symbol codings": not sure what is meant here, but agreement is odd
- "(Computer 1985,1986,1988)": please fix this in the bib file
- "16 bit" --> "16-bit"
- "were published tables" --> "were published as tables"
- "inline" --> "in line"
- "development linguistic insights" --> "development of linguistic insights."
- "along the line then" --> "along the line than"

Chapter 4
- Footnote 1: Moran 2012 --> Moran (2012)

4.3.
- the first bullet point discusses the use of the apostrophe for the glottal stop. This is indeed related to phonetics and Unicode, but the glottal stop is not represented by an apostrophe-like thing in IPA, so this problem is unrelated to the chapter entitled "IPA meets Unicode". Restrict this to the ejective marker.
- discussion of slanted glyphs: it might be worthwhile to note that we had to use the slanted approach for "A dictionary and grammatical outline of Chakali" (http://langsci-press.org/catalog/book/74), where the two different a's are used in the orthography.

4.5. multiple encoding options
- another issue which might be relevant here is the "i dot-suppression". When putting an acute on an <i>, the dot will vanish (already in the precomposed form, so this does not seem to be a font choice). This will make the combination of <i>+<´> look like the combination of <ı>+<´>. Related to that is the unavailability of dotless barred i, so <ɨ>+<´> will give you <́ɨ>, with both dot and acute. This is a case where the combining overlay bar is useful; I will return to this below. I am not sure whether the dot-suppression should be discussed here, or whether it might be some other kind of pitfall, but it should be mentioned somewhere.
- Footnote 7: when discussing superscript numbers for tone, it could be worthwhile to say a word about Unicode superscript numbers (https://en.wikipedia.org/wiki/Superscripts_and_Subscripts) vs. using the text processor superscript formatting.
- "only one options" --> "only one option"
- missing decomposition: the case of accented barred i could be discussed here.

4.9.
- "Tone letters are normally written ...": give the Unicode code points here as well, in analogy to bullet points 1 and 3.

4.10.
- "but strangely enough ...": delete. The cedilla issue has been discussed before.
- "the diacritics on top of a character": to me, "on top" is a synonym of "above", so an acute would be on top of an for instance. My interpretation might be unwarranted, but maybe another preposition like "across" can be found, or something like "overlay".
- "removed symbols as labels them as clicks": rephrase
- "Mc Laughlin": delete space?

4.12.
- Drop Moran (2012) in first sentence.

4.13.
- is widened-IPA a superset of valid-IPA or only a superset of strict-IPA? A diagram showing the relation of the different sets might be handy.
- I feel that superscript nasals for prenasalized consonants should be discussed somewhere. I am not sure as to their status in IPA, but they are certainly used a lot. The current list only gives the superscript n, but at least m and engma should be discussed.

5.1
- "Although these variants are visually homoglyphs": this has been discussed under normalization. Shorten or delete this and add a cross-ref
- "For example can": rephrase

5.2.
- "possibly ... possibly": rephrase

5.3. Formal specification
- give a figure with an open file where you indicate the relevant portions at the very beginning of this section. The prose description is very dense, and an illustration will make it easier to relate.

Section Implementation
- can orthography formats handle negative lookahead/lookbehind? Note that this is different from a positive look for the set complement if the string looked for is longer than 1.

B6. I feel that a language which has both <sch> and <ch> will get you different results depending on whether the global or the linear algorithm is used.
"Ascha" should give you "A sch a" with the linear approach, but "a s ch a" with the global approach iff "ch" is matched before "sch". C8. - "To treat the profile literal" --> literally Chapter 6 - "Each section has a number of subsections that ... " - the example "aäaaöaüaa" will confuse a lot of ordinary working linguists. There are too many vowels, and the representation of "aa" by is weird. I suggest using the German word "Schächtelchen", where you have both and to show the importance of ordering, and a dieresis to show decomposition (which seem to be the main didactic aspects of this passage to me). - the code examples should be numbered - the tabs in the code should have the same length, so that the headers and the columns align - "edit the profile: more": --> "edit the profile: vim" (or any other editor, but not more) - "The location can be found in the following command in R.": provide the command (or take care of the page break) - "Try tokenize to test the functionality": why is there an error message???? - "These webapps are available online at TODO": provide URL - "we first assign the two strings to a variable test": use a more descriptive name than "test" Section Using an orthography profile skeleton - " It is also possible ...It is also possible": rephrase - maybe this passage should be written with Italian or Spanish orthography as an example since English orthography is such a mess. Why are "ie" and "oo" digraphs? If so, would "ti" in "nation" count as a digraph as well? And why is the second "e" in "scheme" listed at all? It does not correspond to any phoneme (but arguably it influences the first "e"). Also, for me, "sch" in "scheme" consist of two phonemes (/sk/), just like "sch" in "mischief (/stS/)". But the passage "Now mischief is parsed correctly, but scheme is wrong" suggests that the author sees this differently??? - Footnote 3: delete. 
Feeding and bleeding are needlessly complicating the exposition.
- RESULT_ERRORS.TSV: I strongly suggest either creating an empty file when no errors are reported, or including "no errors" in the file. The meaningful absence of computer files can cause all kinds of problems.
- "Such contextually-specified graphemes are based on regular expressions so you can also use regular expressions in the descriptions of the context": who would've thought that!
- "It is important to realize ... ": delete. The need for escape sequences has been discussed earlier.
- Table 6.11: I was wondering whether graphemes can be in several classes. "Y" could be in the consonant class and in the vowel class for instance. Can orthography profiles handle this, and if they can, how would this be entered?
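
Addendum to comment B6: the difference I have in mind can be sketched in a few lines of Python. This is my own minimal illustration and not the authors' implementation. The function names tokenize_linear and tokenize_global and the toy profile are hypothetical. I assume that "linear" means a left-to-right scan taking the longest matching grapheme at each position, and that "global" means claiming all occurrences of each grapheme in the order in which the profile lists them.

```python
def tokenize_linear(text, graphemes):
    """Left-to-right scan: at each position take the longest matching grapheme."""
    by_length = sorted(graphemes, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        # Fall back to the single character if no grapheme matches here.
        match = next((g for g in by_length if text.startswith(g, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


def tokenize_global(text, graphemes):
    """Global pass: claim every free occurrence of each grapheme, in profile order."""
    owner = [None] * len(text)  # owner[i] = grapheme that has claimed position i
    for g in graphemes:
        start = text.find(g)
        while start != -1:
            span = range(start, start + len(g))
            if all(owner[j] is None for j in span):
                for j in span:
                    owner[j] = g
                start = text.find(g, start + len(g))
            else:
                # Part of this stretch is already taken by an earlier grapheme.
                start = text.find(g, start + 1)
    tokens, i = [], 0
    while i < len(text):
        token = owner[i] if owner[i] is not None else text[i]
        tokens.append(token)
        i += len(token)
    return tokens


# Toy profile in which "ch" is listed (and hence matched) before "sch".
profile = ["ch", "sch", "a", "s"]
print(" ".join(tokenize_linear("ascha", profile)))
print(" ".join(tokenize_global("ascha", profile)))
```

Under these assumptions the two functions disagree on "ascha" as soon as "ch" precedes "sch" in the profile. The text should state which of the two behaviours users can expect, and how the order of lines in the profile affects the result.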