Unicode diacritics using combining characters fail in Numbers

Unicode characters normally work very well in Numbers, but I think I have found a bug in how some text functions, such as RIGHT(), behave when Unicode combining characters are used.


There are several ways to write a Greek small iota with perispomeni (ῖ):

  1. build it from scratch: "ι"&CHAR("0x342")
  2. use the existing precomposed Unicode character: CHAR("0x1FD6")


Similarly, Greek iota with macron and acute accent (ῑ́) can be

  1. built from scratch: "ι"&CHAR("0x304")&CHAR("0x301"), or
  2. built from the precomposed Unicode Greek small iota with macron: CHAR("0x1FD1")&CHAR("0x301").


In Numbers, the resulting characters are identical under = in both cases, though they have different values under CODE(): 4093103833860 for the construction from scratch, and 34982508626689 for the construction based on Greek small iota with macron.
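
One possible reading of these two CODE() values, offered as an observation rather than documented Numbers behaviour: each number splits cleanly into two 32-bit halves, and the halves match the first two code points of the respective construction. A minimal Python sketch of that arithmetic (split_code_value is a hypothetical helper name, not a Numbers function):

```python
def split_code_value(value: int) -> list[str]:
    """Split a reported CODE() result into its upper and lower 32-bit halves.

    Assumption: the value packs two code points as (first << 32) | second.
    This is inferred from the numbers in the post, not from Numbers docs.
    """
    high, low = divmod(value, 1 << 32)
    return [f"U+{high:04X}", f"U+{low:04X}"]


# The two values reported in the post.
for reported in (4093103833860, 34982508626689):
    print(reported, split_code_value(reported))
```

If this reading is right, the first value is ι (U+03B9) followed by the combining macron (U+0304), and the second is ῑ (U+1FD1) followed by the combining acute (U+0301). Only two code points fit, so in the from-scratch construction the trailing acute would not show up in the number at all.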


In general this all works well in Numbers, at least once a suitable font is used (here Arial).


Indeed, if I use only single, Unicode-ready (precomposed) characters, everything is fine. For instance, the last 5 characters of αβγῑκλμῖπρσ, obtained with RIGHT(*, 5), are, correctly, μῖπρσ. Five in total.


Problems: if I create any diacritics using Unicode combining characters and CHAR(), the last FIVE (5) characters are wrong:

  • αβγικλμῖπρσ gives ῖπρσ, which is 4 characters - WRONG
  • αβγῑ́κλμιπρσ gives κλμιπρσ or λμιπρσ (depending on construction), which is 7 or 6 characters respectively - WRONG
  • αβγῑ́κλμῖπρσ gives κλμῖπρσ or λμῖπρσ (depending on construction), which is 7 or 6 characters respectively - WRONG
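
The first of these results is what one would see if a function counted code points rather than user-perceived characters. The following Python sketch illustrates that distinction only; it is not a description of how Numbers implements RIGHT(), and it does not explain the 6- and 7-character results. The helpers clusters and right_by_cluster are hypothetical names, and the grouping rule (attach every combining mark to the preceding character) is a simplification of full Unicode grapheme cluster segmentation.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    out: list[str] = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def right_by_cluster(text: str, n: int) -> str:
    """Return the last n user-perceived characters of text."""
    return "".join(clusters(text)[-n:]) if n > 0 else ""


# Iota with perispomeni built from scratch: base iota + U+0342.
word = "αβγικλμ" + "ι\u0342" + "πρσ"

print(len(word))                  # number of code points
print(len(clusters(word)))        # number of user-perceived characters
print(word[-5:])                  # last 5 code points
print(right_by_cluster(word, 5))  # last 5 user-perceived characters
```

Slicing by code point returns the iota, its combining perispomeni, and three more letters, which displays as four characters; slicing by cluster returns five.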


So there is a problem with both of the diacritics I tested, both when used together and in isolation.

[Screenshot: the worksheet used to show the problem]


Tried so far:

  • checked that the diacritics are from the right Unicode blocks
  • removed any non-printing characters with CLEAN() and TRIM(), and checked the result with LEN().


Questions:

  • How can I construct Unicode characters with combining characters that are correctly understood by Numbers?
  • What's the underlying problem here?


If there are Numbers and/or Unicode resources on this topic, I'd be grateful for pointers.

Mac Studio, macOS 15.1

Posted on Jan 29, 2025 7:07 AM


21 replies

Jan 31, 2025 7:03 AM in response to arkki

arkki wrote:

Let's hope both Unicode extend their character sets

They won't be creating any more precomposed characters like that. For display, either fonts need to create the right glyphs from a sequence of characters, or the Private Use Area (PUA) needs to be used. For the latter, there is a chart of one convention for polytonic Greek at


https://apagreekkeys.org/technicalDetails.html


I'm not familiar with the kind of manipulation you are seeking to do in a spreadsheet with polytonic Greek text. How would you describe the goal of that? Is it essentially to identify a specific number of characters or glyphs at the ends of words?


Jan 30, 2025 3:44 AM in response to Tom Gewecke

This came up trying to keep track of paradigms, such as


  • δελφῑ́ς, δελφῖνος
  • γῡ́ψ, γῡπός


Normal polytonic text doesn't show macrons (unfortunately), but they're needed for grammar.


Glad one can write all these characters though, even if it's difficult to see what's actually going on. For instance, if I translate the CODE output to HEX, there is no indication of what breathing or accent the character has, just the macron:


  • ᾱ̔́ (alpha macron asper acute) gives 0x3B100000304, and so does


  • ᾱ̓̀ (alpha macron lenis grave).


(In the above, the unicode normalisation in Arial?/Safari? seems to stack up everything, which is not correct behaviour either, but there are other fonts where diacritics get arranged correctly.)


Jan 30, 2025 4:55 AM in response to arkki

arkki wrote:


Normal polytonic text doesn't show macrons (unfortunately), but they're needed for grammar.

Aha. Without the macrons I suppose everything would work right, because all the characters are available in precomposed form?



Glad one can write all these characters though, even if it's difficult to see what's actually going on. For instance, if I translate the CODE output to HEX, there is no indication of what breathing or accent the character has, just the macron:

ᾱ̔́ (alpha macron asper acute) gives 0x3B100000304, and so does

That doesn't seem right, will try to check myself. How are you getting the hex exactly?


Jan 30, 2025 10:18 AM in response to Tom Gewecke

Thanks for thinking about this.


According to the Unicode docs, 1FDE has a space in it, so that's why the diacritics go next to the character: ᾱ῞, not on top of it.


I think the forum website does some normalising when I write to it, which is annoying. But there are at least these ways of writing alpha macron asper acute:


I have checked that the DEC2HEX() call works as I'd naively expect: takes a decimal number and spits out the correct base-16 number in the usual notation. Am I missing something here?


The UTF-16 figure you suggest (0x1FB103140301) would make a whole lot of sense but, unfortunately, when I construct it by hand as a decimal (34845121315585) and then feed that into CHAR(), the result is an empty string. So it's not implemented, I guess.

Jan 30, 2025 11:29 AM in response to arkki

You are right, I forgot that 1FDE and similar are not "combining" but spacing, so you can't put them over a base.


ᾱ̔́ is three characters, so there is no way to represent it with a single number, whether in decimal or in a UTF format; it needs 3 separate numbers. If you put the macron on top, you can reduce it to two characters/numbers.
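
The three-versus-two count can be checked with Unicode normalization outside Numbers. This is a small Python sketch using the standard unicodedata module; nfc_codepoints is a hypothetical helper name, and the two input orderings are my reading of "macron first" and "macron on top", not something taken from a Numbers worksheet.

```python
import unicodedata


def nfc_codepoints(text: str) -> list[str]:
    """Normalize to NFC and list the resulting code points as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


# alpha + macron + rough breathing (asper) + acute
macron_first = "\u03b1\u0304\u0314\u0301"
# alpha + rough breathing (asper) + acute + macron
macron_last = "\u03b1\u0314\u0301\u0304"

print(nfc_codepoints(macron_first))
print(nfc_codepoints(macron_last))
```

With the macron first, NFC can only fold the macron into the base letter (U+1FB1), leaving three code points. With the macron last, the breathing and accent fold into a precomposed alpha (U+1F05) and only the macron remains separate, giving two. The two orderings are distinct sequences, so a font may stack the marks differently for each.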


It's perhaps unfortunate that bold or underline was not chosen for the convention represented by macron in this kind of text, as it would not show up in the codepoints:-)


I'm wondering if there are some special Greek fonts which have precomposed versions of these things, though that could cause other problems, as they are not in Unicode as singles.



Jan 31, 2025 7:29 AM in response to Tom Gewecke

I'm not familiar with the kind of manipulation you are seeking to do in a spreadsheet with polytonic Greek text. How would you describe the goal of that? Is it essentially to identify a specific number of characters or glyphs at the ends of words?


I'm learning the language. This is for making lists of lexemes and manipulating them to produce inflectional forms. Helps to spot similarities and differences.

