Unicode diacritics using combining characters fail in Numbers

Question

Level 1

12 points


Unicode characters normally work very well in Numbers, but I think I have found a bug in how some text functions, such as RIGHT(), behave when Unicode combining characters are used.

There are several ways to write a Greek small iota with perispomeni (ῖ)

build from scratch "ι"&CHAR("0x342")
use existing unicode CHAR("0x1FD6")

Similarly, Greek iota with macron and acute accent (ῑ́) can be

built from scratch "ι"&CHAR("0x304")&CHAR("0x301"), or
using unicode Greek small iota with macron CHAR("0x1FD1")&CHAR("0x301").

In Numbers, the resulting characters are in both cases identical under =, though they have different values under CODE(): 4093103833860 for the version built from scratch and 34982508626689 for the version starting from Greek small iota with macron.
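A side note on those two CODE() values: written in hex, each one looks like the first two code points of the cell packed side by side, with the first shifted left by 32 bits. This is only an inference from the two numbers quoted above, not documented Numbers behaviour. A quick Python check of the arithmetic (the helper name `pack_two` is made up for illustration):

```python
# Hypothetical helper: pack two code points the way the observed CODE() values
# appear to be laid out (first code point in the high 32 bits, second in the low).
def pack_two(first: int, second: int) -> int:
    return (first << 32) | second

# Greek small iota (U+03B9) followed by combining macron (U+0304)
from_scratch = pack_two(0x03B9, 0x0304)
# Greek small iota with macron (U+1FD1) followed by combining acute (U+0301)
from_precomposed = pack_two(0x1FD1, 0x0301)

print(from_scratch)      # 4093103833860
print(from_precomposed)  # 34982508626689
```

If that reading is right, the third code point of the from-scratch version (the combining acute, U+0301) does not show up in the number at all.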

In general this all works well in Numbers, at least once a suitable font is used (here Arial).

Indeed, if I use only single, Unicode-ready (precomposed) characters, everything works. For instance, the last 5 characters of αβγῑκλμῖπρσ, taken with RIGHT(*, 5), are correctly μῖπρσ: five in total.

Problems: if I create any diacritics using Unicode combining characters and CHAR(), the last FIVE (5) characters are wrong

αβγικλμῖπρσ gives ῖπρσ, which is 4 characters - WRONG
αβγῑ́κλμιπρσ gives κλμιπρσ or λμιπρσ (depending on construction), which is 7 resp. 6 characters - WRONG
αβγῑ́κλμῖπρσ gives κλμῖπρσ or λμῖπρσ (depending on construction), which is 7 resp. 6 characters - WRONG
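For comparison, here is a minimal Python sketch of the difference between counting code points and counting user-perceived characters. It does not reproduce what Numbers does internally (that is not known here); it only shows that the two ways of counting disagree as soon as combining characters appear. The functions `graphemes` and `right_graphemes` are hypothetical helpers, and the grouping is simplified: it attaches each combining mark to the preceding base and is not the full Unicode text segmentation algorithm.

```python
import unicodedata

def graphemes(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplified: relies on the canonical combining class only.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def right_graphemes(text: str, n: int) -> str:
    """Return the last n user-perceived characters of text."""
    return "".join(graphemes(text)[-n:])

# First test word: iota + combining perispomeni (U+0342) in eighth position
word = "αβγικλμ" + "ι\u0342" + "πρσ"

by_code_points = word[-5:]                # slices code points, splitting off the base
by_graphemes = right_graphemes(word, 5)   # keeps base and mark together

print(len(word), len(graphemes(word)))    # 12 11
print(len(graphemes(by_code_points)))     # 4
print(len(graphemes(by_graphemes)))       # 5
```

Plain code-point slicing happens to give a 4-character result for this first word, but it would not explain the 6- and 7-character results for the other two, so it is at best part of the picture.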

So there is a problem with both of the diacritics I tested, both when used together and in isolation.

[Screenshot of the worksheet used to show the problem]

Tried so far:

checked that the diacritics are from the right Unicode blocks
removed any non-printing matter with CLEAN() and TRIM(), and checked the result with LEN().

Questions:

How can I construct Unicode characters with combining characters that are correctly understood by Numbers?
What's the underlying problem here?

If there are Numbers and/or Unicode resources on this topic, I'd be grateful.

Mac Studio, macOS 15.1

Posted on Jan 29, 2025 7:07 AM


Answer 1

Tom Gewecke

Level 10

122,318 points

Jan 29, 2025 9:53 AM in response to arkki

arkki wrote:

How can I construct Unicode characters with combining characters that are correctly understood

If the app does not currently work right with combining diacritics, then one fix might be to apply Unicode NFC normalization to the data beforehand.

https://unicode.org/reports/tr15/#Norm_Forms
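As a minimal sketch of what NFC normalization does to the sequences from the question, here is a Python example using the standard `unicodedata` module. Normalizing the data outside Numbers and pasting it back in is an assumed workflow; nothing here relies on a Numbers function. The helper `code_points` is only for display.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """List the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# iota + combining perispomeni: a precomposed character exists (U+1FD6)
perispomeni = unicodedata.normalize("NFC", "\u03B9\u0342")
# iota + combining macron + combining acute: only iota+macron is precomposed
macron_acute = unicodedata.normalize("NFC", "\u03B9\u0304\u0301")

print(code_points(perispomeni))   # ['U+1FD6']
print(code_points(macron_acute))  # ['U+1FD1', 'U+0301']
```

So NFC folds the perispomeni case into a single code point, but a letter carrying both a macron and an accent still comes out as more than one code point, because Unicode has no single precomposed character for it.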


Answer 2

Tom Gewecke

Level 10

122,318 points

Jan 29, 2025 10:44 AM in response to arkki

PS One app that can do Unicode Normalization is

https://earthlingsoft.net/UnicodeChecker/index.html


Answer 3

Tom Gewecke

Level 10

122,318 points

Jan 31, 2025 7:03 AM in response to arkki

arkki wrote:

Let's hope both Unicode extend their character sets

They won't be creating any more precomposed characters like that. For display, either fonts need to create the right glyphs from a sequence of characters, or the PUA needs to be used. For the latter, there is a chart of one convention for polytonic Greek at

https://apagreekkeys.org/technicalDetails.html

I'm not familiar with the kind of manipulation you are seeking to do in a spreadsheet with polytonic Greek text. How would you describe the goal of that? Is it essentially to identify a specific number of characters or glyphs at the ends of words?


Answer 4

Tom Gewecke

Level 10

122,318 points

Jan 29, 2025 9:24 AM in response to arkki

To let Apple know about this problem, you can use

http://www.apple.com/feedback

Include a link to this discussion so they can see all your examples.


Answer 5

arkki Author

Level 1

12 points

Jan 29, 2025 1:19 PM in response to Tom Gewecke

Thanks! The app seems to be picking up only the last diacritic component (ie the combining character). If Numbers does the same somewhere inside the code, that could be the trouble. I'll keep experimenting.


Answer 6

Tom Gewecke

Level 10

122,318 points

Jan 29, 2025 8:04 PM in response to arkki

arkki wrote:

Thanks! The app seems to be picking up only the last diacritic component (ie the combining character). If Numbers does the same somewhere inside the code, that could be the trouble. I'll keep experimenting.

Are you testing this with random Unicode text, or is your data actual polytonic Greek text?


Answer 7

arkki Author

Level 1

12 points

Jan 30, 2025 3:44 AM in response to Tom Gewecke

This came up while trying to keep track of paradigms, such as

δελφῑ́ς, δελφῖνος
γῡ́ψ, γῡπός

Normal polytonic text doesn't show macrons (unfortunately), but it's needed for grammar.

Glad one can write all these characters though, even if it's difficult to see what's actually going on. For instance, if I translate the CODE output to HEX, there is no indication of what breathing or accent the character has, just the macron:

ᾱ̔́ (alpha macron asper acute) gives 0x3B100000304, and so does

ᾱ̓̀ (alpha macron lenis grave).

(In the above, the unicode normalisation in Arial?/Safari? seems to stack up everything, which is not correct behaviour either, but there are other fonts where diacritics get arranged correctly.)


Answer 8

Tom Gewecke

Level 10

122,318 points

Jan 30, 2025 4:55 AM in response to arkki

arkki wrote:

Normal polytonic text doesn't show macrons (unfortunately), but it's needed for grammar.

Aha. Without the macrons I suppose everything would work right, because all the characters are available in precomposed form?

Glad one can write all these characters though, even if it's difficult to see what's actually going on. For instance, if I translate the CODE output to HEX, there is no indication of what breathing or accent the character has, just the macron:

ᾱ̔́ (alpha macron asper acute) gives 0x3B100000304, and so does

That doesn't seem right, will try to check myself. How are you getting the hex exactly?


Answer 9

Tom Gewecke

Level 10

122,318 points

Jan 30, 2025 8:02 AM in response to arkki

arkki wrote:

I just used DEC2HEX() in Numbers.

I'm not sure what that is supposed to do when applied to text. I think the actual UTF-8 hex for your alpha macron asper acute is E1 BE B1 CC 94 CC 81, or in UTF-16, 1FB1 0314 0301.
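Those byte values can be checked outside Numbers. A minimal Python sketch, assuming the character is built as alpha with macron (U+1FB1) followed by the combining rough breathing (U+0314) and combining acute (U+0301):

```python
# alpha with macron (U+1FB1) + combining rough breathing (U+0314)
# + combining acute (U+0301); this construction is an assumption
text = "\u1FB1\u0314\u0301"

# One group per byte for UTF-8
utf8_hex = text.encode("utf-8").hex(" ").upper()
# Big-endian UTF-16 without a byte-order mark, grouped per 16-bit unit
utf16_hex = text.encode("utf-16-be").hex(" ", 2)

print(utf8_hex)   # E1 BE B1 CC 94 CC 81
print(utf16_hex)  # 1fb1 0314 0301
```

Either way the text is a sequence of three code points, not one number.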

Would it be any better to use 1FDE dasia and oxia instead of 314 and 301?


Answer 10

arkki Author

Level 1

12 points

Jan 30, 2025 10:18 AM in response to Tom Gewecke

Thanks for thinking about this.

According to Unicode docs, 1FDF has a space in it, so that's why the diacritics go next to the character: ᾱ῞, not on top of it.

I think the forum website does some normalising when I write to it, which is annoying. But there are at least these ways of writing alpha macron asper acute:

I have checked that the DEC2HEX() call works as I'd naively expect: takes a decimal number and spits out the correct base-16 number in the usual notation. Am I missing something here?

The utf-16 figure you suggest (0x1FB103140301) would make a whole lot of sense but, unfortunately, when I construct it by hand as a decimal (34845121315585) and then feed that into CHAR(), the result is an empty string. So not implemented then, I guess.
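For what it's worth, the decimal figure does match the three UTF-16 units written side by side. A quick Python check of the arithmetic only; it makes no claim about what CHAR() accepts:

```python
# The three UTF-16 code units: alpha with macron, rough breathing, acute
units = [0x1FB1, 0x0314, 0x0301]

# Concatenate them into one 48-bit number, 16 bits per unit
packed = 0
for unit in units:
    packed = (packed << 16) | unit

print(hex(packed))  # 0x1fb103140301
print(packed)       # 34845121315585
```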


Answer 11

Tom Gewecke

Level 10

122,318 points

Jan 30, 2025 11:29 AM in response to arkki

You are right, I forgot that 1FDE and similar are not "combining" but spacing, so you can't put them over a base.

ᾱ̔́ is three characters. There is no way to represent it with a single number, whether decimal or in a UTF format; it needs three separate numbers. If you put the macron on top, you can reduce it to two characters/numbers.
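A minimal Python sketch of that count, under one reading of "macron on top": the macron is entered after the breathing and accent, so that NFC can fold those two into an existing precomposed letter. The constant names are just labels for this example.

```python
import unicodedata

ALPHA, MACRON, DASIA, ACUTE = "\u03B1", "\u0304", "\u0314", "\u0301"

# Macron entered first: NFC can only fold the macron into the base (U+1FB1)
macron_first = unicodedata.normalize("NFC", ALPHA + MACRON + DASIA + ACUTE)
# Macron entered last: breathing and accent fold into U+1F05, macron stays separate
macron_last = unicodedata.normalize("NFC", ALPHA + DASIA + ACUTE + MACRON)

print([hex(ord(c)) for c in macron_first])  # ['0x1fb1', '0x314', '0x301']
print([hex(ord(c)) for c in macron_last])   # ['0x1f05', '0x304']
```

The two orderings are distinct sequences in Unicode terms, so they will not compare equal after normalization, and fonts may stack the marks differently.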

It's perhaps unfortunate that bold or underline was not chosen for the convention represented by macron in this kind of text, as it would not show up in the codepoints:-)

I'm wondering if there are some special Greek fonts which have precomposed versions of these things. Though that could cause other problems, as they are not in Unicode as single characters.


Answer 12

arkki Author

Level 1

12 points

Jan 31, 2025 4:39 AM in response to Tom Gewecke

Thanks, this sounds like the underlying cause of the problem. Let's hope both Unicode extend their character sets, and Apple implement more structure on their side.

In the process I've now reported two bugs:

the initial issue with RIGHT()
that CODE() and CHAR() are not inverse maps where defined.


Answer 13

arkki Author

Level 1

12 points

Jan 31, 2025 7:29 AM in response to Tom Gewecke

I'm not familiar with the kind of manipulation you are seeking to do in a spreadsheet with polytonic Greek text. How would you describe the goal of that? Is it essentially to identify a specific number of characters or glyphs at the ends of words?

I'm learning the language. This is for making lists of lexemes and manipulating them to produce inflectional forms. Helps to spot similarities and differences.


Answer 14

Tom Gewecke

Level 10

122,318 points

Jan 31, 2025 8:36 AM in response to arkki

arkki wrote:

That'd work, though it somewhat defeats the purpose of Unicode.

Yes, it's annoying to need to have the font installed, or embedded in the doc. There is at least a WOFF version of New Athena, which you can put on a web server and thus force browsers to use it when viewing a web page you have created with it.


Answer 15

Tom Gewecke

Level 10

122,318 points

Jan 31, 2025 8:38 AM in response to arkki

arkki wrote:

I'm learning the language. This is for making lists of lexemes and manipulating them to produce inflectional forms. Helps to spot similarities and differences.

That sounds interesting. Are you working with an online digital corpus of some sort, or creating the text yourself?

