What is the sort order Finder uses?

Question

I did a little research on this question and discovered that, unless I'm missing something, Finder's sort order isn't what it is supposed to be.

The only specification I found is this archived tech note How Finder lists items that are sorted by name (Mac OS X) from 2012.

Almost everything there does indeed describe how Finder behaves, except for the note at the very bottom:

Technically speaking, Finder sorting is based on the Unicode Collation Algorithm, defined by the Unicode Consortium. This standard provides a complete and unambiguous sort ordering for all Unicode characters

And that complicates things, since (for example) the underscore character (officially known as "low line"), which sorts to the top and thus before all alphanumeric characters, has a Unicode value of U005F, which is right between the upper and lower case Latin characters: "Z" is U005A and "a" is U0061.

The reason I'm asking is that I'd like to find a character that sorts after all Latin characters. Plenty of them sort before, including <space>, underscore/<low line> and (my favorite for folders) <right-pointing double angle quotation mark> (i.e., "»").

If it were really sorting by Unicode value, it would be simple to use the Character selector to insert a high-valued Unicode character, such as U25B9, which is "▹". But that one sorts after the <space>, <low line>, and "»", yet before all the Latin characters.

I have found one abstract character that works for this, but I'm clueless as to why. The Unicode block starting at U1400 is entitled "Unified Canadian Aboriginal Syllabics", and U1433, "Canadian Syllabics Po", is a character that shows up in Finder as a very large greater-than symbol: "ᐳ". (The greater-than symbol is ">" for comparison.)

So WHY?

Why does Finder think this is the correct (partial) order:

U0020 " " (Space)

U005F "_" (Low Line) <underscore>

U00BB "»" (Right-pointing double angle quotation mark)

U005C "\" (Reverse Solidus) <backslash>

U25B7 "▷" (White right-pointing triangle)

U0041 "A"

U007A "z"

U1433 "ᐳ" (Canadian Syllabics Po)
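To make the mismatch concrete, here is a minimal Python sketch (Python is only my choice for illustration; it has nothing to do with how Finder is built). It sorts the same characters by raw code point, which is what Python's default string comparison does, and compares the result with the order listed above. The `finder_order` list is simply copied from those observations.

```python
# Characters from the list above, in the order Finder was observed to show them.
# Escapes are used for the non-ASCII ones: U+00BB, U+25B7, U+1433.
finder_order = [" ", "_", "\u00bb", "\\", "\u25b7", "A", "z", "\u1433"]

# Python compares strings code point by code point, so sorted() yields the
# order one would get from "just using Unicode values".
codepoint_order = sorted(finder_order)

for ch in codepoint_order:
    print(f"U+{ord(ch):04X} {ch!r}")

# The two orders differ, so Finder is clearly not sorting by code point.
print("same order as Finder:", codepoint_order == finder_order)
```

Sorted by code point, "A" lands right after the space and "▷" lands last, which is not what Finder shows.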

Note: I see that the Technote I linked to refers to the "Unicode Collation Algorithm", which might specify the answer somewhere in its depths. If that is the case, then I suppose that raises the question of why the Unicode Consortium chose such an unexpected algorithm.

OS X Yosemite (10.10.5)

Posted on Jun 29, 2016 11:29 PM

Answer 1

Tom Gewecke

Level 10

122,444 points

Jun 30, 2016 5:43 AM in response to Richard Wood (bis)

Richard Wood (bis) wrote:

what was wrong with just using Unicode values?

Those values are often arbitrary and the result of a couple decades of historical accretion. Have a look at all the different values in the "Digits" category, which logically should sort together. Emoji have very high numbers because they are the last characters created, but they belong before Latin with other symbols when sorting.

Emoji are in the General Symbol category if you scroll down far enough.
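As a small illustration of how scattered the "Digits" are, the following Python sketch uses the standard `unicodedata` module to look at four versions of the digit one. The particular four characters are just my picks for the example; many more scripts have their own digits.

```python
import unicodedata

# Four "digit one" characters from different scripts (an arbitrary sample):
# ASCII, Arabic-Indic, Devanagari, and fullwidth.
digit_ones = ["1", "\u0661", "\u0967", "\uff11"]

for ch in digit_ones:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),  # general category; "Nd" means decimal digit
        unicodedata.digit(ch),     # the numeric value of the digit
    )

# All four are decimal digits with value 1, yet their code points are far apart.
categories = {unicodedata.category(ch) for ch in digit_ones}
values = {unicodedata.digit(ch) for ch in digit_ones}
code_points = [ord(ch) for ch in digit_ones]
```

Sorting by code point would put whole alphabets between these characters, even though they all mean "1".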

Answer 2

Tom Gewecke

Level 10

122,444 points

Jun 30, 2016 5:44 AM in response to Richard Wood (bis)

Richard Wood (bis) wrote:

This would especially be true as highly similar character sets are added and which, according to usage should be intermixed with earlier Unicode submissions.

Is this close to the rationale?

Yes, the numbers are historical, but sorting should be logical.

I think the Mac and other OSes use the ICU libraries for this task:

http://site.icu-project.org

Answer 3

Tom Gewecke

Level 10

122,444 points

Jun 30, 2016 5:05 AM in response to Richard Wood (bis)

Your "Note" has the answer. Unicode sorting is not determined by the codepoint values as such, it is determined by the Algorithm. This can be pretty complex, but I think one easy way to have things sort after Latin is to use a Greek character like µ (option m).

http://unicode.org/charts/collation/

Answer 4

Jun 30, 2016 5:05 AM in response to Tom Gewecke

Ah, interesting.

I suppose there's some rationale behind this algorithm. I can see how it works, but what was wrong with just using Unicode values? I'm sure the answer to that is buried somewhere in the logic of the algorithm, but it wasn't readily apparent.

I'd quickly realized that Greek letters would sort below Latin, but I'm a math teacher and those are far from meaningless to me. The nice thing about simple geometric shapes is that they hold very little semantic value (or are so overloaded that it is easy to ignore it).

I was satisfied to discover some buried down below. Having keyboard access to one or two might be nice, but these days I'm used to summoning the Characters palette all the time for emoji anyway, and I've added a few to my "favorites" there.

Oh, I just noticed that emoji can be used in filenames as well. The first one I tested seems to come before Currency, Digits and Latin, but after General Symbols. Curiouser and curiouser — the table you point to doesn't seem to discuss emoji?

Answer 5

Jun 30, 2016 5:33 AM in response to Richard Wood (bis)

On further pondering, I realized what might be the rationale.

The world's alphabets (or character sets) could only gradually be added to the set of Unicode symbols, with no assurance that they'd be complete or even added in the proper order, so Unicode itself wouldn't be expected to be properly sorted.

This would especially be true as highly similar character sets are added and which, according to usage should be intermixed with earlier Unicode submissions.

Is this close to the rationale?

I'd guess this is usually implemented with a hash table lookup instead of nested switch/case statements?
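To show what I mean by a table lookup, here is a toy Python sketch. It is emphatically not the real Unicode collation data: the weights in `TOY_WEIGHTS` and the helper `toy_sort_key` are invented for this example, chosen only so that the handful of characters from my original list come out in the order I observed in Finder. The real algorithm's table is far larger and assigns several levels of weight per character.

```python
# Toy weight table (hypothetical values, NOT real collation weights).
# Covers only the punctuation and symbols from the list in the question.
TOY_WEIGHTS = {" ": 1, "_": 2, "\u00bb": 3, "\\": 4, "\u25b7": 5}

def toy_sort_key(name):
    """Build a sort key by looking each character up in the table."""
    key = []
    for ch in name:
        if ch in TOY_WEIGHTS:
            key.append((0, TOY_WEIGHTS[ch]))   # listed symbols sort first
        elif ch.isascii() and ch.isalnum():
            key.append((1, ord(ch.lower())))   # then ASCII digits and letters, ignoring case
        else:
            key.append((2, ord(ch)))           # everything else sorts last, by code point
    return key

names = ["zebra", "\u1433 later", "_first", "Apple", "\u25b7 shape", " space"]
ordered = sorted(names, key=toy_sort_key)
for name in ordered:
    print(repr(name))
```

The point is only that the sort key comes from a lookup rather than from the code point itself; any character missing from the toy table falls through to the last bucket.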

Answer 6

Tom Gewecke

Level 10

122,444 points

Jun 30, 2016 5:49 AM in response to Richard Wood (bis)

Richard Wood (bis) wrote:

Ah, I'd arbitrarily chosen one and searched for it, but I'd used what appears to be some sort of alternate — I'd searched for "D83D"

When operating in a UTF-16 encoding framework, Unicode values with more than 4 hex digits (that is, code points above FFFF) are represented by two 4-digit values in succession, known as a surrogate pair (in this case d83d debe). An example is the Apple Unicode Hex Input source, where you should be able to input 1f6be by holding down Option and keying d83ddebe.
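For anyone curious about where d83d debe comes from, the arithmetic can be sketched in a few lines of Python. The function name `to_surrogate_pair` is just a label I made up for this example; the last lines cross-check the result against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points from U+10000 to U+10FFFF use a surrogate pair")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F6BE)
print(f"{high:04x} {low:04x}")  # d83d debe

# Cross-check: encode the same character as big-endian UTF-16.
encoded_hex = "\U0001F6BE".encode("utf-16-be").hex()
print(encoded_hex)  # d83ddebe
```

This also explains why a search for "D83D" alone turns up many different emoji: it is only the first half of the pair and is shared by every code point in the same 1,024-character range.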

Answer 7

Jun 30, 2016 5:43 AM in response to Tom Gewecke

Ah, I'd arbitrarily chosen one and searched for it, but I'd used what appears to be some sort of alternate — I'd searched for "D83D" instead of "1F6BE".
