String comparison with diacritics not working in PHP?

Hello everyone, I've been wondering why string comparison with diacritics (a.k.a. special characters like ěščřžý) does not work properly in PHP on macOS. I also work with Rocky Linux and Windows, and on both of them string comparison in PHP works as expected.


I'm reading the string (with diacritics) from MySQL database or from a file path. Then I try comparing it with strpos(), strcmp() and also with mb_strpos() and mb_strcmp() and none of them works.


Curiously, I've found 2 workarounds:


1) Compare with a regular expression (preg_match()) and substitute the special character with ".*" or ".{2}", e.g. instead of "SUŠG" I search for "SU.*G" - this works. Interestingly enough, with "SU.{1}G" it DOESN'T work, but with "SU.{2}G" it works again. So I'm thinking the whole issue comes from the system locale or something similar, since the regex engine treats "Š" as two characters instead of one.


2) Before comparison, use the Normalizer::normalize() function. After putting my DB result (or file path) inside the normalize() function, all of the comparison functions work as expected.
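A minimal sketch of this second workaround, assuming the intl extension is loaded (it provides the Normalizer class). The strings are made-up constants standing in for the DB result or file path, not the actual data from the post:

```php
<?php
// Requires the intl extension, which provides the Normalizer class.

// Needle as typed in a UTF-8 source file: precomposed "Š" (U+0160).
$needle = "SU\u{0160}G";

// Hypothetical file name arriving in decomposed form:
// "S" followed by a combining caron (U+030C).
$fromPath = "2023_SUS\u{030C}G_film.mp4";

// Raw comparison fails: the two spellings are different byte sequences.
var_dump(strpos($fromPath, $needle)); // bool(false)

// Normalize to NFC (composed form) first, then compare.
$normalized = Normalizer::normalize($fromPath, Normalizer::FORM_C);
var_dump(strpos($normalized, $needle)); // int(5)
```

Normalizing both sides to the same form (NFC here) is what matters; which form is chosen is a design decision.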


However, even though I've found these two workarounds, neither of them is the preferred way: regexes shouldn't be used when there is an easier (e.g. more efficient) way, and treating every involved line of code with normalize() just because of one OS is insane and unacceptable, since it bloats the code and makes maintenance harder.


Can anyone, please, point me in the right direction? I've tried changing the locale with PHP setlocale( LC_ALL, 'cs_CZ' ), tried changing it in the terminal as well, and tried changing the whole system language from English to Czech, but nothing works.


I've also found that C# code with diacritics (in comments and messages) that I write on Windows (with Visual Studio) shows on my Mac Visual Studio as "?" instead of the original character. I think both of these issues are related.


Do you know what the root of this is and how to fix it?

MacBook Pro 16″, macOS 13.4

Posted on Nov 24, 2023 12:22 PM

Question marked as Top-ranking reply

Posted on Jan 3, 2024 6:16 AM

Hi, thank you for your thorough reply. I've finally got some time to get back to this thread.


So, first things first:

I have to ask where you got PHP.

Indeed, I have PHP from Homebrew, since I didn't find any other way. Do you know of any? I'm using version 8.2.11.


About MySQL:

MySQL is the same problem. Not included with macOS. Where did you get it?

Well, I've installed the official package from here https://dev.mysql.com/downloads/mysql/

Using one of the latest versions - 8.0.27

All of my DB tables and columns are in utf8mb4_unicode_ci


Checking the character encodings of the files themselves and of the tables and columns was really the first thing I did. All my PHP files have been in UTF-8 for the last 15 years or so, since I work with "diacritics" (a.k.a. non-ASCII characters) most of the time.


then

You mentioned file paths too. I strongly recommend that you ignore file paths for now.

Well, file paths are the core of the whole algorithm since it's where everything starts, so for this particular situation, I really can't ignore them. Let me sum up the inner workings of the related parts:


  1. There is a directory structure of 1500+ videos (an archive of school works) on my local machine that is synced over Google Drive, and multiple people access it to make small changes (add, rename, move files).
  2. After I or others make changes, I run PHP scripts that insert new files into the DB, assign authors, sort them by school (2 schools) and branch, etc.
  3. Finally, the data in the DB is used by the in-house web-based tools that allow us to easily group together various short films and order them within these groups. The last step is to send them via JSON to Premiere Pro, where I've got another script that puts all of the videos on the timeline (in the given order), generates intro titles, makes some small tweaks, etc.


System locale has nothing to do with string encodings.


Thanks for making that clear - I made a wrong assumption and then tried to fix the problem in the place that did not have anything to do with the cause. Makes total sense I didn't figure it out that way!


So, here are the observations:


C# script showing question marks

This one was actually easy - for some reason my Windows Unity Editor saved the file in ASCII, so it was enough to change it to UTF-8.


PHP not comparing strings correctly

Probably as a result of reading the strings from file paths, the strings ended up as a sequence of two characters (the letter "S" and a caron "ˇ") instead of one character ("Š"). The regex confirmed this: it matches only if I treat the unknown character as two characters, not one. My assumption is that I can solve the problem by normalizing either just before the string comparison or before I store the strings in the DB.
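To illustrate the "two characters instead of one" effect, here is a small sketch with constant strings. It assumes the mbstring extension for mb_strlen() and uses the /u modifier so that "." matches whole code points; the exact pattern and flags used in the original test are not shown in the thread, so this only demonstrates the general mechanism:

```php
<?php
$composed   = "SU\u{0160}G";   // NFC: "Š" as one code point, U+0160
$decomposed = "SUS\u{030C}G";  // NFD: "S" + combining caron, U+030C

// Both render as "SUŠG", yet they are not the same string.
var_dump($composed === $decomposed);               // bool(false)

// Code point counts differ (mbstring extension assumed).
var_dump(mb_strlen($composed, 'UTF-8'));           // int(4)
var_dump(mb_strlen($decomposed, 'UTF-8'));         // int(5)

// With the /u modifier, "." matches one code point at a time.
var_dump(preg_match('/^SU.{1}G$/u', $composed));   // int(1)
var_dump(preg_match('/^SU.{1}G$/u', $decomposed)); // int(0)
var_dump(preg_match('/^SU.{2}G$/u', $decomposed)); // int(1)
```

Without /u the dot matches single bytes rather than code points, so the counts come out differently (a precomposed "Š" is already two bytes in UTF-8).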


So my new question is: Do you know of any way to have all the strings in PHP normalized globally, without having to treat individual lines of code?

Nov 24, 2023 1:28 PM in response to jankosh

Post a few lines of php code—a reproducer, no database involvement, some constant strings, etc—showing the problem, and folks can take a look at that.
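A reproducer along those lines might look like the sketch below. It is not the poster's code: the two constants are assumed stand-ins for a string typed in source code and one read back from a file path, written with escapes so no editor or copy/paste step can change their form.

```php
<?php
// Two visually identical strings in different Unicode forms.
$a = "\u{0160}";   // "Š" precomposed (U+0160)
$b = "S\u{030C}";  // "S" + combining caron (U+030C)

// Hex dumps make the difference visible regardless of how the text renders.
echo bin2hex($a), "\n"; // c5a0
echo bin2hex($b), "\n"; // 53cc8c

// Byte-wise comparison functions see two different strings.
var_dump(strcmp($a, $b) === 0); // bool(false)
var_dump(strpos($b, $a));       // bool(false)
```

Running bin2hex() on the real strings from the database and from the file path would show which form each source actually delivers.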


Have you installed the current php kit from the php.net site, and are using that?


Semi-related: https://stackoverflow.com/questions/3371697/replacing-accented-characters-php


Some background: https://tonsky.me/blog/unicode/


And yes, you’ll definitely want to normalize: https://www.php.net/manual/en/class.normalizer.php

Jan 3, 2024 8:51 AM in response to jankosh

jankosh wrote:

So my new question is: Do you know of any way to have all the strings in PHP normalized globally, without having to treat individual lines of code?


If you can describe what “normalized globally” means, sure. Probably. Maybe.


You’re headed for Unicode normalization and then comparison in this PHP code.
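One way to read "normalize, then compare" without touching every comparison is to normalize once at the point where strings enter the script. The sketch below assumes the intl extension; the function name listDirNfc() and the array shape are hypothetical, chosen only for illustration:

```php
<?php
// Requires the intl extension (Normalizer). listDirNfc() is a hypothetical helper.

/**
 * List the entries of $dir as pairs:
 *   'raw' => the name exactly as the filesystem returned it (use it to open the file)
 *   'nfc' => the NFC-normalized name (use it for comparing and for storing in the DB)
 */
function listDirNfc(string $dir): array
{
    $entries = [];
    foreach (scandir($dir) ?: [] as $name) {
        if ($name === '.' || $name === '..') {
            continue;
        }
        $nfc = Normalizer::normalize($name, Normalizer::FORM_C);
        // normalize() returns false for invalid input; fall back to the raw name.
        $entries[] = ['raw' => $name, 'nfc' => $nfc === false ? $name : $nfc];
    }
    return $entries;
}
```

If every path passes through a helper like this before it reaches the database, later strpos()/strcmp() calls compare strings that are already in one form, provided the needles are in NFC as well.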


This is one of the earliest well-known descriptions of this particular mess:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-k…


Also see the following, including the section on PHP: https://deliciousbrains.com/how-unicode-works/


Related “fun” is awaiting here: https://en.wikipedia.org/wiki/Unicode_equivalence


As for PHP and diacritics-insensitive comparisons: https://markrailton.com/blog/comparing-strings-that-may-or-may-not-contain-diacritics-in-php


Related: https://www.php.net/manual/en/book.intl.php


And for PHP on macOS: https://www.php.net/manual/en/install.macosx.php


TL;DR: fix your PHP code.


Nov 24, 2023 1:24 PM in response to jankosh

jankosh wrote:

Hello everyone, I've been wondering why string comparison with diacritics (a.k.a. special characters like ěščřžý) does not work properly in PHP on macOS.

There is no PHP on macOS.

I'm reading the string (with diacritics) from MySQL database or from a file path. Then I try comparing it with strpos(), strcmp() and also with mb_strpos() and mb_strcmp() and none of them works.

There really is no such thing as "string (with diacritics)". There is a sequence of bytes that may or may not have a string encoding. I strongly recommend using UTF-8 for said string encoding. Depending on what language you are using and its support for string encodings, those functions may or may not work.


PHP is ancient and isn't included with macOS. So this is an inauspicious start. I have to ask where you got PHP. Sadly, I think I know the answer. Homebrew is a non-stop source of problems here in the forums.


MySQL is the same problem. Not included with macOS. Where did you get it? yada, yada, yada. Now you have two systems that have to agree about string encodings. Both systems support different encodings. What are you using? Versions matter here. If you are using an older version of MySQL, the default character set might be latin1 with the latin1_swedish_ci collation, because you know, it's originally from Sweden. Recent versions (8.0 and later) default to utf8mb4.


You mentioned file paths too. I strongly recommend that you ignore file paths for now. One problem at a time, please.

Curiously, I've found 2 workarounds:

1) Compare with regular expression

God no.

2) Before comparison, use the Normalizer::normalize() function. After putting my DB result (or file path) inside the normalize() function, all of the comparison functions work as expected.

Worry about file paths later. That will complicate your life immensely.

Can anyone, please, point me in the right direction? I've tried changing the locale with PHP setlocale( LC_ALL, 'cs_CZ' ), tried changing it in the terminal as well, and tried changing the whole system language from English to Czech, but nothing works.

System locale has nothing to do with string encodings.

I've also found that C# code with diacritics (in comments and messages) that I write on Windows (with Visual Studio) shows on my Mac Visual Studio as "?" instead of the original character. I think both of these issues are related.

Yes, but only in the sense of string encodings.

Do you know what the root of this is and how to fix it?

You'll need to properly handle your string encodings. How to do that is an exercise for the programmer. It may not be straightforward. You're doing all of this for a website, I assume? You have to identify where your data is coming from and what you are doing with it. Is this user data? That would complicate matters immensely. Once you are dealing with the strings themselves, you have to be careful about how you manage the encoding. A string internal to the language has some internal representation. It is only when you interact with external sources of data where you may have to explicitly specify the encoding. The PHP/MySQL interface may automatically handle the internal representation. I can't say for sure. It's been a long time since I've used PHP.
