C++ Character Encoding Issues in Mac

I'm developing a cross-platform file sync application. On Mac OS X, to get file system events, I read from the /dev/fsevents system buffer and send the events over Unix sockets to another app. I'm not doing any character encoding conversion so far.



This is what the app that receives the FS events prints:

######## File Name ::: ébê123.rtf

######## File Name in WCHAR ::: ébê123.rtf


Code I used to convert char to wchar:


int wCharLen1 = mbstowcs(NULL, fName, 0); // fName is the char string I received through the Unix socket

WCHAR* fileName = new WCHAR[wCharLen1 + 1];

memset(fileName,'\0',(wCharLen1 + 1) *sizeof(WCHAR));

mbstowcs(fileName, fName, wCharLen1);


I'm sending the file name to my server and printing it before the DB insert, which prints the exact file name:

######## Recieved File Name ::: ébê123.rtf

But in DB it inserts the file Name as 'ébê123.rtf'


I'm using the same code on Windows, except that I don't have to do the wchar conversion, because Windows directory monitoring itself gives the file name in wchar. I don't have any issues with the Windows client, and the file name is inserted correctly in the database as 'ébê123.rtf'. I suspect that I'm missing some encoding step before converting char to wchar on Mac. I have tried encoding to UTF-8, but the file names changed to:

######### FileName ::: ébê123.rtf after Encoding TO UTF-8 ::: ébeÌ‚123.rtf [MAC]



Another Case :

When uploading a file from Windows with the above file name 'ébê123.rtf', the file gets downloaded on Mac with the correct file name. When the file is uploaded from Mac, the file name seems to be downloaded correctly on Windows, but as soon as I change anything in that file, the file name is sent as 'e%cc%81be%cc%82123.rtf' to the server, then to Mac. If I originally create the file 'ébê123.rtf' on Windows, it is sent correctly.



I suspect I have to encode the file name as a UTF-8 string before converting char to wchar on Mac. I have tried some open source code like the one below:

void latin1_to_utf8(unsigned char *in, unsigned char *out)
{
    while (*in)
    {
        if (*in < 128)
        {
            *out++ = *in++;
        }
        else
        {
            *out++ = 0xc2 + (*in > 0xbf);
            *out++ = (*in++ & 0x3f) + 0x80;
        }
    }
    *out = '\0';
}


It didn't work. Now I'm looking for a library or some code to convert the string to a UTF-8 string in C++ on Mac. Of course, this function works when the file name is received on Mac from Windows. Any ideas?

MacBook Pro, OS X Mountain Lion (10.8.2)

Posted on Jan 16, 2013 4:22 PM

6 replies

Jan 17, 2013 4:37 AM in response to manisraj

Hello


It seems to me to be an issue of Normalization Form D (NFD) versus Normalization Form C (NFC) of Unicode characters.

E.g.,


é (NFC) = U+00E9 = <c3 a9> (UTF-8)
é (NFD) = e + ´ = U+0065, U+0301 = <65 cc 81> (UTF-8)

ê (NFC) = U+00EA = <c3 aa> (UTF-8)
ê (NFD) = e + ˆ = U+0065, U+0302 = <65 cc 82> (UTF-8)


HFS+ file names are in NFD (decomposed form).

I suspect your DB, and possibly Windows to some extent in your 'another case', are not handling NFD properly.
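
To make the difference concrete, here is a minimal sketch that holds both forms of é as raw UTF-8 bytes and scans a name for combining diacritical marks (U+0300 to U+036F), which is a quick hint that the name is decomposed. The helper name `has_combining_marks` and the two constants are made up for illustration, and the scan assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// The same visible character in both normalization forms, as UTF-8 bytes.
const std::string kEAcuteNFC = "\xc3\xa9";   // U+00E9
const std::string kEAcuteNFD = "e\xcc\x81";  // U+0065 U+0301

// Hypothetical helper: true if the UTF-8 string contains a code point in
// U+0300..U+036F (Combining Diacritical Marks), i.e. one of the byte pairs
// <cc 80>..<cc bf> or <cd 80>..<cd af>. Assumes valid UTF-8 input.
bool has_combining_marks(const std::string& utf8) {
    for (std::size_t i = 0; i + 1 < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
        if (lead == 0xCC && next >= 0x80 && next <= 0xBF) return true;
        if (lead == 0xCD && next >= 0x80 && next <= 0xAF) return true;
    }
    return false;
}
```

Logging the result of such a check next to the file name on the Mac side would show whether the name leaving the client is still decomposed. It only covers the most common combining marks, so it is a diagnostic aid rather than a full normalization test.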


To convert NFD to NFC, you may use iconv(3) or iconv(1).

E.g., using iconv(1)


#!/bin/bash
iconv -f UTF-8-MAC -t UTF-8 nfd.txt > nfc.txt


Here UTF-8-MAC, which should have been properly named UTF-8-NFD, is UTF-8 in NFD.

You should be able to do the same conversion by using iconv(3).


cf.

https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/iconv.1.html

https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/iconv.3.html

https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/iconv_open.3.html


Hope this may help,

H

Jan 17, 2013 11:57 AM in response to Hiroto

Hi Hiroto,


Thanks for the response. I have tried converting the file name from UTF-8-MAC to UTF-8 with iconv, but I'm still getting the same issue in the DB.



File Name : äåéööéåä123.txt


In DB : äåéööéåä123.rtf


Here is the code I used for the iconv conversion:


string MacEncode(string _strToEncode)
{
    iconv_t convDes; /* conversion descriptor */
    convDes = iconv_open("UTF-8-MAC", "UTF-8");
    if (convDes == (iconv_t)(-1))
    {
        cout << "Cannot open iconv converter for utf-8-mac to utf-8 \n";
        return "";
    }

    char* inpStr = (char*) _strToEncode.c_str();

    size_t inpLen = strlen(inpStr);
    size_t outLen = (2*inpLen)+1;

    char* outBuf = new char[outLen];
    memset(outBuf, '\0', outLen);

    char* outBuffer = outBuf;
    int retCode;
    retCode = iconv(convDes, &inpStr, &inpLen, &outBuffer, &outLen);
    if (retCode == -1)
    {
        return "";
    }
    string outputbuf = outBuf;
    return outputbuf;
}



Am I doing anything wrong in the code?



I'm using MySQL Server. How can I make my MySQL server support Normalization Form D? If I can do this, I can fix it, right?




Jan 17, 2013 1:58 PM in response to manisraj

It would be better to use a higher-level interface than /dev/fsevents. You certainly don't have that on Windows or, if you do, you are choosing to use something higher-level. You can do the same thing on a Mac. If you use the FSEvents API, your strings will be in the form of CFStrings. You can work with them there and change the normalization using CFStringNormalize() if you want. You could cast them to NSStrings and work with them in Objective-C too.
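
As a rough illustration of the CFStringNormalize() route, the sketch below converts a UTF-8 name to NFC through a mutable CFString. The function name `to_nfc` is hypothetical. The CoreFoundation path is only compiled on Apple platforms; elsewhere the sketch has no normalizer and simply returns its input, which is a placeholder and not a real fallback.

```cpp
#include <cstddef>
#include <string>
#include <vector>

#if defined(__APPLE__)
#include <CoreFoundation/CoreFoundation.h>
#endif

// Hypothetical helper: returns the NFC form of a UTF-8 string.
// On non-Apple platforms this sketch returns the input unchanged.
std::string to_nfc(const std::string& utf8) {
#if defined(__APPLE__)
    CFMutableStringRef s = CFStringCreateMutable(kCFAllocatorDefault, 0);
    if (s == NULL) return utf8;
    CFStringAppendCString(s, utf8.c_str(), kCFStringEncodingUTF8);
    CFStringNormalize(s, kCFStringNormalizationFormC);

    // Worst-case size of the UTF-8 result, plus the terminating NUL.
    const CFIndex cap = CFStringGetMaximumSizeForEncoding(
        CFStringGetLength(s), kCFStringEncodingUTF8) + 1;
    std::vector<char> buf(static_cast<std::size_t>(cap));
    const bool ok =
        CFStringGetCString(s, buf.data(), cap, kCFStringEncodingUTF8);
    CFRelease(s);
    // If the bytes could not be read back, hand the input back untouched.
    return ok ? std::string(buf.data()) : utf8;
#else
    return utf8;
#endif
}
```

On a Mac this needs to be linked against the CoreFoundation framework. With the FSEvents API the path already arrives as a CFString, so the append step would not be needed there; it is only in this sketch because the original code starts from a char buffer.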

Jan 18, 2013 9:31 AM in response to manisraj

Hello


Yes, your arguments for iconv_open() are reversed!

It is iconv_open(to, from) and so your code should be:


iconv_t convDes = iconv_open("UTF-8", "UTF-8-MAC");


By the way, you can judge whether your problem is NFD-NFC related by testing with a file name such as αβγ.rtf, which contains characters whose NFC and NFD forms are identical and yet lie beyond U+0080.


Good luck,

H


P.S. Also, you should call iconv_close() on your convDes at the end.
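
Putting the two corrections together, a minimal sketch of the conversion could look like the following. The function name `utf8mac_to_utf8` and the buffer sizing are assumptions for illustration. The "UTF-8-MAC" encoding name may not be available outside OS X, in which case iconv_open() fails and the function returns false.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: converts UTF-8 in NFD (as found in HFS+ names) to
// precomposed UTF-8. Returns false if the converter cannot be opened or
// the conversion fails; `out` is only written on success.
bool utf8mac_to_utf8(const std::string& in, std::string& out) {
    iconv_t cd = iconv_open("UTF-8", "UTF-8-MAC");  // order is (to, from)
    if (cd == (iconv_t)(-1)) return false;

    std::vector<char> src(in.begin(), in.end());
    std::vector<char> dst(in.size() * 4 + 16);  // generous for this direction
    char* src_ptr = src.data();
    size_t src_left = src.size();
    char* dst_ptr = dst.data();
    size_t dst_left = dst.size();

    bool ok = iconv(cd, &src_ptr, &src_left, &dst_ptr, &dst_left)
              != (size_t)(-1);
    // Flush any character the converter may still be holding back.
    if (ok) {
        ok = iconv(cd, NULL, NULL, &dst_ptr, &dst_left) != (size_t)(-1);
    }
    iconv_close(cd);  // always release the descriptor

    if (ok) out.assign(dst.data(), dst.size() - dst_left);
    return ok;
}
```

Working on std::vector buffers avoids the manual new[] without a matching delete[], and the descriptor is closed on every path after a successful open. On OS X the program may need to be linked with -liconv.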


Message was edited by: Hiroto (added P.S.)

Jan 21, 2013 2:08 AM in response to Hiroto

Hi Hiroto,


Sorry, I mistyped the code I posted here; in my actual code, I have set the arguments correctly.


I have also tried creating a file named "αβγ_Mac.rtf" in my FS events watch folder, and after converting it from "UTF-8-MAC" to "UTF-8", I got this in my DB:


αβγ_Mac.txt


And thanks for the tip on closing the iconv descriptor; I really missed that one.

Jan 21, 2013 12:24 PM in response to manisraj

Hello


It indicates there's the following conversion taking place:


αβγ = U+03B1, U+03B2, U+03B3
= <ce b1 ce b2 ce b3> (UTF-8)
~ <ce b1 ce b2 ce b3> (ISO-8859-1) = αβγ


which would mean your DB or some intermediate converter is interpreting the UTF-8 byte sequence as an ISO-8859-1 (Latin-1) byte sequence.

This is a different problem from the NFD-NFC issue.
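
The effect can be reproduced with a small sketch that does on purpose what the misbehaving component seems to do: it treats every input byte as one ISO-8859-1 character and encodes that character as UTF-8. The function name `reinterpret_as_latin1` is made up for illustration. Feeding it the UTF-8 bytes of α, <ce b1>, yields <c3 8e c2 b1>, which a UTF-8 viewer displays as Î±.

```cpp
#include <string>

// Hypothetical helper: interprets each input byte as an ISO-8859-1
// character (U+0000..U+00FF) and returns the UTF-8 encoding of the result.
// Applying it to text that is already UTF-8 reproduces the mojibake.
std::string reinterpret_as_latin1(const std::string& bytes) {
    std::string out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);  // ASCII passes through unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte c2 or c3
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

If the bytes stored in the DB match the output of this function for a given name, the name was valid UTF-8 when it arrived and was converted a second time somewhere along the way.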


First, you'd better check that your MySQL server is properly configured to use UTF-8 as the character set.


That's all for now.

Good luck,

H
