Non-US-ASCII filenames in tar, scp, etc.

I've noticed that filenames my Debian Linux box can deal with fine seem to cause problems for OS X. The scp and tar commands supplied with OS X can't write these files to disk. In particular, I see this for filenames containing certain European characters, i.e. characters with accents, umlauts, etc.

braeburn:/Users/pronovic> scp otherbox:tmp/15* .
./15.L?ndler in_G,_Op.9,_No.21.ogg: Invalid argument

braeburn:/Users/pronovic> tar -ztvf music.tar.gz
tar: Record size = 8 blocks
-rw-rw-r-- pronovic/users 643100 2005-08-15 01:05:33 15.L\344ndler in_G,_Op.9,_No.21.ogg

braeburn:/Users/pronovic> tar -zxvf music.tar.gz
15.L\344ndler in_G,_Op.9,_No.21.ogg
tar: 15.L\344ndler in_G,_Op.9,_No.21.ogg: Cannot open: Invalid argument
tar: Error exit delayed from previous errors

This is rather frustrating. I'm not quite sure what to do with it.

I've tried using the tar implementation from Fink as well as the tar library in Python 2.3, all with the same result. I've also tried changing the locale, setting LANG=en and LC_ALL=en_US, just like on the Debian box. No dice.

Anyone have a suggestion as to how I can make this work? Mac OS X has a UTF-8-capable filesystem, so it should be possible. I hope I am just missing something obvious.

Thanks,

KEN

Posted on Aug 22, 2005 1:45 PM

Sep 7, 2005 10:46 PM in response to Kenneth Pronovici

Same problem here. I am unable to paste touch ~/Desktop/tester_éü.txt into the terminal.
Changing the window setting to "Escape non-ASCII characters" makes
the paste work. You will see
touch ~/Desktop/tester_\303\251\303\274.txt
which will generate the desired file name in Finder. The funny thing is that dragging that file onto the terminal window generates an entirely different path:
~/Desktop/tester_e\314\201u\314\210.txt
but touching that path again generates the correct file name. Oh, the beauty of Unicode in UNIX!
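For what it's worth, the two escaped paths look like the same name in two Unicode normalization forms: precomposed (NFC) for the pasted one and decomposed (NFD) for the dragged one. Here is a rough Python 3 sketch that reproduces both byte sequences; the helper name octal_escape is made up for illustration, and this only shows the encodings, not what Finder or the terminal actually do internally.

```python
import unicodedata

def octal_escape(text):
    # Encode as UTF-8 and show every non-printable or non-ASCII byte
    # as a three-digit octal escape, like the terminal setting does.
    pieces = []
    for b in text.encode("utf-8"):
        if 0x20 <= b < 0x7F:
            pieces.append(chr(b))
        else:
            pieces.append("\\%03o" % b)
    return "".join(pieces)

name = "tester_\u00e9\u00fc.txt"  # e-acute and u-umlaut

composed = unicodedata.normalize("NFC", name)    # one code point per accented letter
decomposed = unicodedata.normalize("NFD", name)  # base letter + combining mark

composed_escaped = octal_escape(composed)
decomposed_escaped = octal_escape(decomposed)

print(composed_escaped)    # tester_\303\251\303\274.txt
print(decomposed_escaped)  # tester_e\314\201u\314\210.txt

# The two strings differ byte for byte but normalize to the same text.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```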

Now my problem is that I want to copy files from a shell script, and this doesn't work with accented characters. Is there

a) a way to convert the path programmatically into an "escaped" form?

or

b) a way to lexically strip the accents, again from a Unix script?
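Both (a) and (b) can be sketched in a few lines of Python 3 that a shell script could call. This is only a sketch under assumptions: the function names ansi_c_quote and strip_accents are made up here, the quoting targets bash's $'...' syntax, and the input path is assumed to be a proper Unicode string.

```python
import unicodedata

def ansi_c_quote(path):
    # (a) Build a bash $'...' string in which every non-ASCII or
    # non-printable byte of the UTF-8 encoding is an octal escape.
    parts = []
    for b in path.encode("utf-8"):
        ch = chr(b)
        if ch in ("\\", "'"):
            parts.append("\\" + ch)      # protect backslash and single quote
        elif 0x20 <= b < 0x7F:
            parts.append(ch)             # printable ASCII stays as-is
        else:
            parts.append("\\%03o" % b)   # everything else becomes \ooo
    return "$'" + "".join(parts) + "'"

def strip_accents(text):
    # (b) Decompose each character, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

path = "tester_\u00e9\u00fc.txt"
print(ansi_c_quote(path))   # $'tester_\303\251\303\274.txt'
print(strip_accents(path))  # tester_eu.txt
```

Note that strip_accents only removes combining marks, so letters with no decomposition (ß or ø, for example) pass through unchanged.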

Sep 30, 2005 5:35 AM in response to Kenneth Pronovici

Hi Kenneth,
\303\244 is the octal representation of the two bytes 0xC3 0xA4, which is the multibyte UTF-8 encoding of Unicode character 228 (U+00E4, ä). \344 is simply the octal representation of 228 as a single byte, which is how ISO-8859-1 (Latin-1) stores that character; it is not UTF-8. Thus, the problem appears to lie in the Debian box not using UTF-8 for file names. It may help to use the following settings on the Debian box:

LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
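
To see the difference between the two byte sequences, and to re-encode a name that is already stored the wrong way, something like the following Python 3 sketch may help. It assumes the old names are ISO-8859-1 (which the \344 byte is consistent with, but which is still a guess), and the function name to_utf8_name is made up for illustration.

```python
name = "15.L\u00e4ndler"  # shortened stand-in for the file name in the archive

latin1_bytes = name.encode("iso-8859-1")  # a-umlaut becomes the single byte 0xE4 (octal 344)
utf8_bytes = name.encode("utf-8")         # a-umlaut becomes 0xC3 0xA4 (octal 303 244)

print(latin1_bytes)  # b'15.L\xe4ndler'
print(utf8_bytes)    # b'15.L\xc3\xa4ndler'

def to_utf8_name(raw):
    # Leave names that are already valid UTF-8 alone; otherwise
    # assume ISO-8859-1 (an assumption) and re-encode as UTF-8.
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1").encode("utf-8")

print(to_utf8_name(latin1_bytes) == utf8_bytes)  # True
print(to_utf8_name(utf8_bytes) == utf8_bytes)    # True
```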

--
Gary
~~~~
If you don't go to other men's funerals they won't go to
yours.
-- Clarence Day
