Non-US-ASCII filenames in tar, scp, etc.

I've noticed that filenames my Debian Linux box can deal with fine seem to cause problems for OS X. The scp and tar commands supplied with OS X can't write these files to disk. In particular, I see this for filenames containing certain European characters, i.e. characters with accents, umlauts, etc.

braeburn:/Users/pronovic> scp otherbox:tmp/15* .
./15.L?ndler in_G,_Op.9,_No.21.ogg: Invalid argument

braeburn:/Users/pronovic> tar -ztvf music.tar.gz
tar: Record size = 8 blocks
-rw-rw-r-- pronovic/users 643100 2005-08-15 01:05:33 15.L\344ndler in_G,_Op.9,_No.21.ogg

braeburn:/Users/pronovic> tar -zxvf music.tar.gz
15.L\344ndler in_G,_Op.9,_No.21.ogg
tar: 15.L\344ndler in_G,_Op.9,_No.21.ogg: Cannot open: Invalid argument
tar: Error exit delayed from previous errors

This is rather frustrating. I'm not quite sure what to do with it.

I've tried using the tar implementation from Fink as well as the tar library in Python 2.3, all with the same result. I've also tried changing the locale, setting LANG=en and LC_ALL=en_US, just like on the Debian box. No dice.

Anyone have a suggestion as to how I can make this work? Mac OS X has a UTF-8-capable filesystem, so it should be possible. I hope I am just missing something obvious.

Thanks,

KEN

Posted on Aug 22, 2005 1:45 PM

18 replies

Aug 23, 2005 4:02 AM in response to Kenneth Pronovici

Hi

I too would be interested in the solution to your problem.

What did you do to make your Debian machine recognize umlauts? I have noticed that if I save a Vim document on the Debian machine with an umlaut in the filename, it's not a problem, but umlauts don't work on the file area provided by Samba, and my Mac OS X machine doesn't seem to like umlauts when I do an ls in the Terminal or look at a directory via a web browser. The Finder handles umlauts OK, though. Weird stuff 🙂

Aug 23, 2005 8:07 AM in response to snigelman

On the Debian box, setting the locale was all I needed to do. Check out 'man locale' for some more details. Like I mention in my original post, I set $LANG and $LC_ALL, and then a few other $LC_ variables (which in all honesty are probably overridden by $LC_ALL).
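For reference, the override order guessed at above is the standard POSIX one: a non-empty LC_ALL wins over the per-category LC_* variables, which in turn win over LANG. Here is a minimal Python sketch of that lookup. The helper name `effective_locale` is hypothetical, and the final fallback to "C" is an assumption about the system default.

```python
import os


def effective_locale(category, env):
    """Return the locale name that applies to one category, e.g. "LC_CTYPE".

    Mirrors the POSIX precedence: LC_ALL, then the category's own
    variable, then LANG. Empty values are treated as unset.
    """
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:
            return value
    return "C"  # assumed default when nothing is set


if __name__ == "__main__":
    # LC_CTYPE is the category that governs character encoding.
    print(effective_locale("LC_CTYPE", os.environ))
```

If this reports a locale without a UTF-8 suffix on one machine and one with it on the other, the two shells are probably producing different bytes for the same typed character.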

I agree, it's strange that it's just the command-line tools that don't seem to like these characters. It makes me think I have something broken in my shell environment. I can't find any useful information on the subject, though (I spent most of two hours googling yesterday afternoon).

KEN

Aug 23, 2005 8:12 AM in response to nobody loopback

Can you clarify? Does it work for you on the original filesystem, or only on the UFS filesystem? Does it work in the finder, from the command-line, or both?

Can you think of anything you changed in the environment, or via System Preferences, or in your ~/.profile, etc. that I might try, or are you confident that it always worked for you out-of-the-box?

Thanks for the reply.

KEN

Aug 23, 2005 9:56 AM in response to CloitusDisruptus

Fascinating!

I'm not entirely sure how to enter a filename like that (how to compose the characters). If I cut-and-paste from your post, however, I get somewhere. Pasting into a Terminal running kshell (/bin/ksh) doesn't work. Pasting into a Terminal running bash (/bin/bash) does work.

Once I've created the file, I can also create a tar file of the newly touched file and extract it under either bash or kshell on OS X. That's a step in the right direction. It implies that the filesystem does understand how to deal with the special characters.

If I scp the tar file back to Linux and extract it there, however, the result doesn't look right. It comes out like this:

tester_eÌ?uÌ?.txt

It gets more interesting. If I then attempt to scp this file off the Linux box back to OS X, it works! The resulting file is called tester_éü.txt just like the original one!

This is actually beginning to make some sense. This behavior implies that Linux and OS X encode these special characters differently, causing the problem. OS X is reported to have a different Unicode encoding method than other operating systems, sometimes using two bytes where other operating systems use one byte. That seems to match up with what I see.
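The odd-looking name on the Linux side is consistent with Unicode normalization rather than with a different encoding. The Python sketch below rests on two assumptions: that HFS+ stored the name in decomposed form (base letter followed by a combining accent), and that the Linux terminal displayed the bytes as ISO-8859-1. Under those assumptions, each accent becomes the byte 0xCC (shown as Ì) followed by a control byte that a terminal would typically print as "?".

```python
import unicodedata

# The test filename, written with precomposed e-acute and u-umlaut.
name = "tester_\u00e9\u00fc.txt"

# NFC keeps one code point per accented letter; NFD splits each into
# a base letter plus a combining mark.
composed = unicodedata.normalize("NFC", name)
decomposed = unicodedata.normalize("NFD", name)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)    # b'tester_\xc3\xa9\xc3\xbc.txt'
print(decomposed_bytes)  # b'tester_e\xcc\x81u\xcc\x88.txt'

# How the decomposed bytes look if a terminal treats them as ISO-8859-1:
# 0xCC is "Ì", while 0x81 and 0x88 are non-printing control characters.
misread = decomposed_bytes.decode("iso-8859-1")
print(repr(misread))
```

Both byte strings are valid UTF-8 for the same visible text, which would explain why the file round-trips to OS X intact.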

Now if I could just figure out what to do about it...

KEN

Aug 23, 2005 4:05 PM in response to Kenneth Pronovici

Hi Ken,

You're right, that does make sense. Unfortunately, I don't know the solution. If the same Unicode or UTF-8 code is handled differently on different platforms, then I guess I'm not that surprised... Some common ground needs to be found.

To get accented characters in Terminal you need to type a key sequence: option/alt + a modifier key, then the actual letter. For example, option + e, then the 'a' key, gives á. The 'u' modifier gives the umlaut. I can't remember the rest, I'm afraid, but you can probably find them by trial and error or on the net in no time. You can also get these characters from the 'Character Palette', which is found at the bottom of the Edit menu in most Cocoa apps (it doesn't seem to show the key sequence, though, which is a shame, because it's pretty impressive all the same).

Aug 23, 2005 4:07 PM in response to Kenneth Pronovici

I tried this out on an HFS+ filesystem.
I recreated a file with the same name as in your example on a Linux machine (SuSE Linux), then compressed it to test.tar.gz.
Then I unpacked it on my OS X machine (OS X 10.4.2).
The "ä" umlaut shows correctly in the Finder. In the shell (tcsh) it appears only as "??".
In the Terminal's preferences the character set is set to UTF-8.
The terminal type is VT100.
I don't think there is anything special in my environment.

Aug 23, 2005 4:21 PM in response to nobody loopback

If you want to try out my test.tar.gz:
The test.tar.gz containing a file with the name Ländler...
uuencoded (I hope this survives the transfer through the discussion board)
<code>
begin-base64 644 test.tar.gz
H4sIAE/gCkMAA+3OUQqCQBDG8X3eU3iAkFHXlk7QS1Q3WBY2TZAUtRt1ky7WBvVoT2Iv/x8zfAwM
zByej1toL4Nrbm6/cadzunPHLs2ztKtrtQzJRLbGKImyT0bfFCmsyqSQvCytNXncLwtrVCIL3f/p
Pk5 SBLVX5u26fvZvaHrpjXWZkPbah8GCuvY42V9uHdOo46aP3v9wAAAAAAAAAAAAAAAAAAM160
/tNcACgAAA==
====

</code>

Aug 23, 2005 5:38 PM in response to Kenneth Pronovici

Ken

"This behavior implies that Linux and OS X encode these special characters differently"


Yes, but you go on to say "sometimes using two bytes where other operating systems use one byte", which I am not so sure about: I suspect most operating systems will use two bytes for these.

These multibyte characters can be represented in more than one way in Unicode. OS X uses canonical ordering, which obviously Linux doesn't. But OS X understands the other ordering, so for example 'ls' will work to display a file or folder name. But it cannot then use that name to look up the file, since it turns the displayed characters back to the canonical ordering, and the bytes do not match.

The Finder and GUI use the old Mac OS concept of a File ID, which has no corresponding feature in Unix, and so has no trouble moving these files and folders about. But for any operation involving a directory lookup, which requires a file name, the match fails.

It isn't clear that you can do anything about it, except avoiding multibyte characters.

Aug 23, 2005 6:23 PM in response to Michael Conniff

Well, other operating systems can certainly represent some of these characters with one byte (via national codepages, etc.). In fact, I'm fairly sure that Linux is using a single byte in this case, although I can't provide a definitive reference proving that.

In any case, I think you're right. I probably can't do anything about it, short of rewriting tar or scp to do translation. That's rather disappointing, but I guess I'll have to live with it.
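The translation step does not necessarily require rewriting tar. The sketch below is one possible approach, not something anyone in this thread tried. It relies on the `encoding` argument of `tarfile.open` in Python 3 (not available in the Python 2.3 mentioned earlier) and assumes the member names in the archive are ISO-8859-1. The function name `extract_legacy_names` is hypothetical. Names are decoded to Unicode on read, so the files are written using the local filesystem's own encoding.

```python
import os
import shutil
import tarfile


def extract_legacy_names(archive_path, dest_dir, name_encoding="iso-8859-1"):
    """Extract regular files from a tar archive with 8-bit member names.

    name_encoding is an assumption about how the names were stored.
    Returns the decoded member names that were written.
    """
    written = []
    with tarfile.open(archive_path, mode="r:*", encoding=name_encoding) as archive:
        for member in archive:
            parts = member.name.split("/")
            # Skip anything that is not a plain file or that tries to
            # escape the destination directory.
            if not member.isfile() or member.name.startswith("/") or ".." in parts:
                continue
            target = os.path.join(dest_dir, *parts)
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(member.name)
    return written
```

A call such as `extract_legacy_names("music.tar.gz", "music")` would be the equivalent of the failing `tar -zxvf` above. Permissions, timestamps, links, and directories are deliberately ignored to keep the sketch short.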

KEN

Aug 23, 2005 6:33 PM in response to nobody loopback

Aha!

You have a two-byte sequence for the ä character:

braeburn.local:/Users/pronovic> tar -zxvf test.tar.gz
L\303\244ndler in_G,_OP.9No.21.ogg

However, my file has a single byte for that character:

braeburn.local:/Users/pronovic> tar -zxvf music.tar.gz
15.L\344ndler in_G,_Op.9,_No.21.ogg
tar: 15.L\344ndler in_G,_Op.9,_No.21.ogg: Cannot open: Invalid argument
tar: Error exit delayed from previous errors

This means that you created the file differently in your SuSE environment than I did in my Debian environment. How did you compose the filename? Can you tell me what you have $LANG and $LC_ALL set to? Perhaps the locale is influencing the way the filename is constructed.

I feel like I'm getting closer. I think the solution might be to make sure that any characters outside US-ASCII are encoded as multi-byte characters so that they can be used on OS X.
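The two tar listings can be checked directly. In the Python sketch below, the octal escapes are copied from the tar output above. The helper names `is_valid_utf8` and `reencode_name` are hypothetical, and ISO-8859-1 as the source encoding of the single-byte name is an assumption (the byte 0xE4 is ä in that encoding).

```python
# Byte strings as tar printed them (Python bytes literals accept octal escapes).
single_byte_name = b"L\344ndler"      # from music.tar.gz
two_byte_name = b"L\303\244ndler"     # from test.tar.gz


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def reencode_name(raw, source_encoding="iso-8859-1"):
    """Convert a legacy 8-bit name to UTF-8; leave valid UTF-8 untouched."""
    if is_valid_utf8(raw):
        return raw
    return raw.decode(source_encoding).encode("utf-8")


print(is_valid_utf8(single_byte_name))  # False
print(is_valid_utf8(two_byte_name))     # True
print(reencode_name(single_byte_name) == two_byte_name)  # True
```

So the single-byte name is not valid UTF-8 at all, which would fit the "Invalid argument" error, while the two-byte name is the UTF-8 encoding of the same character.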

Aug 24, 2005 4:16 AM in response to Kenneth Pronovici

I tried another experiment:
This time I created the test file over ssh from Windows rather than over ssh from my Mac, then tarred and gzipped the file again and copied it to my Mac.

Now I have the same problem as you.
It seems (and of course this is how it should work) that the way the shell's encoding is set up also influences how a filename is stored.

The only way around this that I have found is my workaround:
Create a UFS-formatted disk image.
Unpack the file there.
Use the Finder or cp to copy the file to the HFS+ disk.
You are done.

