UTF-8 in terminal (bash)

Hi,

I have some files with UTF-8 encoding, and I wrote some shell scripts for filtering and managing their contents. However, the output files the scripts create are not UTF-8 encoded, and some of the characters are unreadable.

Is there a way to tell bash to use UTF-8 as default encoding? I tried

export LANG=en_us.UTF-8

but no luck!
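A quick way to check which locale names are actually installed, and to set one for the whole environment, is sketched below (note that the usual installed name is en_US.UTF-8, with a capital "US" -- the name must match an installed locale exactly):

```shell
# List the installed locales and pick out the UTF-8 ones;
# the value you export must match one of these names exactly.
locale -a | grep -i 'utf'

# LC_ALL overrides LANG and every individual LC_* variable.
export LC_ALL=en_US.UTF-8
locale    # every LC_* entry should now read en_US.UTF-8
```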

Thanks in advance...

MacBook Pro 2.16, iBook G4 1.33, Mac OS X (10.4.8), wireless Mighty Mouse and 1/2 iPod nano black ;)

Posted on Dec 19, 2006 11:13 PM

13 replies

Dec 22, 2006 11:35 AM in response to Stanley Williams

This is by default UTF-8 in mine also. However, when I do:

cat myUTF-8file | some_operations here > new_file

After opening new_file, I see that all UTF-8 characters have disappeared! Also, the output of

cat myUTF-8file

does not display non-English characters.

> In Terminal's "Window Settings > Display", "Character Set Encoding" allows UTF-8 as an option. It was set that way on mine by default. If you have that set then you may need to use escape characters for the script to work properly (just guessing on this).

MacBook Pro 2 GHz Core Duo, Mac OS X (10.4.8)

Dec 23, 2006 5:32 AM in response to hsaybasili

So you mean in "Windows Settings (aka Terminal Inspector) > Display",
UTF-8 is selected for "Character Set Encoding"?

Then which font are you using in the same "Display" setting?
(most fonts should work, but try Monaco if you are using other fonts).

In bash, please try

echo $'\xc3\xa9'

This should produce an é (e with acute). What do you get?
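A lower-level check is to look at the raw bytes rather than the glyph; é should arrive as the two-byte sequence C3 A9. A sketch using od (the octal escapes below are the portable way to write those bytes for printf):

```shell
# é (U+00E9) encodes in UTF-8 as the two bytes C3 A9.
# \303\251 is that same sequence written as octal escapes,
# which printf understands on any POSIX system.
printf '\303\251\n' | od -An -tx1    # shows: c3 a9 0a
```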

PowerMac G4 Mac OS X (10.4.8)

Dec 23, 2006 9:30 AM in response to hsaybasili

Hi hsaybasili,

> cat myUTF-8file | some_operations here > new_file

It is impossible to know what is being done to the file contents by your "some_operations" so we can't say anything about the result. However, it is ominous that the simple cat command fails:

> cat myUTF-8file

On my system, this faithfully reproduces Unicode characters even in a bash shell for which the only hint at encoding is that LC_ALL has been set to 'C'. I gather that the thing that makes it work is that the "Character Set Encoding:" has been set to UTF-8 in the "Display" pane of the Terminal's "Window Settings..." "Terminal Inspector" dialog. However, you indicate that this didn't help on your system, so I'm forced to question your myUTF-8file.

Are you sure that your file is still encoded in UTF-8? You can check by opening the file in vim and executing:

:set fileencoding?
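You can also validate the file from the shell; iconv (which ships with Mac OS X) exits with an error on the first byte sequence that is not legal UTF-8, so a UTF-8-to-UTF-8 round trip doubles as a validity check. A sketch:

```shell
# iconv fails (nonzero exit) on the first invalid UTF-8 sequence,
# so converting UTF-8 to UTF-8 is effectively a validity test.
if iconv -f UTF-8 -t UTF-8 myUTF-8file > /dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```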
--
Gary
~~~~
Nitwit ideas are for emergencies. You use them when
you've got nothing else to try. If they work, they go in the
Book. Otherwise you follow the Book, which is largely
a collection of nitwit ideas that worked.
-- Larry Niven, "The Mote in God's Eye"

Dec 23, 2006 10:50 AM in response to Jun T.

I got the é on the display...

> So you mean in "Windows Settings (aka Terminal Inspector) > Display", UTF-8 is selected for "Character Set Encoding"?
>
> Then which font are you using in the same "Display" setting? (most fonts should work, but try Monaco if you are using other fonts).
>
> In bash, please try
>
> echo $'\xc3\xa9'
>
> This should produce an é (e with acute). What do you get?

Dec 23, 2006 10:55 AM in response to Gary Kerbaugh

> It is impossible to know what is being done to the file contents by your "some_operations" so we can't say anything about the result.


What I am doing is:

cat file.txt | cut -c"\"" -f2 | grep keyword > newfile.txt

> On my system, this faithfully reproduces Unicode characters even in a bash shell for which the only hint at encoding is that LC_ALL has been set to 'C'. I gather that the thing that makes it work is that the "Character Set Encoding:" has been set to UTF-8 in the "Display" pane of the Terminal's "Window Settings..." "Terminal Inspector" dialog. However, you indicate that this didn't help on your system so I'm forced to question your myUTF-8file.


This is strange! I tried with another UTF-8 html file of mine, and it could display the characters! I will check the encoding, maybe there was some error with the files...

> Are you sure that your file is still encoded in UTF-8? You can check by opening the file in vim and executing:
>
> :set fileencoding?


I just tried it, and the result is nothing:

fileencoding=






MacBook Pro 2.16, iBook G4 1.33, Mac OS X (10.4.8), wireless Mighty Mouse and 1/2 iPod nano black 😉

Dec 23, 2006 2:39 PM in response to hsaybasili

Hi hsaybasili,
Maybe your newfile.txt contains only the error message from the "cut" command. The "-c" option selects character positions while the "-f" option selects by field number. The two are incompatible and cannot be used together. Of course this error message contains only ASCII characters.

The fact that vim suggests no encoding suggests to me that the multibyte characters have already been butchered and vim thinks the entire file consists of single-byte characters. Single-byte encodings can't be differentiated so I guess vim doesn't make a choice. This only fuels my doubts as to whether the file is still encoded as UTF-8.
--
Gary
~~~~
And it should be the law: If you use the word `paradigm'
without knowing what the dictionary says it means, you
go to jail. No exceptions.
-- David Jones

Dec 23, 2006 4:37 PM in response to hsaybasili

Hi hsaybasili,
I have one more thought that is a reach. The HFS+ filesystem, at the deepest level, uses a "decomposed" format, where accented characters consist of the character being accented and the "combining" accent character, in that order. The combining accent character is above ASCII and is encoded in UTF-8 so technically the whole thing is "legal UTF-8", just unusual. Maybe your file was created with a text editor that uses this "decomposition" but UNIX text editors don't know it.

Come to think of it, was this file the result of output from the "ls" command? That output is also always "decomposed" so that would explain why UNIX editors might not deal with it as expected.
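The two forms are easy to tell apart at the byte level: precomposed é is a single two-byte character, while the decomposed form is a plain e followed by the combining acute accent. A sketch:

```shell
# precomposed:  U+00E9 alone          -> bytes c3 a9
printf '\303\251\n'  | od -An -tx1
# decomposed:   e followed by U+0301  -> bytes 65 cc 81
printf 'e\314\201\n' | od -An -tx1
```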
--
Gary
~~~~
It is so soon that I am done for,
I wonder what I was begun for.
-- Epitaph, Cheltenham Churchyard

Dec 24, 2006 2:31 AM in response to Gary Kerbaugh

Hi Gary,

Thank you very much for all this valuable information. Let me explain the situation. One of my friends has some old Windows files (approx. 1000 of them) in a Western encoding with Turkish characters. Each line of these files contains one output record of an old DOS program, with all the values inside "" and separated by commas. What he wanted was to convert these files to Microsoft Office format, as he works in Windows. So I wrote a shell script that parses these files and converts them to LaTeX files, adding the required titles and entries (for example, the first field of the files is the name field, say "John Smith"; it will be translated to: \textbf{\textit{name: }} John Smith\\). Then, using latex2rtf, I can convert the result to an RTF file.

I opened them with a text editor and saw that the Turkish characters were unreadable. I replaced these characters (Ä -> Ç, for example) and saved the file in UTF-8 format. Then I ran:

cat -ev file > newfile

so that I could distinguish the ends of lines. The script uses the extra characters that the cat command adds at the end of each line to know when all the data has been taken from a line. Then I ran the script with newfile as the input, but I saw that the output LaTeX files did not have the correct characters.

I tried it just now and saw that the -ev parameter to the cat command causes this error. If I simply use cat, I can see the Turkish characters, but if I use cat -ve I can't.

I think I have to find another way instead of cat -ev.
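cat -v is indeed the culprit: it rewrites every byte above 127 as an M-x escape, which destroys multi-byte UTF-8 sequences. If the script only needs a visible end-of-line marker, sed can append one without touching any other bytes. A sketch that mimics the "$" marker cat -e would have added:

```shell
# In the sed pattern, $ anchors to end-of-line; in the replacement
# it is a literal character.  So this appends "$" to every line
# while leaving all other bytes (including UTF-8) untouched.
sed 's/$/$/' file > newfile
```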

Thank you very much for your help....

haris


Dec 24, 2006 8:17 AM in response to hsaybasili

Hi Haris,
I did a little more research and found that four-byte characters do cause vim to choke, although it said that the file was UTF-8. The "less" utility actually warned me that the file was a binary and showed me a hexdump of the file. Also, naturally "cat -e" mangled the file but "cat" with no argument produced the contents unchanged and the Terminal did a magnificent job of rendering them. It had to pull the characters from a variety of fonts but it did so.

Ah yes, a CSV file; I know them well. The characters you mention (Ä, Ç) are only two-byte characters in UTF-8, so lots of UNIX tools, like vim, should be able to handle them.

The end of the line in text files is marked by special characters that are system-dependent and are ASCII control characters. UNIX uses the newline character (also called a linefeed) and classic Mac used a carriage return. Windoze uses both carriage return and newline, in that order. You can determine which format a particular file uses by opening it with vim and executing:

:set ff?

If lines end in a carriage return and a newline, vim should report the fileformat as dos. If vim shows you the fileformat correctly, you can also use vim to change it by executing the following in vim:

:set ff=unix
:w

You can also change the line endings with Perl. For instance, the following command converts DOS files to UNIX files:

perl -pi -e 's/\r//g;' /<Path>/<to>/<File>

Tools such as grep and cut are line editors, operating one line at a time, so they don't need any special demarcation, but the line endings must be UNIX for them to work correctly. Since Windoze files do have a newline at the end of each line, line editors will work on them, but they will see the carriage return as part of the text of the line, so you'll only get the results you expect if you account for that character in any regular expressions that match to the end of the line. Note that carriage returns also match wildcards that don't specify printable characters.
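When you can't convert the file first, stripping the carriage returns inside the pipeline works too. A sketch using tr (the file name dosfile.txt is hypothetical):

```shell
# tr -d '\r' deletes every carriage return, so the line tools
# downstream see plain UNIX line endings and $-anchored
# patterns match as expected.
tr -d '\r' < dosfile.txt | grep 'keyword$'
```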
--
Gary
~~~~
<Espy> we need to split main into "core" and "***–uses–this"
