Convert csv file from us-ascii to UTF-8

Question

Level 1

14 points

I'm just trying to convert a file. I run this in Terminal:

StudioA:~ StudioA$ file -I /Users/StudioA/Desktop/Mikey_WK37.csv

and the result is

/Users/StudioA/Desktop/Mikey_WK37.csv: text/plain; charset=us-ascii

So I run this

iconv -f US-ASCII -t UTF-8 /Users/StudioA/Desktop/Mikey_WK37.csv > /Users/StudioA/Desktop/new_file.csv

Then, when I check the new file, it is still us-ascii.

Where am I going wrong?

Mac Pro, OS X El Capitan (10.11.1)

Posted on May 5, 2016 6:29 AM

Answer 1

Tom Gewecke

Level 10

115,678 points

May 5, 2016 6:53 AM in response to MattJayC

There is no difference between us-ascii and UTF-8. The former is just a subset of the latter.
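
A quick way to see the subset relationship is to encode the same ASCII-only text both ways and compare the bytes. This is only a sketch in Python 3; the CSV content is made up for illustration.

```python
# Hypothetical ASCII-only CSV content, used purely for illustration.
text = "name,week\nMikey,37\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For text limited to the ASCII range, both encodings yield the same bytes,
# which is why a conversion with iconv leaves such a file unchanged.
print(ascii_bytes == utf8_bytes)  # True
```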

Answer 2

MattJayC Author

Level 1

14 points

May 5, 2016 7:50 AM in response to Tom Gewecke

Can I still convert it? Or at least change the character set, as I have a script that uses the UTF-8 set.

Answer 3

MattJayC Author

Level 1

14 points

May 5, 2016 8:40 AM in response to Tom Gewecke

The line in the script is

if existsCSV then set o's csvText to paragraphs of (read checkListFile as «class utf8») -- get the contents of the CSV file ***

It won't let me change it to us-ascii

Answer 4

Tom Gewecke

Level 10

115,678 points

May 5, 2016 9:09 AM in response to MattJayC

Does your csv file actually include a statement saying it is us-ascii? If so you could just change that to utf-8.

Answer 5

MattJayC Author

Level 1

14 points

May 5, 2016 1:53 PM in response to Tom Gewecke

If I create a file in TextEdit in plain text, type 1 or whatever, save it as 'Western (Mac OS Roman)', and then run file -I theFile, it is reported as us-ascii.

This is how most of the files I get look. I need to use an extra character, "✔", and when you add that to the file, it then asks for it to be saved in another encoding (such as UTF-8).
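
To illustrate why the check mark forces a different encoding, here is a small Python 3 sketch. The helper name can_encode and the sample row are made up for this example; the codec names are Python's standard ones.

```python
CHECK = "\u2714"  # the "✔" character (U+2714)

def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Hypothetical CSV row containing the check mark.
row = "task,done\nbackup," + CHECK + "\n"

print(can_encode(row, "ascii"))      # False: outside the ASCII range
print(can_encode(row, "mac_roman"))  # False: not in the Mac OS Roman set
print(can_encode(row, "utf-8"))      # True
print(CHECK.encode("utf-8"))         # b'\xe2\x9c\x94'
```

Once a character like this is in the file, file -I has non-ASCII bytes to look at and no longer reports us-ascii.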

Tom Gewecke wrote:

Does your csv file actually include a statement saying it is us-ascii? If so you could just change that to utf-8.

If I can just change it, how can I do that in Terminal?

Thanks

Matt

Answer 6

Tom Gewecke

Level 10

115,678 points

May 5, 2016 1:58 PM in response to MattJayC

How about just replacing the text "us-ascii" with "utf-8"?

Answer 7

MattJayC Author

Level 1

14 points

May 5, 2016 2:02 PM in response to Tom Gewecke

Excuse the "correct answer" mark; I clicked it accidentally.

I still don't know how to just change the text.

Answer 8

Tom Gewecke

Level 10

115,678 points

May 5, 2016 2:24 PM in response to MattJayC

Sorry, I don't know much about doing such things in terminal. Isn't there some kind of find/replace operation you can do on a file's contents?

Answer 9

May 6, 2016 1:01 AM in response to MattJayC

Here is a Python program that reads in US-ASCII CSV and outputs it as a UTF-8 quoted field, Excel compatible CSV. The output opens and formats nicely in Numbers v3.6.1, and LibreOffice Calc v5.1.2.2.

Usage: ucsv.py input.csv output.csv

Copy and paste the following Python code into a programmer's editor. If it is Sublime Text 3, then use Paste and Indent. Otherwise, paste into a TextEdit plain text file, and save as ucsv.py. Make the Python script executable in the Terminal.

Test this on a small CSV and open it in a spreadsheet application to verify it works ok for you.

Code:

#!/usr/bin/env python
# coding: utf-8
'''
ucsv.py


Read in a US-ASCII CSV document and write out a quoted field Excel CSV.
Output CSV read correctly by Numbers v3.6.1, LibreOffice Calc 5.1.2.2.


Usage: ucsv.py us-ascii-input.csv utf8_output.csv
Derived from : https://docs.python.org/2.7/library/csv.html#examples
http://stackoverflow.com/questions/17245415/read-and-write-csv-files-including-unicode-with-python-2-7
'''


import csv
import codecs
import cStringIO
import os
import sys




class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)


    def __iter__(self):
        return self


    def next(self):
        return self.reader.next().encode("utf-8")




class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)


    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]


    def __iter__(self):
        return self




class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()


    def writerow(self, row):
        '''writerow(unicode) -> None
        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)


    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


if len(sys.argv) < 3:
    sys.exit("{} <ascii-csv> <utf-csv>\n".format(sys.argv[0]))


ascii_csv = os.path.expanduser(sys.argv[1])
utf8_csv = os.path.expanduser(sys.argv[2])

# Only the input file has to exist; the output file is created below.
if not (os.path.exists(ascii_csv) and ascii_csv.endswith('.csv')):
    sys.exit("The input file does not exist or is not a .csv file.")


with open(ascii_csv, 'rb') as fin, open(utf8_csv, 'wb') as fout:
    reader = UnicodeReader(fin)
    writer = UnicodeWriter(fout, quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow(line)
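
The script above targets Python 2.7 (cStringIO, unicode, next()). For anyone on Python 3, the same idea can be sketched much more briefly, because open() handles the encoding directly. This is an assumption-laden sketch rather than a drop-in replacement: the function name recode_csv is made up here, and it mirrors the original's choices of "utf-8-sig" and quoting every field.

```python
import csv

def recode_csv(src_path, dst_path, src_encoding="utf-8-sig"):
    """Copy a CSV, quoting every field and writing UTF-8 with a BOM."""
    # newline="" lets the csv module manage line endings itself.
    with open(src_path, newline="", encoding=src_encoding) as fin, \
         open(dst_path, "w", newline="", encoding="utf-8-sig") as fout:
        writer = csv.writer(fout, quoting=csv.QUOTE_ALL)
        for row in csv.reader(fin):
            writer.writerow(row)
```

Usage would be along the lines of recode_csv("input.csv", "output.csv"). The "utf-8-sig" codec accepts plain ASCII input and writes the three-byte UTF-8 signature at the start of the output.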

Answer 10

Tom Gewecke

Level 10

115,678 points

May 5, 2016 4:45 PM in response to VikingOSX

VikingOSX -- How would the output of this program differ from the input, since nothing changes when us-ascii is relabelled as utf-8?

Answer 11

MattJayC Author

Level 1

14 points

May 6, 2016 1:02 AM in response to VikingOSX

This does it now, many thanks. Hopefully it will run with no problem in the other script.

Many Thanks

Answer 12

May 6, 2016 7:07 AM in response to Tom Gewecke

Tom,

Numbers v3.6.1 (and LibreOffice Calc) always informs that the Export to CSV default encoding is UTF-8. If only US-ASCII characters are used in the spreadsheet, then the UNIX file utility will report the CSV as ASCII text with CRLF line terminators. If the spreadsheet contains US-ASCII and non-US-ASCII characters, then the UNIX file utility will identify the CSV as UTF-8 Unicode text with CRLF line terminators.

The Python script does some more things, but in essence, it takes a pure US-ASCII CSV, and re-encodes it to 8-bit ASCII characters that now inform the UNIX file utility to report the above UTF-8 message. Rather pointless.

In retrospect, and some discovery on my part after I posted the code, it is unnecessary, as the current Numbers and LibreOffice Calc are adapting the exported CSV encoding as the content demands.

Answer 13

Tom Gewecke

Level 10

115,678 points

May 6, 2016 8:45 AM in response to VikingOSX

Thanks for the explanation. I wonder if the script adds a BOM to the beginning of the file. I remember now that some apps require that to recognize "utf-8", even if the content is really only ascii.

Answer 14

May 6, 2016 9:53 AM in response to Tom Gewecke

The script does insert a BOM at the beginning of the output csv file. If I would just make good on my intent to acquire Office 2016 for Mac, I could test this output file compatibility with Excel. 😉
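
For reference, the effect of the BOM on ASCII-only content can be sketched in Python 3 with the standard "utf-8-sig" codec. The sample text is made up for illustration.

```python
import codecs

text = "abc\n"  # ASCII-only sample content

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # same bytes with the BOM in front

# The only difference is the three-byte signature EF BB BF.
print(signed == codecs.BOM_UTF8 + plain)  # True
print(len(signed) - len(plain))           # 3
```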

Answer 15

Hiroto

Level 5

7,461 points

May 7, 2016 2:45 AM in response to MattJayC

Hello

As I understand it, there are separate issues in your case.

1) The file(1) command will report us-ascii if the file contains only characters in the ASCII range, because there is no way to tell the difference.

2) TextEdit (with its plain text file encoding for opening files set to Automatic) will open the file in whatever encoding it sees fit, although it honours the file's com.apple.TextEncoding extended attribute, if present, or the file's UTF-8 BOM (Byte Order Mark), if present.

3) The provided Python script prepends a UTF-8 BOM to the content of the file. Note that a UTF-8 BOM is neither required nor recommended: the UTF-8 encoding scheme defines an ordered byte sequence, so there is no notion of endianness and no need for a BOM, and it is not recommended because its status in the text is ambiguous.

cf.

http://www.unicode.org/versions/Unicode8.0.0/

http://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf

The Unicode Standard Core Specification

- Conformance

- 3.10 Unicode Encoding Schemes

- D95

- the last paragraph
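
Point 1 can be illustrated with a deliberately simplified guess at a charset in Python 3. This is not the algorithm file(1) actually uses; the function name guess_charset and the "unknown" label are made up for this sketch. It only shows why ASCII-only bytes give a detector nothing to distinguish UTF-8 from us-ascii.

```python
import codecs

def guess_charset(data):
    """Very rough charset guess for a bytes object (illustration only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"      # explicit signature present
    if all(byte < 0x80 for byte in data):
        return "us-ascii"   # nothing outside the ASCII range to go on
    try:
        data.decode("utf-8")
        return "utf-8"      # valid multi-byte UTF-8 sequences
    except UnicodeDecodeError:
        return "unknown"

print(guess_charset(b"abc\n"))                 # us-ascii
print(guess_charset(codecs.BOM_UTF8 + b"abc")) # utf-8
print(guess_charset("\u2714".encode("utf-8"))) # utf-8
```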

Here's a sample script to demonstrate how a UTF-8 BOM and/or the com.apple.TextEncoding extended attribute affects file(1) and TextEdit.

#!/bin/bash
#
#   It is assumed that TextEdit's preference is set as -
#       Plain Text File Encoding
#           opening file : Automatic
#
DIR=~/Desktop/temp
mkdir -p "$DIR" && cd "$DIR" || exit

# simple abc
echo 'abc' > a.txt
file -I a.txt       # -> a.txt: text/plain; charset=us-ascii
open -e a.txt       # -> TextEdit opens file in System's primary encoding (e.g., MacRoman)

# with UTF-8 BOM signature
iconv -t UTF-16 a.txt | iconv -f UTF-16BE -t UTF-8 > a1.txt    # effectively prepend UTF-8 BOM signature
file -I a1.txt      # -> a1.txt: text/plain; charset=utf-8
open -e a1.txt      # -> TextEdit opens file in UTF-8 (honouring the UTF-8 BOM signature)

# with com.apple.TextEncoding extended attribute set to UTF-8
echo 'abc' > a2.txt
xattr -w com.apple.TextEncoding "UTF-8;$((0x08000100))" a2.txt
file -I a2.txt      # -> a2.txt: text/plain; charset=us-ascii
open -e a2.txt      # -> TextEdit opens file in UTF-8 (honouring the extended attribute)

---

Just in case, here's an AppleScript to prepend a UTF-8 BOM to the input file and write the result to the output file you specify.

-- APPLESCRIPT
set infile to (choose file)'s POSIX path
set outfile to (choose file name default name "out.txt")'s POSIX path
--set infile to "/path/to/in.txt"
--set outfile to "/path/to/out.txt"

do shell script "/bin/bash -s <<'EOF' - " & infile's quoted form & space & outfile's quoted form & "
# effectively prepend UTF-8 BOM signature to $1 and output to $2
iconv -t UTF-16 \"$1\" | iconv -f UTF-16BE -t UTF-8 > \"$2\""
-- END OF APPLESCRIPT

And here's another one to set the com.apple.TextEncoding extended attribute of the input file to UTF-8.

-- APPLESCRIPT
set infile to (choose file)'s POSIX path
--set infile to "/path/to/in.txt"
do shell script "xattr -w com.apple.TextEncoding \"UTF-8;$((0x08000100))\" " & infile's quoted form
-- END OF APPLESCRIPT

Because of the reasons mentioned above, I recommend the latter.
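
As a side note on the attribute value: $((0x08000100)) is shell arithmetic, so the string actually stored is "UTF-8;134217984". The Python 3 sketch below builds the same value and the matching xattr command. The function names are made up for illustration, the call itself only makes sense on a Mac where the xattr tool exists, and this sketch is separate from the scripts above.

```python
import subprocess

def text_encoding_value(name="UTF-8", cf_encoding=0x08000100):
    """Build the com.apple.TextEncoding value: name, semicolon, decimal code."""
    return "{};{}".format(name, cf_encoding)

def xattr_command(path):
    """Argument list equivalent to the xattr -w call in the AppleScript."""
    return ["xattr", "-w", "com.apple.TextEncoding", text_encoding_value(), path]

def set_utf8_attribute(path):
    """Run the command; assumes macOS with the xattr tool on the PATH."""
    subprocess.run(xattr_command(path), check=True)

print(text_encoding_value())  # UTF-8;134217984
```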

Briefly tested under OS X 10.6.8.

Good luck,

H
