Convert csv file from us-ascii to UTF-8

I'm just trying to convert a file. I run this in Terminal:


StudioA:~ StudioA$ file -I /Users/StudioA/Desktop/Mikey_WK37.csv

and the result is

/Users/StudioA/Desktop/Mikey_WK37.csv: text/plain; charset=us-ascii

So I run this

iconv -f US-ASCII -t UTF-8 /Users/StudioA/Desktop/Mikey_WK37.csv > /Users/StudioA/Desktop/new_file.csv

Then when I check the new file, it is still us-ascii.

Where am I going wrong?

Mac Pro, OS X El Capitan (10.11.1)

Posted on May 5, 2016 6:29 AM

18 replies

May 5, 2016 1:53 PM in response to Tom Gewecke

If I create a file in TextEdit in plain text, type 1 or whatever, then save it as 'Western (Mac OS Roman)' and run file -I theFile, it is us-ascii.


This is how most of the files I get look. I need to use an extra character, "✔", and when you add that to the file, it then asks for it to be saved in another encoding (such as UTF-8).


Tom Gewecke wrote:


Does your csv file actually include a statement saying it is us-ascii? If so you could just change that to utf-8.

If I can just change it, how can I do that in Terminal?


Thanks


Matt

May 6, 2016 1:01 AM in response to MattJayC

Here is a Python program that reads in US-ASCII CSV and outputs it as a UTF-8 quoted field, Excel compatible CSV. The output opens and formats nicely in Numbers v3.6.1, and LibreOffice Calc v5.1.2.2.


Usage: ucsv.py input.csv output.csv


Copy and paste the following Python code into a programmer's editor. If it is Sublime Text 3, then use Paste and Indent. Otherwise, paste into a TextEdit plain text file, and save as ucsv.py. Make the Python script executable in the Terminal.


Test this on a small CSV and open it in a spreadsheet application to verify it works ok for you.


Code:

#!/usr/bin/env python
# coding: utf-8
'''
ucsv.py


Read in a US-ASCII CSV document and write out a quoted field Excel CSV.
Output CSV read correctly by Numbers v3.6.1, LibreOffice Calc 5.1.2.2.


Usage: ucsv.py us-ascii-input.csv utf8_output.csv
Derived from : https://docs.python.org/2.7/library/csv.html#examples
http://stackoverflow.com/questions/17245415/read-and-write-csv-files-including-unicode-with-python-2-7
'''


import csv
import codecs
import cStringIO
import os
import sys




class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)


    def __iter__(self):
        return self


    def next(self):
        return self.reader.next().encode("utf-8")




class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)


    def next(self):
        '''next() -> list of unicode
        Read the next CSV row and return its fields as Unicode strings.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]


    def __iter__(self):
        return self




class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()


    def writerow(self, row):
        '''writerow(row) -> None
        Encode a row (a sequence of Unicode strings) and write it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)


    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


if len(sys.argv) < 3:
    sys.exit("{} <ascii-csv> <utf-csv>\n".format(sys.argv[0]))


# Expand any leading ~ before testing the path, so a quoted "~/in.csv" also works.
ascii_csv = os.path.expanduser(sys.argv[1])
utf8_csv = os.path.expanduser(sys.argv[2])
if not (os.path.exists(ascii_csv) and ascii_csv.endswith('.csv')):
    sys.exit("The input file does not exist or is not a .csv file.")


with open(ascii_csv, 'rb') as fin, open(utf8_csv, 'wb') as fout:
    reader = UnicodeReader(fin)
    writer = UnicodeWriter(fout, quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow(line)
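
For anyone on Python 3, where cStringIO and the unicode built-in no longer exist, the same conversion can be sketched much more briefly. This is only a minimal sketch and not a tested replacement for the script above: convert_csv and its add_bom switch are hypothetical names, and it assumes the input decodes as US-ASCII or UTF-8.

```python
import csv


def convert_csv(src_path, dst_path, add_bom=True):
    """Copy a CSV to a UTF-8 CSV with every field quoted.

    add_bom is a hypothetical switch: the "utf-8-sig" codec prepends the
    UTF-8 BOM, plain "utf-8" does not.
    """
    out_encoding = "utf-8-sig" if add_bom else "utf-8"
    # newline="" leaves line-ending handling to the csv module.
    # Reading with "utf-8-sig" accepts pure ASCII as well as UTF-8,
    # with or without a leading BOM.
    with open(src_path, newline="", encoding="utf-8-sig") as fin, \
         open(dst_path, "w", newline="", encoding=out_encoding) as fout:
        writer = csv.writer(fout, dialect=csv.excel, quoting=csv.QUOTE_ALL)
        for row in csv.reader(fin, dialect=csv.excel):
            writer.writerow(row)
```

Calling convert_csv("in.csv", "out.csv", add_bom=False) would write the same rows without the BOM, in which case a file containing only ASCII characters would presumably still be reported as us-ascii.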

May 6, 2016 7:07 AM in response to Tom Gewecke

Tom,


Numbers v3.6.1 (and LibreOffice Calc) always states that the default encoding for Export to CSV is UTF-8. If only US-ASCII characters are used in the spreadsheet, then the UNIX file utility will report the CSV as ASCII text with CRLF line terminators. If the spreadsheet contains both US-ASCII and non-US-ASCII characters, then the UNIX file utility will identify the CSV as UTF-8 Unicode text with CRLF line terminators.
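
The reason the report depends on the content can be sketched in a few lines of Python 3. This is only an illustration: guess_charset is a hypothetical helper written in the spirit of file -I, not the algorithm the file utility actually uses, and the sample rows are made up.

```python
def guess_charset(data: bytes) -> str:
    """Rough guess at a charset label; NOT the real algorithm of file(1)."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8bit"  # placeholder label for anything else


plain = "name,done\r\nMikey,yes\r\n"
ticked = "name,done\r\nMikey,\u2714\r\n"  # U+2714 is the check mark character

# ASCII-only text produces exactly the same bytes in both encodings,
# so converting it from US-ASCII to UTF-8 changes nothing on disk.
assert plain.encode("ascii") == plain.encode("utf-8")

print(guess_charset(plain.encode("utf-8")))   # us-ascii
print(guess_charset(ticked.encode("utf-8")))  # utf-8
```

Only once a non-ASCII character such as the check mark is in the file do the bytes differ from plain ASCII.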


The Python script does a bit more, but in essence it takes a pure US-ASCII CSV and rewrites it with a leading UTF-8 byte order mark (its writer uses the utf-8-sig codec), which is what makes the UNIX file utility report it as UTF-8. Rather pointless.


In retrospect, after some discovery on my part since posting the code, it is unnecessary: the current Numbers and LibreOffice Calc adapt the exported CSV encoding as the content demands.

May 7, 2016 2:45 AM in response to MattJayC

Hello


As I understand it, there are separate issues in your case.


1) The file(1) command will report us-ascii if the file contains only characters in the ASCII range, because such a file is byte-for-byte identical in US-ASCII and UTF-8 and there's no way to tell the difference.


2) The TextEdit application (with its plain text file encoding for opening files set to Automatic) will open the file in whatever encoding it sees fit, although it honours the file's com.apple.TextEncoding extended attribute, if present, or the file's UTF-8 BOM (Byte Order Mark), if present.


3) The provided Python script prepends a UTF-8 BOM to the content of the file. Note that a UTF-8 BOM is neither required nor recommended: the UTF-8 encoding scheme defines an ordered byte sequence, so there's no notion of endianness and therefore no need for a BOM, and it's not recommended because its status in the text is ambiguous.


cf.

http://www.unicode.org/versions/Unicode8.0.0/

http://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf

The Unicode Standard Core Specification

- Conformance

- 3.10 Unicode Encoding Schemes

- D95

- the last paragraph
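
To make the BOM point concrete, here is a small Python 3 sketch of what the utf-8-sig codec (the one used by the script's writer) adds to the output. The sample text is arbitrary.

```python
import codecs

text = "abc\n"
without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'
print(without_bom)      # b'abc\n'
print(with_bom)         # b'\xef\xbb\xbfabc\n'

# A plain UTF-8 decoder keeps the BOM as the character U+FEFF at the start
# of the text, which is the ambiguity mentioned above; utf-8-sig strips it.
assert with_bom.decode("utf-8") == "\ufeff" + text
assert with_bom.decode("utf-8-sig") == text
```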




Here's a sample script to demonstrate how a UTF-8 BOM and/or the com.apple.TextEncoding extended attribute will affect file(1) and TextEdit.


#!/bin/bash
#
#   It is assumed that TextEdit's preference is set as -
#       Plain Text File Encoding
#           opening file : Automatic
#
DIR=~/Desktop/temp
mkdir -p "$DIR" && cd "$DIR" || exit

# simple abc
echo 'abc' > a.txt
file -I a.txt
# -> a.txt: text/plain; charset=us-ascii
open -e a.txt
# -> TextEdit opens file in System's primary encoding (e.g., MacRoman)

# with UTF-8 BOM signature
iconv -t UTF-16 a.txt | iconv -f UTF-16BE -t UTF-8 > a1.txt    # effectively prepend UTF-8 BOM signature
file -I a1.txt
# -> a1.txt: text/plain; charset=utf-8
open -e a1.txt
# -> TextEdit opens file in UTF-8 (honouring the UTF-8 BOM signature)

# with com.apple.TextEncoding extended attribute set to UTF-8
echo 'abc' > a2.txt
xattr -w com.apple.TextEncoding "UTF-8;$((0x08000100))" a2.txt
file -I a2.txt
# -> a2.txt: text/plain; charset=us-ascii
open -e a2.txt
# -> TextEdit opens file in UTF-8 (honouring the extended attribute)
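
Why the two-step iconv pipeline ends up prepending a UTF-8 BOM can be sketched in Python 3. This is an illustration under an assumption, not a description of iconv internals: it assumes that iconv -t UTF-16 writes big-endian output with a leading byte order mark, which is what the pipeline relies on.

```python
import codecs

text = "abc\n"

# Stand-in for `iconv -t UTF-16`, ASSUMING big-endian output with a BOM.
utf16_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Stand-in for `iconv -f UTF-16BE -t UTF-8`: a decoder with explicit
# endianness does not consume the BOM, so it survives as U+FEFF ...
decoded = utf16_with_bom.decode("utf-16-be")
assert decoded[0] == "\ufeff"

# ... and U+FEFF encoded as UTF-8 is exactly the three BOM bytes.
utf8_bytes = decoded.encode("utf-8")
assert utf8_bytes == codecs.BOM_UTF8 + b"abc\n"
print(utf8_bytes)  # b'\xef\xbb\xbfabc\n'
```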




---

In case it's useful, here's an AppleScript to prepend a UTF-8 BOM to the input file and write the result to the specified output file.



-- APPLESCRIPT
set infile to (choose file)'s POSIX path
set outfile to (choose file name default name "out.txt")'s POSIX path
--set infile to "/path/to/in.txt"
--set outfile to "/path/to/out.txt"

do shell script "/bin/bash -s <<'EOF' - " & infile's quoted form & space & outfile's quoted form & "
# effectively prepend UTF-8 BOM signature to $1 and output to $2
iconv -t UTF-16 \"$1\" | iconv -f UTF-16BE -t UTF-8 > \"$2\""
-- END OF APPLESCRIPT



And here's another one to set the com.apple.TextEncoding extended attribute of the input file to UTF-8.



-- APPLESCRIPT
set infile to (choose file)'s POSIX path
--set infile to "/path/to/in.txt"

do shell script "xattr -w com.apple.TextEncoding \"UTF-8;$((0x08000100))\" " & infile's quoted form
-- END OF APPLESCRIPT
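
The $((0x08000100)) in the xattr command is just shell arithmetic turning a hexadecimal constant into decimal. A short Python 3 sketch of the value that ends up in the attribute follows; xattr_command is a hypothetical helper that only builds the argument list and does not run anything.

```python
# 0x08000100 is the value of Core Foundation's kCFStringEncodingUTF8.
UTF8_ENCODING_ID = 0x08000100

# The attribute value has the form "<name>;<decimal id>".
attr_value = "UTF-8;{}".format(UTF8_ENCODING_ID)
print(attr_value)  # UTF-8;134217984


def xattr_command(path):
    """Build (but do not execute) the xattr call used in the script above."""
    return ["xattr", "-w", "com.apple.TextEncoding", attr_value, path]


print(xattr_command("/path/to/in.txt"))
```

On a Mac, the resulting list could be handed to subprocess.run to set the attribute; that step is left out here because it only works where the xattr tool is available.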



For the reasons mentioned above, I recommend the latter (setting the extended attribute rather than prepending a BOM).


Briefly tested under OS X 10.6.8.


Good luck,

H
