MattJayC

Q: Convert csv file from us-ascii to UTF-8

I'm just trying to convert a file. I ran this in Terminal:

 

StudioA:~ StudioA$ file -I /Users/StudioA/Desktop/Mikey_WK37.csv 

and the result is

/Users/StudioA/Desktop/Mikey_WK37.csv: text/plain; charset=us-ascii


So I run this

iconv -f US-ASCII -t UTF-8 /Users/StudioA/Desktop/Mikey_WK37.csv > /Users/StudioA/Desktop/new_file.csv

Then when I check the new file, it is still reported as us-ascii.


Where am I going wrong?

Mac Pro, OS X El Capitan (10.11.1)

Posted on May 5, 2016 6:29 AM

  • by Hiroto,

    Hiroto Hiroto May 7, 2016 2:45 AM in response to MattJayC
    Level 5 (7,281 points)

    Hello

     

    As I understand it, there are separate issues in your case.

     

    1) The file(1) command reports us-ascii if the file contains only characters in the ASCII range, because such a file is byte-for-byte the same in US-ASCII and UTF-8 and there is no way to tell the difference.

     

    2) TextEdit (with its plain text file encoding for opening files set to Automatic) opens the file in whatever encoding it sees fit, except that it honours the file's com.apple.TextEncoding extended attribute or UTF-8 BOM (Byte Order Mark) if either is present.

     

    3) The python script provided earlier prepends a UTF-8 BOM to the content of the file. Note that a UTF-8 BOM is neither required nor recommended. The UTF-8 encoding scheme defines a fixed byte order, so there is no notion of endianness and therefore no need for a BOM, and it is not recommended because its status in the text is ambiguous.

     

    cf.

    http://www.unicode.org/versions/Unicode8.0.0/

    http://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf

    The Unicode Standard Core Specification

        - Conformance

            - 3.10 Unicode Encoding Schemes

                - D95

                - the last paragraph
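
Point 1 can be checked directly. The following is only a minimal sketch: the function name and the sample CSV content are made up for illustration. It converts an ASCII-only file with iconv and compares the result with the input. Since every ASCII character keeps the same single-byte encoding in UTF-8, the two files should come out identical, which is why file(1) still reports us-ascii after the conversion.

```shell
#!/bin/sh
# Sketch: an ASCII-only file is byte-for-byte identical before and after
# an ASCII -> UTF-8 conversion. Function name and file names are hypothetical.
ascii_roundtrip_check() {
    dir=$(mktemp -d) || return 1
    # Sample content containing only ASCII characters.
    printf 'name,week\nMikey,37\n' > "$dir/in.csv"
    if ! iconv -f US-ASCII -t UTF-8 "$dir/in.csv" > "$dir/out.csv"; then
        rm -rf "$dir"
        return 1
    fi
    # cmp -s is silent and reports equality through its exit status.
    if cmp -s "$dir/in.csv" "$dir/out.csv"; then
        echo identical
    else
        echo different
    fi
    rm -rf "$dir"
}

ascii_roundtrip_check
```

So the iconv command in the question is not failing. There is simply nothing for it to change in an ASCII-only file.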

     

     

     

    Here's a sample script to demonstrate how a UTF-8 BOM and/or the com.apple.TextEncoding extended attribute affects file(1) and TextEdit.

     

    #!/bin/bash
    # 
    #   It is assumed that TextEdit's preference is set as -
    #       Plain Text File Encoding
    #           opening file : Automatic
    # 
    DIR=~/Desktop/temp
    mkdir -p "$DIR" && cd "$DIR" || exit
    
    # simple abc
    echo 'abc' > a.txt
    file -I a.txt       # -> a.txt: text/plain; charset=us-ascii
    open -e a.txt       # -> TextEdit opens file in System's primary encoding (e.g., MacRoman)
    
    
    # with UTF-8 BOM signature
    iconv -t UTF-16 a.txt | iconv -f UTF-16BE -t UTF-8 > a1.txt     # effectively prepend UTF-8 BOM signature
    file -I a1.txt      # -> a1.txt: text/plain; charset=utf-8
    open -e a1.txt      # -> TextEdit opens file in UTF-8 (honouring the UTF-8 BOM signature)
    
    
    # with com.apple.TextEncoding extended attribute set to UTF-8
    echo 'abc' > a2.txt
    xattr -w com.apple.TextEncoding "UTF-8;$((0x08000100))" a2.txt
    file -I a2.txt      # -> a2.txt: text/plain; charset=us-ascii
    open -e a2.txt      # -> TextEdit opens file in UTF-8 (honouring the extended attribute)
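
What distinguishes a1.txt from a.txt is the three-byte signature EF BB BF at the start of the file. As a rough sketch (the helper name has_utf8_bom is hypothetical, not part of any tool), that signature can be checked with od:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom() {
    # -N3 limits od to the first three bytes; tr removes od's padding and newline.
    lead=$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')
    [ "$lead" = "efbbbf" ]
}

# Demo on two throwaway files: one plain, one with the signature prepended.
# Octal escapes are used because \x is not portable across printf versions.
demo_dir=$(mktemp -d)
printf 'abc\n' > "$demo_dir/a.txt"
printf '\357\273\277abc\n' > "$demo_dir/a1.txt"
for f in "$demo_dir/a.txt" "$demo_dir/a1.txt"; do
    if has_utf8_bom "$f"; then
        echo "${f##*/}: BOM"
    else
        echo "${f##*/}: no BOM"
    fi
done
rm -rf "$demo_dir"
```

The extended attribute used for a2.txt lives outside the file content, so a check like this would not see it.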
    

     

     

     

    ---

    In case it helps, here's an AppleScript to prepend a UTF-8 BOM to the input file and write the result to the specified output file.

     

     

    -- APPLESCRIPT
    set infile to (choose file)'s POSIX path
    set outfile to (choose file name default name "out.txt")'s POSIX path
    --set infile to "/path/to/in.txt"
    --set outfile to "/path/to/out.txt"
    do shell script "/bin/bash -s <<'EOF' - " & infile's quoted form & space & outfile's quoted form & "
    # effectively prepend UTF-8 BOM signature to $1 and output to $2
    iconv -t UTF-16 \"$1\" | iconv -f UTF-16BE -t UTF-8 > \"$2\""
    -- END OF APPLESCRIPT
    

     

     

    And here's another one to set the com.apple.TextEncoding extended attribute of the input file to UTF-8.

     

     

    -- APPLESCRIPT
    set infile to (choose file)'s POSIX path
    --set infile to "/path/to/in.txt"
    do shell script "xattr -w com.apple.TextEncoding \"UTF-8;$((0x08000100))\" " & infile's quoted form
    -- END OF APPLESCRIPT
    

     

     

    For the reasons mentioned above, I recommend the latter.

     

    Briefly tested under OS X 10.6.8.

     

    Good luck,

    H

  • by MattJayC,

    MattJayC MattJayC May 11, 2016 3:28 AM in response to Hiroto
    Level 1 (5 points)
    Mac OS X

    Hi Hiroto

     

    I am finding that the first one works for me.

     

    You helped me with a script in the past. I've made some modifications to it, but it doesn't always work correctly, so I need some further help with it. I will try to post in the original thread.

     

    Thanks

    Matt

  • by MattJayC,

    MattJayC MattJayC Jun 2, 2016 7:53 AM in response to Hiroto
    Level 1 (5 points)
    Mac OS X

     

     

    -- APPLESCRIPT
    set infile to (choose file)'s POSIX path
    set outfile to (choose file name default name "out.txt")'s POSIX path
    --set infile to "/path/to/in.txt"
    --set outfile to "/path/to/out.txt"
    do shell script "/bin/bash -s <<'EOF' - " & infile's quoted form & space & outfile's quoted form & "
    # effectively prepend UTF-8 BOM signature to $1 and output to $2
    iconv -t UTF-16 \"$1\" | iconv -f UTF-16BE -t UTF-8 > \"$2\""
    -- END OF APPLESCRIPT
    

     

    I'm using this. How can I write the result back to the same file, or create a file of the same name? I don't need to keep the original.

     

    Matt

  • by Hiroto,

    Hiroto Hiroto Jun 2, 2016 11:47 PM in response to MattJayC
    Level 5 (7,281 points)

    Hello

     

    You may try something like this.

     

     

    set f to (choose file)'s POSIX path
    --set f to "/path/to/a.txt"
    add_utf8_bom(f)
    
    on add_utf8_bom(f)
        (*
            string f : POSIX path of text file in UTF-8
        *)
        do shell script "file=" & f's quoted form & "
    [[ $(head -c3 \"$file\") == $'\\xef\\xbb\\xbf' ]] && exit # UTF-8 BOM already exists
    temp=$(mktemp /tmp/iconv.XXXXXX) || exit
    iconv -f UTF-8 -t UTF-16 \"$file\" > \"$temp\" &&
    iconv -f UTF-16BE -t UTF-8 \"$temp\" > \"$file\"
    err=$?
    rm \"$temp\"
    exit $err"
    end add_utf8_bom
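
For comparison, here is a plain shell sketch of the same idea that does not go through UTF-16. The iconv approach relies on `iconv -t UTF-16` emitting big-endian output with a leading BOM, which appears to hold for the iconv shipped with OS X but is not guaranteed for every iconv implementation. The function name add_utf8_bom_inplace is made up for this example, and it assumes the file is already valid UTF-8 (plain ASCII qualifies).

```shell
#!/bin/sh
# Hypothetical helper: prepend a UTF-8 BOM to a file in place,
# unless the file already starts with one.
# Assumption: the file content is already valid UTF-8.
add_utf8_bom_inplace() {
    file=$1
    # Read only the first three bytes as hex and strip od's whitespace.
    lead=$(od -An -tx1 -N3 "$file" | tr -d '[:space:]')
    if [ "$lead" = "efbbbf" ]; then
        return 0    # BOM already present, nothing to do
    fi
    temp=$(mktemp "${TMPDIR:-/tmp}/bom.XXXXXX") || return 1
    # Octal escapes are used because \x is not portable in printf.
    if ! { printf '\357\273\277' && cat "$file"; } > "$temp"; then
        rm -f "$temp"
        return 1
    fi
    # Write back with cat rather than mv so the original file keeps
    # its permissions and extended attributes.
    cat "$temp" > "$file"
    status=$?
    rm -f "$temp"
    return $status
}
```

It would be called as `add_utf8_bom_inplace /path/to/a.txt`. Running it twice on the same file leaves a single BOM, because the second call returns early.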
    

     

     

     

    Briefly tested under OS X 10.6.8.

     

    Regards,

    H
