Hello
As I understand it, there are separate issues in your case.
1) The file(1) command reports us-ascii if the file contains only characters in the ASCII range, because such a file is byte-for-byte identical in US-ASCII and UTF-8 and there's no way to tell the difference.
2) The TextEdit application (with its plain text file encoding for opening files set to Automatic) opens the file in whatever encoding it sees fit, although it honours the file's com.apple.TextEncoding extended attribute or the file's UTF-8 BOM (Byte Order Mark) if either is present.
3) The provided Python script prepends a UTF-8 BOM to the content of the file. Note that a UTF-8 BOM is neither required nor recommended. It is not required because the UTF-8 encoding scheme defines a fixed byte order, so there's no notion of endianness and no need for a BOM. It is not recommended because its status in the text is ambiguous.
cf.
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf
The Unicode Standard Core Specification
- Conformance
- 3.10 Unicode Encoding Schemes
- D95
- the last paragraph
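To make points 1) and 3) concrete, here's a minimal sketch that counts the bytes outside the ASCII range in a file and checks whether the file starts with the UTF-8 BOM bytes (EF BB BF). The helper names count_non_ascii and has_utf8_bom are my own and hypothetical, and the sketch assumes a head(1) that accepts -c.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any standard tool.

# Print the number of bytes outside the ASCII range (0x00-0x7F) in file $1.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Succeed if file $1 starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

tmp=$(mktemp -d)
printf 'abc\n' > "$tmp/plain.txt"            # pure ASCII
printf '\357\273\277abc\n' > "$tmp/bom.txt"  # same text with a UTF-8 BOM in front

count_non_ascii "$tmp/plain.txt"   # prints 0
count_non_ascii "$tmp/bom.txt"     # prints 3 (the three BOM bytes)
has_utf8_bom "$tmp/plain.txt" && echo "plain: BOM" || echo "plain: no BOM"
has_utf8_bom "$tmp/bom.txt" && echo "bom: BOM" || echo "bom: no BOM"
```

A file for which count_non_ascii prints 0 gives a detector nothing to distinguish UTF-8 from US-ASCII, which is the situation described in 1).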
Here's a sample script to demonstrate how a UTF-8 BOM and/or the com.apple.TextEncoding extended attribute affects file(1) and TextEdit.
#!/bin/bash
#
# It is assumed that TextEdit's preference is set as -
# Plain Text File Encoding
# opening file : Automatic
#
DIR=~/Desktop/temp
mkdir -p "$DIR" && cd "$DIR" || exit
# simple abc
echo 'abc' > a.txt
file -I a.txt # -> a.txt: text/plain; charset=us-ascii
open -e a.txt # -> TextEdit opens file in System's primary encoding (e.g., MacRoman)
# with UTF-8 BOM signature
iconv -t UTF-16 a.txt | iconv -f UTF-16BE -t UTF-8 > a1.txt # effectively prepend UTF-8 BOM signature
file -I a1.txt # -> a1.txt: text/plain; charset=utf-8
open -e a1.txt # -> TextEdit opens file in UTF-8 (honouring the UTF-8 BOM signature)
# with com.apple.TextEncoding extended attribute set to UTF-8
echo 'abc' > a2.txt
xattr -w com.apple.TextEncoding "UTF-8;$((0x08000100))" a2.txt
file -I a2.txt # -> a2.txt: text/plain; charset=us-ascii
open -e a2.txt # -> TextEdit opens file in UTF-8 (honouring the extended attribute)
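
As a side note on the attribute value used above: it is an encoding name, a semicolon, and a decimal number, and the shell arithmetic only converts the hexadecimal constant 0x08000100 (the CFStringEncoding value for UTF-8, kCFStringEncodingUTF8) to decimal. A small sketch of what the string expands to (the variable name is my own):

```shell
#!/bin/sh
# Expand the value written to com.apple.TextEncoding in the script above.
# $((0x08000100)) converts the hexadecimal constant to decimal.
value="UTF-8;$((0x08000100))"
echo "$value"   # prints UTF-8;134217984
```

So writing "UTF-8;134217984" literally should be equivalent to the form with $((...)).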
---
Just in case, here's an AppleScript script to prepend a UTF-8 BOM to the input file and write the result to the specified output file.
-- APPLESCRIPT
set infile to (choose file)'s POSIX path
set outfile to (choose file name default name "out.txt")'s POSIX path
--set infile to "/path/to/in.txt"
--set outfile to "/path/to/out.txt"
do shell script "/bin/bash -s <<'EOF' - " & infile's quoted form & space & outfile's quoted form & "
# effectively prepend UTF-8 BOM signature to $1 and output to $2
iconv -t UTF-16 \"$1\" | iconv -f UTF-16BE -t UTF-8 > \"$2\"
EOF"
-- END OF APPLESCRIPT
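
The iconv pipeline above relies on "iconv -t UTF-16" emitting a big-endian BOM, which may not hold for every iconv implementation. As an alternative sketch (a different technique from the one above, not what the script does), the three BOM bytes can be written directly in front of the file's content. The helper name prepend_utf8_bom is hypothetical, and it assumes the input is already UTF-8 and does not already start with a BOM.

```shell
#!/bin/sh
# Hypothetical helper: write the UTF-8 BOM (EF BB BF) followed by the
# unchanged bytes of input file $1 to output file $2.
# Assumes $1 is already UTF-8 and has no BOM yet; $2 must differ from $1.
prepend_utf8_bom() {
    { printf '\357\273\277'; cat "$1"; } > "$2"
}

tmp=$(mktemp -d)
printf 'abc\n' > "$tmp/in.txt"
prepend_utf8_bom "$tmp/in.txt" "$tmp/out.txt"
od -An -tx1 "$tmp/out.txt"   # bytes: ef bb bf 61 62 63 0a
```

Unlike the iconv version, this does no conversion at all, so it only suits input that is known to be UTF-8 (or plain ASCII).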
And here's another AppleScript script, which sets the com.apple.TextEncoding extended attribute of the input file to UTF-8.
-- APPLESCRIPT
set infile to (choose file)'s POSIX path
--set infile to "/path/to/in.txt"
do shell script "xattr -w com.apple.TextEncoding \"UTF-8;$((0x08000100))\" " & infile's quoted form
-- END OF APPLESCRIPT
For the reasons mentioned above, I recommend the latter (setting the extended attribute rather than prepending a BOM).
Briefly tested under OS X 10.6.8.
Good luck,
H