Hello
I have managed to call some of the undocumented functions in DictionaryServices.framework. The shell script listed below, which is a wrapper around a RubyCocoa script, transliterates Hanzi to pinyin by looking words up in a specified dictionary.
The key functions are DCSCopyRecordsForSearchString() and DCSRecordCopyData(), which retrieve the structured (XHTML) representation of each matching entry, so that we can parse it and extract the title and pronunciation cleanly.
The script takes the longest substring at the beginning of the current query that matches a term in the dictionary. If such a substring exists, the script looks it up in the dictionary, outputs the result, and continues with the remaining substring as the new query; otherwise it gives up the first character of the current query, outputs the not-found result for that character (specified by the -e option), and continues with the remaining substring.
Note that some words (characters) have multiple readings, in which case the script outputs the primary match followed by the additional matches in parentheses. Use the -c option to specify the maximum number of records to retrieve. The additional matches are just noise when the primary match is correct, but the primary match is not necessarily correct. If you wish, you may specify -c1 to suppress the additional matches.
The result depends on the dictionary used. Please see the source code for details of the options. A usage example follows, assuming that you have saved the script as /usr/local/bin/hanzi2pinyin (and made it executable with chmod a+x).
Usage e.g.
#!/bin/bash
# dictf='/Library/Dictionaries/Simplified Chinese - English.dictionary'
# dictf='/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
# dictf='/Library/Dictionaries/小词典.dictionary'
# dictf='/Library/Dictionaries/小词典-繁体字.dictionary'
dictf='/Library/Dictionaries/CC-CEDICT.dictionary'
/usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o0 -e -- '悟空捣鬼花果山' # => 悟空[wùkōng] 捣鬼[dǎoguǐ] 花果山[huāguǒshān]
/usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o1 -e -- '悟空捣鬼花果山' # => 悟空 捣鬼 花果山[wùkōng dǎoguǐ huāguǒshān]
/usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o2 -e -- '悟空捣鬼花果山' # => wùkōng dǎoguǐ huāguǒshān
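The matching loop described above can be sketched in plain Ruby, without RubyCocoa. This is only a minimal illustration under stated assumptions: TOY_DICT, segment and interleaved are hypothetical names, the toy data is made up for the example, and a Hash lookup stands in for the DCSGetTermRangeInString() / DCSCopyRecordsForSearchString() calls that the real script uses.

```ruby
# Toy stand-in for a dictionary: headword => list of readings (hypothetical data).
TOY_DICT = {
  '悟空'   => ['wùkōng'],
  '捣鬼'   => ['dǎoguǐ'],
  '花果山' => ['huāguǒshān'],
  '花'     => ['huā'],
  '的'     => ['de', 'dī', 'dí', 'dì'],
}

# Split query greedily into the longest leading headwords found in dict.
# Returns [[headword, pronunciation], ...].
# echo = string to output for a character that is not found;
#        when nil, the character itself is echoed.
def segment(query, dict, echo = nil)
  chars = query.chars
  out = []
  until chars.empty?
    # Length of the longest leading substring that is a headword, or nil.
    len = chars.length.downto(1).find { |n| dict.key?(chars.first(n).join) }
    if len
      word = chars.shift(len).join
      first, *rest = dict[word]
      # Primary reading, followed by additional readings in parentheses.
      out << [word, rest.empty? ? first : "#{first}(#{rest.join(',')})"]
    else
      c = chars.shift              # give up one character at the beginning
      out << [c, echo || c]
    end
  end
  out
end

# Output format 0 (interleaved): H[p] H[p]...
def interleaved(pairs)
  pairs.map { |h, p| "#{h}[#{p}]" }.join(' ')
end

puts interleaved(segment('悟空捣鬼花果山', TOY_DICT))
# => 悟空[wùkōng] 捣鬼[dǎoguǐ] 花果山[huāguǒshān]
```

Note that 花果山 wins over the shorter headword 花 because the longest leading match is tried first; a character with no entry falls through to the echo branch.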
hanzi2pinyin
#!/bin/bash
hanzi2pinyin()
{
#
# $@ = options query [query ...]
# -d, --dictionary DICTIONARY Dictionary file.
# -c, --count COUNT Max record count to retrieve (=10).
# -o, --output FORMAT Output format (=0).
# 0 = interleaved : H[p] H[p]...
# 1 = separate : H H...[p p...]
# 2 = pinyin only : p p...
# -e, --echo [CHARACTER] Character(s) to be echoed for no result.
# Given no CHARACTER, query is echoed.
# -h, --help Display this help.
#
# v0.33
# written by Hiroto, 2014-09
#
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - <(cat <<'BRIDGESUPPORT'
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE signatures SYSTEM "file://localhost/System/Library/DTDs/BridgeSupport.dtd">
<signatures version="0.9">
<function name="DCSCopyTextDefinition">
<arg type="^{__DCSDictionary=}"></arg>
<arg type="^{__CFString=}"></arg>
<arg type64="{_CFRange=qq}" type="{_CFRange=ii}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSGetTermRangeInString">
<arg type="^{__DCSDictionary=}"></arg>
<arg type="^{__CFString=}"></arg>
<arg type64="q" type="l"></arg>
<retval type64="{_CFRange=qq}" type="{_CFRange=ii}"></retval>
</function>
<function name="DCSDictionaryCreate">
<arg type="^{__CFURL=}"></arg>
<retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSGetActiveDictionaries">
<retval type="^{__CFArray=}"></retval>
</function>
<function name="DCSGetDefaultDictionary">
<retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSGetDefaultThesaurus">
<retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSDictionaryGetURL">
<arg type="^{__DCSDictionary=}"></arg>
<retval type="^{__CFURL=}"></retval>
</function>
<function name="DCSDictionaryGetName">
<arg type="^{__DCSDictionary=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSDictionaryGetIdentifier">
<arg type="^{__DCSDictionary=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSCopyRecordsForSearchString">
<arg type="^{__DCSDictionary=}"></arg>
<arg type="^{__CFString=}"></arg>
<arg type="l"></arg>
<arg type="l"></arg>
<retval type="^{__CFArray=}"></retval>
</function>
<function name="DCSRecordGetHeadword">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetString">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetRawHeadword">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetTitle">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetAnchor">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetDataURL">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFURL=}"></retval>
</function>
<function name="DCSRecordCopyData">
<arg type="^{__DCSRecord=}"></arg>
<arg type="l"></arg>
<retval type="^{__CFString=}"></retval>
</function>
</signatures>
BRIDGESUPPORT) "$@"
# -----------------------------------------------------
# * some DictionaryServices functions
#
# (undocumented)
#
# extern CFArrayRef DCSGetActiveDictionaries (void)
# extern DCSDictionaryRef DCSGetDefaultDictionary (void)
# extern DCSDictionaryRef DCSGetDefaultThesaurus (void)
# extern DCSDictionaryRef DCSDictionaryCreate (CFURLRef)
# extern CFURLRef DCSDictionaryGetURL (DCSDictionaryRef)
# extern CFStringRef DCSDictionaryGetName (DCSDictionaryRef)
# extern CFStringRef DCSDictionaryGetIdentifier (DCSDictionaryRef)
#
# extern CFArrayRef DCSCopyRecordsForSearchString (DCSDictionaryRef, CFStringRef, unsigned long long, long long)
# unsigned long long method
# 0 = exact match
# 1 = forward match (prefix match)
# 2 = partial query match (matching (leading) part of query; including ignoring diacritics, four tones in Chinese, etc)
# >=3 = ? (exact match?)
#
# long long max_record_count
#
# extern CFStringRef DCSRecordGetString (DCSRecordRef)
# extern CFStringRef DCSRecordGetHeadword (DCSRecordRef)
# extern CFStringRef DCSRecordGetRawHeadword (DCSRecordRef)
# extern CFStringRef DCSRecordGetTitle (DCSRecordRef)
# extern CFStringRef DCSRecordGetAnchor (DCSRecordRef)
# extern CFURLRef DCSRecordGetDataURL (DCSRecordRef)
#
# extern CFStringRef DCSRecordCopyData (DCSRecordRef, long)
# long output_style
# 0 = XML XHTML <html> string
# 1 = XML XHTML <html> string
# 2 = XML XHTML <html> string
# 3 = plain text
# 4 = XML XHTML <text> string (single element)
# * corresponding to (?)
# Transform.xsl
# TransformApp.xsl
# TransformPanel.xsl
# TransformSimpleText.xsl
# TransformText.xsl
#
# (documented)
#
# CFStringRef DCSCopyTextDefinition (DCSDictionaryRef, CFStringRef, CFRange)
# CFRange DCSGetTermRangeInString (DCSDictionaryRef, CFStringRef, CFIndex)
#
# -----------------------------------------------------
#
# ARGV[0] = fixed and extended bridge support file for DictionaryServices.framework
# ARGV[1..N] = options and query word(s)
#
require 'osx/cocoa'
include OSX
# OSX.require_framework '/System/Library/Frameworks/CoreServices.framework/Frameworks/DictionaryServices.framework' # [1]
OSX.load_bridge_support_file ARGV.shift # [2]
def parse_options(argv)
require 'optparse'
args = {
:dictf => nil,
:count => 10,
:output => 0,
:echo => '',
}
op = OptionParser.new do|o|
o.banner = "Usage: #{File.basename($0)} options query [query ...]"
o.on('-d', '--dictionary DICTIONARY', String, "Dictionary file.") do |f|
args[:dictf] = f
end
o.on('-c', '--count COUNT', Integer, "Max record count to retrieve (=10).") do |i|
raise OptionParser::InvalidArgument, i unless i.to_i > 0
args[:count] = i.to_i
end
o.on('-o', '--output FORMAT', Integer, "Output format (=0).",
" 0 = interleaved : H[p] H[p]...",
" 1 = separate : H H...[p p...]",
" 2 = pinyin only : p p...") do |i|
raise OptionParser::InvalidArgument, i unless [0, 1, 2].include?(i.to_i)
args[:output] = i.to_i
end
o.on('-e', '--echo [CHARACTER]', String, "Character(s) to be echoed for no result.",
"Given no CHARACTER, query is echoed.") do |s|
args[:echo] = s || ''
end
o.on( '-h', '--help', 'Display this help.' ) do
$stderr.puts o; exit 1
end
end
begin
op.parse!(argv)
rescue => ex
$stderr.puts "#{ex.class} : #{ex.message}"
$stderr.puts op.help(); exit 1
end
if argv.length == 0
$stderr.puts op.help(); exit 1
end
args
end
args = parse_options(ARGV)
if (dctf = args[:dictf])
unless File.exists?(dctf)
$stderr.puts "No such dictionary: %s" % dctf
exit 1
end
url = NSURL.fileURLWithPath(dctf)
dct = DCSDictionaryCreate(url)
unless dct
$stderr.puts "Failed to create dictionary object from: %s" % dctf
exit 2
end
else
dct = DCSGetDefaultDictionary()
unless dct
$stderr.puts "Failed to obtain default dictionary"
exit 2
end
end
QUERY_METHOD = 0 # exact match
MAX_RECORD_COUNT = args[:count] # max record count to be retrieved
OUTPUT_FORMAT = args[:output] # output format option
# 0 = interleaved : H[p] H[p]...
# 1 = separate : H H...[p p...]
# 2 = pinyin only : p p...
#
# e.g., given query '我的母亲'
# 0 => 我[wǒ] 的[de(dī,dí,dì)] 母亲[mǔqīn]
# 1 => 我 的 母亲[wǒ de(dī,dí,dì) mǔqīn]
# 2 => wǒ de(dī,dí,dì) mǔqīn
TRIM_CHARS = "\t\n |" # characters to be trimmed at both ends of pronunciation string
ECHO_QUERY = '' # special character to let it echo query if result is not found
ECHO_CHAR = args[:echo] # character(s) to be echoed if no result is found for query
# if ECHO_QUERY is specified, query string is echoed for no result
TRIM_CHARS_SET = NSCharacterSet.characterSetWithCharactersInString(TRIM_CHARS)
ECHO_NS = ECHO_CHAR.to_ns
ARGV.map {|a| a.to_ns }.each do |q| # [3]
dd = []
while true do
#
# Until given query string (q) is exhausted, repeat as follows -
# get longest leading substring (qu) of the query string matching a term in dictionary,
# look the substring up in dictionary and retrieve title and pronunciation of the matching entry.
#
u = DCSGetTermRangeInString(dct, q, 0) # try to find longest leading range matching a term in dictionary
u = NSMakeRange(0, 1) if u.location == KCFNotFound # fallback [4]
qu = q.substringWithRange(u)
rr = DCSCopyRecordsForSearchString(dct, qu, QUERY_METHOD, MAX_RECORD_COUNT)
unless rr
c = q.substringWithRange(NSMakeRange(0, 1)) # give up one character at the beginning
dd << [[c, ECHO_CHAR == ECHO_QUERY ? c : ECHO_NS]]
break if q.length < 2
q = q.substringFromIndex(1)
else
tt, pp = [], {}
rr.each do |r| # r = DCSRecordRef
#
# parse xml representation of record entry to get title and pronunciation
#
xml = DCSRecordCopyData(r, 0)
err = OCObject.new
doc = NSXMLDocument.alloc.objc_send(
:initWithXMLString, xml,
:options, 0,
:error, err)
unless doc
$stderr.puts "Failed to obtain XML document for %s: %s" % [qu, err.description]
next
end
nn = doc.objc_send(
:nodesForXPath, '//d:entry/@d:title', # d:title attribute
:error, nil)
# fall back to the echo string when the XPath query fails (nil) or matches nothing
title = (nn.nil? || nn == []) ? ECHO_NS : nn.first.stringValue
nn = doc.objc_send(
:nodesForXPath, '//d:entry//span[@d:pr]', # span element with d:pr attribute
:error, nil)
pron = (nn.nil? || nn == []) ? ECHO_NS : nn.first.stringValue
pron = pron.stringByTrimmingCharactersInSet(TRIM_CHARS_SET).
stringByReplacingOccurrencesOfString_withString(' ', '').lowercaseString
tt << title unless tt.include?(title)
title_s = title.to_s # for use as hash key in ruby
if not pp.key?(title_s)
pp[title_s] = [pron]
elsif not pp[title_s].include?(pron)
pp[title_s] << pron
end
end
#
# Let query_{k} denote sub-query for k-th substring defined by range u,
# title_{k,i} denote i-th found title for query_{k},
# pron_{k,i,j} denote j-th pronunciation for title_{k,i};
#
# array cc_k holds each collection of pronunciations per title_{k,i} found for query_{k}:
# cc_k = [c_{k,1}, c_{k,2}, ...]
# c_{k,i} = [ title_{k,i}, pron_{k,i,1} *1( '(' pron_{k,i,2} ',' pron_{k,i,3} ',' ... ')' ) ]
#
# array dd holds list of cc_k for every sub-query_{k}
# dd = [cc_1, cc_2, ...]
#
cc_k = tt.map do |t|
a = pp[t.to_s]
[t, a.shift + (a == [] ? '' : "(%s)" % a.join(','))]
end
dd << cc_k
k = u.location + u.length
break unless k < q.length
q = q.substringFromIndex(k)
end
end
case OUTPUT_FORMAT
# 0 = interleaved : H[p] H[p]...
# 1 = separate : H H...[p p...]
# 2 = pinyin only : p p...
when 0
ee = dd.map do |cc|
next '' if cc == []
("%s[%s]" % cc.shift) + (cc == [] ? '' : "(%s)" % cc.map {|c| "%s[%s]" % c}.join(','))
end
puts ee.join(' ')
when 1
aa = dd.map do |cc|
a, b = cc.transpose
next '' unless a
(a.shift) + (a == [] ? '' : "(%s)" % a.join(','))
end
bb = dd.map do |cc|
a, b = cc.transpose
next '' unless b
(b.shift) + (b == [] ? '' : "(%s)" % b.join(','))
end
puts "%s[%s]" % [aa.join(' '), bb.join(' ')]
when 2
bb = dd.map do |cc|
a, b = cc.transpose
next '' unless b
(b.shift) + (b == [] ? '' : "(%s)" % b.join(','))
end
puts bb.join(' ')
end
end
#
# [1] DictionaryServices.framework/Resources/BridgeSupport/DictionaryServices.bridgesupport has problems that need to be fixed.
# I.e., in signatures of DCSCopyTextDefinition(), DCSGetTermRangeInString() function etc,
# {??=qq} should have been {_CFRange=qq}
# {??=ii} should have been {_CFRange=ii}
# [2] Fixed and extended bridgesupport file is loaded by OSX.load_bridge_support_file.
# It now includes signatures for several undocumented functions as well.
# [3] Converting each argument with to_ns is required to handle unicode characters correctly (in ruby 1.8).
# [4] DCSGetTermRangeInString(dct, q, 0) returning the range [KCFNotFound, 0] does not necessarily mean that q's 1st character,
# used as a query, matches no term in the dictionary. It is necessary to call DCSCopyRecordsForSearchString()
# on the 1st character in order to know whether matching term(s) exist.
#
EOF
}
hanzi2pinyin "$@"
Tested under 10.6.8. Since the script relies on undocumented functions, they may have changed in later OS versions, which would break the script.
Hope this helps,
H