Hello mingsai,
The result of the script depends on the dictionary in use, because each dictionary can use a different XSL stylesheet to generate the XML (XHTML) representation of its record entries. If the XML data does not contain the elements and attributes the script expects, the parser should fail and return a Not-found result.
For example, 小词典.dictionary returns the following XML data for the query word '我':
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<d:entry id="xcd09325cdd85ded7c7cab6332260857fbd" d:title="我">
<span class="syntax"><span d:pr="US">wǒ</span></span>
<h1>我</h1>
<div>I; me; my</div>
<h3>我</h3>
<div class="editEntry">
<a href="http://xiaocidian.com/a/index.php?word=%E6%88%91">Edit CC-CEDICT Entry</a>
</div>
</d:entry>
</body>
</html>
Here the script retrieves the title as the value at XPath //d:entry/@d:title and the pronunciation at //d:entry//span[@d:pr].
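For illustration, a minimal sketch of that extraction, using REXML from Ruby's standard library, might look like the following. This is not the hanzi2pinyin script itself: parse_entry, SAMPLE_XML and DCS_NS are names made up here, and the XML is an abbreviated copy of the sample above. It returns nil when the expected element or attributes are missing, which corresponds to the Not-found case.

```ruby
# coding: utf-8
require 'rexml/document'

# Namespace URI bound to the d: prefix in the sample record.
DCS_NS = { 'd' => 'http://www.apple.com/DTDs/DictionaryService-1.0.rng' }

# Abbreviated copy of the 小词典 record for '我' (hypothetical constant name).
SAMPLE_XML = <<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
<body>
<d:entry id="xcd09325cdd85ded7c7cab6332260857fbd" d:title="我">
<span class="syntax"><span d:pr="US">wǒ</span></span>
<h1>我</h1>
<div>I; me; my</div>
</d:entry>
</body>
</html>
XML

# Returns { :title => ..., :pinyin => ... }, or nil when the d:entry element,
# its d:title attribute, or a span carrying d:pr is missing.
def parse_entry(xml)
  doc = REXML::Document.new(xml)
  entry = REXML::XPath.first(doc, '//d:entry', DCS_NS)
  return nil unless entry
  title = entry.attributes['d:title']
  # Pick the first descendant span that has a d:pr attribute.
  span = REXML::XPath.match(entry, './/span').find { |s| s.attributes['d:pr'] }
  return nil unless title && span
  { :title => title, :pinyin => span.text }
end

if __FILE__ == $0
  p parse_entry(SAMPLE_XML)
  p parse_entry('<html><body><p>no entry here</p></body></html>')
end
```

A record from another dictionary will most likely need different lookups inside parse_entry.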
In short, you have to write the parsing logic in the hanzi2pinyin script to match the XML structure of the record entries of each dictionary you use. There is no universal XPath or regex pattern for this. If you wish, you can use the following script to obtain the XML data of an entry from a specified dictionary.
#!/bin/bash
# dictf='/Library/Dictionaries/Simplified Chinese - English.dictionary'
# dictf='/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
dictf='/Library/Dictionaries/小词典.dictionary'
# dictf='/Library/Dictionaries/小词典-繁体字.dictionary'
# dictf='/Library/Dictionaries/小词典-英语.dictionary'
# dictf='/Library/Dictionaries/CC-CEDICT.dictionary'
# CMD=/usr/local/bin/dictionary_record_data.rb
CMD=~/desktop/dictionary_record_data.rb
"$CMD" -d "$dictf" -c10 -o0 '我'
This assumes that you have saved the following Ruby script as dictionary_record_data.rb on the desktop.
dictionary_record_data.rb
#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w
# coding: utf-8
require 'optparse'
require 'osx/cocoa'
include OSX
# OSX.require_framework '/System/Library/Frameworks/CoreServices.framework/Frameworks/DictionaryServices.framework' # [1]
while File.exist?(BSFILE = File.expand_path("~/desktop/DictionaryServices.#{rand(1e10)}.bridgesupport")) do end
Signal.trap("EXIT") { File.delete BSFILE if File.exist?(BSFILE) }
File.open(BSFILE, "w") { |f| f.print DATA.read }
OSX.load_bridge_support_file BSFILE # [2]
File.delete BSFILE if File.exist?(BSFILE)
# -----------------------------------------------------
# * some DictionaryServices functions (OS X 10.6.8)
#
# extern CFArrayRef DCSCopyRecordsForSearchString (DCSDictionaryRef, CFStringRef, unsigned long long, long long)
# unsigned long long method
# 0 = exact match
# 1 = forward match (prefix match)
# 2 = partial query match (matching (leading) part of query; including ignoring diacritics, four tones in Chinese, etc)
# >=3 = ? (exact match?)
#
# long long max_record_count
#
# extern CFStringRef DCSRecordCopyData (DCSRecordRef, long)
# long output_style
# 0 = XML XHTML <html> string
# 1 = XML XHTML <html> string
# 2 = XML XHTML <html> string
# 3 = plain text
# 4 = XML XHTML <text> string (single element)
# * corresponding to (?)
# Transform.xsl
# TransformApp.xsl
# TransformPanel.xsl
# TransformSimpleText.xsl
# TransformText.xsl
# -----------------------------------------------------
def dict(argv)
#
# argv = options query [query ...]
# -d, --dictionary DICTIONARY Dictionary file.
# -c, --count COUNT Max record count to retrieve (=10).
# -o, --output FORMAT Output format (=0).
# 0 = XML (XHTML) <html> string
# 1 = XML (XHTML) <html> string
# 2 = XML (XHTML) <html> string
# 3 = plain text
# 4 = XML (XHTML) <text> string
# -h, --help Display this help.
#
args = {
:dictf => nil,
:count => 10,
:output => 0,
}
op = OptionParser.new do|o|
o.banner = "Usage: #{File.basename($0)} options query [query ...]"
o.on('-d', '--dictionary DICTIONARY', String, "Dictionary file.") do |f|
args[:dictf] = f
end
o.on('-c', '--count COUNT', Integer, "Max record count to retrieve (=10).") do |i|
raise OptionParser::InvalidArgument, i unless i.to_i > 0
args[:count] = i.to_i
end
o.on('-o', '--output FORMAT', Integer, "Output format (=0).",
" 0 = XML (XHTML) <html> string",
" 1 = XML (XHTML) <html> string",
" 2 = XML (XHTML) <html> string",
" 3 = plain text",
" 4 = XML (XHTML) <text> string") do |i|
raise OptionParser::InvalidArgument, i unless [0, 1, 2, 3, 4].include?(i.to_i)
args[:output] = i.to_i
end
o.on( '-h', '--help', 'Display this help.' ) do
$stderr.puts o; exit 1
end
end
begin
op.parse!(argv)
rescue => ex
$stderr.puts "#{ex.class} : #{ex.message}"
$stderr.puts op.help(); exit 1
end
if argv.length == 0
$stderr.puts op.help(); exit 1
end
if (dctf = args[:dictf])
unless File.exists?(dctf)
$stderr.puts "No such dictionary: %s" % dctf
exit 1
end
url = NSURL.fileURLWithPath(dctf)
dcts = dcts.allObjects if (dcts = DCSCopyAvailableDictionaries()).is_a? NSSet # [5]
dct, = dcts.select { |d| DCSDictionaryGetURL(d).path == url.path }
unless dct
$stderr.puts "Failed to get dictionary for: %s" % dctf
exit 2
end
else
dct, = DCSGetActiveDictionaries()
unless dct
$stderr.puts "Failed to get the 1st active dictionary"
exit 2
end
end
max_count = args[:count]
output_format = args[:output]
argv.map {|a| a.to_ns }.each do |q| # [3]
rr = DCSCopyRecordsForSearchString(dct, q, 0, max_count)
unless rr
puts "Not found: %s" % q
next
end
rr.each do |r| # r = DCSRecordRef
data = DCSRecordCopyData(r, output_format)
puts data
end
end
#
# [1] DictionaryServices.framework/Resources/BridgeSupport/DictionaryServices.bridgesupport contains errors that need fixing.
# I.e., in the signatures of DCSCopyTextDefinition(), DCSGetTermRangeInString() etc.,
# {??=qq} should have been {_CFRange=qq}
# {??=ii} should have been {_CFRange=ii}
# [2] A fixed and extended bridgesupport file is loaded by OSX.load_bridge_support_file.
# It now includes signatures for several undocumented functions as well.
# [3] to_ns on each argument is required to handle Unicode characters correctly (in Ruby 1.8).
#
end
dict(ARGV)
exit
# ---- test code begins ----
# dictf = '/Library/Dictionaries/Simplified Chinese - English.dictionary'
# dictf = '/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
dictf = '/Library/Dictionaries/小词典.dictionary'
# dictf = '/Library/Dictionaries/小词典-繁体字.dictionary'
# dictf = '/Library/Dictionaries/小词典-英语.dictionary'
# dictf = '/Library/Dictionaries/CC-CEDICT.dictionary'
argv = ['-d', dictf, '我']
dict(argv)
# ---- test code ends ----
__END__
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE signatures SYSTEM "file://localhost/System/Library/DTDs/BridgeSupport.dtd">
<signatures version="0.9">
<function name="DCSCopyTextDefinition">
<arg type="^{__DCSDictionary=}"></arg>
<arg type="^{__CFString=}"></arg>
<arg type64="{_CFRange=qq}" type="{_CFRange=ii}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSGetTermRangeInString">
<arg type="^{__DCSDictionary=}"></arg>
<arg type="^{__CFString=}"></arg>
<arg type64="q" type="l"></arg>
<retval type64="{_CFRange=qq}" type="{_CFRange=ii}"></retval>
</function>
<function name="DCSDictionaryCreate">
<arg type="^{__CFURL=}"></arg>
<retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSGetActiveDictionaries">
<retval type="^{__CFArray=}"></retval>
</function>
<function name="DCSCopyAvailableDictionaries">
<retval type="^{__CFSet=}"></retval>
</function>
<function name="DCSGetDefaultDictionary">
<retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSGetDefaultThesaurus">
<retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSDictionaryGetURL">
<arg type="^{__DCSDictionary=}"></arg>
<retval type="^{__CFURL=}"></retval>
</function>
<function name="DCSDictionaryGetName">
<arg type="^{__DCSDictionary=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSDictionaryGetIdentifier">
<arg type="^{__DCSDictionary=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSCopyRecordsForSearchString">
<arg type="^{__DCSDictionary=}"></arg>
<arg type="^{__CFString=}"></arg>
<arg type="l"></arg>
<arg type="l"></arg>
<retval type="^{__CFArray=}"></retval>
</function>
<function name="DCSRecordGetHeadword">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetString">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetRawHeadword">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetTitle">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetAnchor">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetDataURL">
<arg type="^{__DCSRecord=}"></arg>
<retval type="^{__CFURL=}"></retval>
</function>
<function name="DCSRecordCopyData">
<arg type="^{__DCSRecord=}"></arg>
<arg type="l"></arg>
<retval type="^{__CFString=}"></retval>
</function>
</signatures>
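Once you have the XML of each dictionary you care about, one possible way to organise the per-dictionary parsing logic is a table of rules keyed by the dictionary's file name. The following is only a sketch under assumptions: PINYIN_RULES and pinyin_for are names made up here, REXML is my choice of parser, only the 小词典 rule follows the XML shown in this post, and the 'Hypothetical.dictionary' entry (including its class="pinyin" span) is an invented placeholder showing where another rule would go.

```ruby
# coding: utf-8
require 'rexml/document'

DCS_NS = { 'd' => 'http://www.apple.com/DTDs/DictionaryService-1.0.rng' }

# One extraction rule per dictionary, keyed by bundle file name.
# Each rule receives the d:entry element and returns the pinyin or nil.
PINYIN_RULES = {
  # Matches the 小词典 record structure: a span carrying a d:pr attribute.
  '小词典.dictionary' => lambda { |entry|
    span = REXML::XPath.match(entry, './/span').find { |s| s.attributes['d:pr'] }
    span && span.text
  },
  # Invented placeholder; a real dictionary's structure must be inspected first.
  'Hypothetical.dictionary' => lambda { |entry|
    node = REXML::XPath.first(entry, './/span[@class="pinyin"]')
    node && node.text
  },
}

# Returns the pinyin string, or nil if there is no rule for the dictionary
# or the record does not have the structure the rule expects.
def pinyin_for(dictionary_path, xml)
  rule = PINYIN_RULES[File.basename(dictionary_path)]
  return nil unless rule
  entry = REXML::XPath.first(REXML::Document.new(xml), '//d:entry', DCS_NS)
  entry && rule.call(entry)
end

if __FILE__ == $0
  xml = '<html xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">' \
        '<body><d:entry id="x" d:title="我">' \
        '<span class="syntax"><span d:pr="US">wǒ</span></span>' \
        '</d:entry></body></html>'
  p pinyin_for('/Library/Dictionaries/小词典.dictionary', xml)
end
```

A dictionary without a rule simply yields nil, so the caller can treat it the same way as a Not-found result.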
All the best,
H