mingsai Author

Level 1

30 points

How can I use Automator to extract substring of text based on pattern?

I have inbound text in a workflow and I want to extract a substring from the text.

(inbound text)

我 wǒ 代 ① （指一人）（用作主语） I （用作宾语） me （表所属关系） my 告诉我 tell me 我为人人，人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行！ I think I can manage it. ② （指两人或以上）（用作主语） we （用作宾语） us （表所属关系） our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ （表泛指） [used together with 你 in parallel structures] anyone 大家你一言，我一语，献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ （指自我） self → 忘我, 自我

I basically only want the non-double-byte characters between the first character and the first occurrence of ① (see sample):

我 wǒ 代 ①

Using regex101.com, I determined that the following regex pattern should produce the required result, but I need help getting it to work in Automator:

/ ([a-z].) /

MacBook Air (13-inch Mid 2012), Mac OS X (10.7.5), Love the mac (OSX 10.9+)

Posted on Sep 9, 2014 2:59 PM

Top-ranking reply

SGIII

Level 8

36,134 points

Posted on Sep 9, 2014 7:24 PM

If your pinyin substring never has spaces in it, then you can use AppleScript to extract it like this:

This is the script (with the sample input). Copy/paste into AppleScript Editor and click the green triangle 'Run' button:

set input to "我 wǒ 代 ① （指一人）（用作主语） I （用作宾语） me （表所属关系） my 告诉我 tell me 我为人人，人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行！ I think I can manage it. ② （指两人或以上）（用作主语） we （用作宾语） us （表所属关系） our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ （表泛指） [used together with 你 in parallel structures] anyone 大家你一言，我一语，献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ （指自我） self → 忘我, 自我"

set pyCharSet to {"a", "ā", "á", "ǎ", "à", "b", "c", "d", "e", "ē", "é", "ě", "è", "f", "g", "h", "i", "ī", "í", "ǐ", "ì", "j", "k", "l", "m", "n", "o", "ō", "ó", "ǒ", "ò", "p", "q", "r", "s", "t", "u", "ū", "ú", "ǔ", "ù", "ǖ", "ǚ", "ǜ", "w", "x", "y", "z"}

set {oTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "①"}
set cc to (input as string)'s text items's item 1's characters 2 thru -1
set py to ""
repeat with c in cc
    if c is in pyCharSet then set py to py & c
end repeat
set AppleScript's text item delimiters to oTID
return py

The above view is AppleScript Editor, not Automator. What do you plan to do with the results in Automator?
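
For anyone who wants to try the same idea outside AppleScript, here is a minimal Ruby sketch of the approach above: cut the text at the first "①", drop the first character, and keep only characters from the pinyin set. The helper name pinyin_before_marker is made up for illustration, and the character list simply copies pyCharSet from the script, so it shares that list's gaps.

```ruby
# Hypothetical helper mirroring the AppleScript above; not part of any library.
# The letters are copied from pyCharSet in the post.
PINYIN_LETTERS = "aāáǎàbcdeēéěèfghiīíǐìjklmnoōóǒòpqrstuūúǔùǖǚǜwxyz".chars.freeze

def pinyin_before_marker(text, marker = "①")
  head = text.split(marker, 2).first.to_s      # everything before the first marker
  head.chars.drop(1)                           # skip the leading headword character
      .select { |ch| PINYIN_LETTERS.include?(ch) }
      .join
end

puts pinyin_before_marker("我 wǒ 代 ① （指一人）")   # prints "wǒ"
```

Like the AppleScript, this drops spaces, so a multi-syllable reading written with spaces would come back run together.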


51 replies

SGIII

Level 8

36,134 points

Sep 11, 2014 2:31 PM in response to mingsai

This may be close to what you're looking for. In my (limited) testing it returns the first occurrence of what looks to be pinyin, though it also returns false positives if the definition were to include an "English-looking" word before the pinyin.

set ww to input's paragraphs's item 1's words
repeat with i from 1 to count ww
    set w to ww's item i
    if w's every character's item 1 is in pyCharSet then return w
end repeat

Am experimenting with the 'considering diacriticals' AppleScript statement to see if I can get it to only match words with accented characters and then, if it doesn't, to check a list of common neutral tones ("de", "zhe", "le", "ba", "ma", "ne", "a", "lai", "ge", etc). So far no joy, though I do seem to have better luck with AppleScript than with shell scripts in avoiding spurious results with Chinese.

SGIII

Level 8

36,134 points

Sep 11, 2014 7:47 PM in response to mingsai

This is closer. It returns the first instance of a pinyin word containing a diacritical (tone mark), ignoring any romanized words before that, unless they are in a set of common neutral tone syllables.

It could be modified to extract all instances of pinyin in a text.

set pinyinDiacriticals to {"ā", "á", "ǎ", "à", "ē", "é", "ě", "è", "ī", "í", "ǐ", "ì", "ō", "ó", "ǒ", "ò", "ū", "ú", "ǔ", "ù", "ǖ", "ǚ", "ǜ"}

set neutralToneWords to {"a", "ba", "bian", "de", "di", "ge", "jie", "lai", "le", "ma", "ne", "tou", "yi", "zhe"}

set tt to input's paragraphs's contents as text -- merge paragraphs
set ww to tt's words -- split the text into words
repeat with i from 1 to count ww
    set w to ww's item i
    set cc to w's characters -- split a word into characters
    repeat with c in cc
        if c is in pinyinDiacriticals then return w
    end repeat
    if w is in neutralToneWords then return w
end repeat
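
The same test can be sketched in Ruby: return the first word that either contains a tone-marked vowel or appears in the neutral-tone list. The function name first_pinyin_word is hypothetical, the two lists are copied from the script above, and splitting on whitespace is a simplifying assumption (AppleScript's "words" also breaks on punctuation).

```ruby
# Lists copied from the AppleScript in the post; the helper name is illustrative.
TONE_MARKS = "āáǎàēéěèīíǐìōóǒòūúǔùǖǚǜ".chars.freeze
NEUTRAL_TONE_WORDS = %w[a ba bian de di ge jie lai le ma ne tou yi zhe].freeze

def first_pinyin_word(text)
  text.split(/\s+/).find do |word|
    # A word qualifies if any character carries a tone mark,
    # or if the whole word is a known neutral-tone syllable.
    word.each_char.any? { |ch| TONE_MARKS.include?(ch) } ||
      NEUTRAL_TONE_WORDS.include?(word)
  end
end

puts first_pinyin_word("老虎 lǎohǔ tiger")   # prints "lǎohǔ"
```

The function returns nil when nothing matches, so a caller should check for that before using the result.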

mingsai Author

Level 1

30 points

Sep 12, 2014 8:28 PM in response to SGIII

SGIII,

This is very close indeed. Here's an idea: how about reversing the logic related to neutralToneWords, so that the test becomes "if w is not in chineseLanguageCharacterSet then return w"? It would be great if AppleScript could exclude all Chinese characters in this way. I've been examining the character sets available in Mac OS X using the Character Viewer, and I've noted that the system can categorize Hanzi very neatly. One possible solution would be to exclude the entire set and take the first non-Chinese word containing letters.

BTW - I was able to extend my Character Viewer character sets by clicking the settings gear and choosing Customize List...

SGIII

Level 8

36,134 points

Sep 13, 2014 4:38 PM in response to mingsai

How about reversing the logic related to the neutralToneWords so that a search is if w is not in chineseLanguageCharacterSet then return w. ... One possible solution would be to exclude the entire set and take the first non-Chinese word containing letters.

This approach turns out to be pretty easy in AppleScript. The isCJK() handler in the following script shows one way to test whether a character is Chinese (or Korean or Japanese). In AppleScript a character's id property is the Unicode number expressed in decimal. So I looked up the relevant blocks in the Unicode charts and converted the hex values there to decimal.

set tt to input's paragraphs's contents as text -- merge paragraphs
set ww to tt's words -- split the text into words
repeat with i from 1 to count ww
    set w to ww's item i
    set cc to w's characters -- split a word into characters
    repeat with c in cc
        if not isCJK(c) then return w
    end repeat
end repeat

on isCJK(c)
    -- if working with rare characters, first uncomment A. Unlikely to need B!
    -- c's id is greater than 131071 and c's id is less than 173783 -- extension B (really rare)
    -- c's id is greater than 13311 and c's id is less than 19894 -- 3400-4DB5 - extension A (rare)
    c's id is greater than 19967 and c's id is less than 40909 -- 4E00-9FCC - common CJK
end isCJK

The A and B extensions probably aren't necessary in most situations (except for historians) so I commented them out. If needed, one would uncomment the A extension first, and finally B if needed.

Unlike the previous script, this variation grabs the first thing that isn't CJK, whether it's pinyin or something else.
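
As a rough cross-check of the code point ranges, here is the same test sketched in Ruby, where String#ord plays the role of AppleScript's id property. The names cjk? and first_non_cjk_word are made up for this example. The ranges are the ones named in the handler's comments, written in hex; extension B is converted from the decimal bounds there (131072 to 173782).

```ruby
# Ranges taken from the comments in isCJK(); helper names are hypothetical.
CJK_RANGES = [
  0x4E00..0x9FCC,     # common CJK
  0x3400..0x4DB5,     # extension A (rare)
  0x20000..0x2A6D6,   # extension B (really rare)
].freeze

def cjk?(ch)
  CJK_RANGES.any? { |range| range.cover?(ch.ord) }
end

# First whitespace-separated word containing at least one non-CJK character.
def first_non_cjk_word(text)
  text.split(/\s+/).find do |word|
    !word.empty? && word.each_char.any? { |ch| !cjk?(ch) }
  end
end

puts first_non_cjk_word("我 wǒ 代 ①")   # prints "wǒ"
```

As with the AppleScript version, full-width punctuation such as （ falls outside these ranges and therefore counts as non-CJK.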

mingsai Author

Level 1

30 points

Sep 13, 2014 11:12 PM in response to SGIII

Masterful. I will test and advise on utility among the various different dictionaries.

Sep 14, 2014 4:54 AM in response to mingsai

I realized that my regular expression was tuned to wŏ, but not to multi-tonal (if that is an applicable expression) words such as tiger 老虎 lǎohǔ. I then came across the following post that shows how to construct an elaborate regular expression for detecting hanzi. Would this be of value to you?

SGIII

Level 8

36,134 points

Sep 14, 2014 8:44 AM in response to VikingOSX

Great link. Have saved that one. Complicated stuff!

Made a mistake in the AppleScript handler. If testing for rare characters too, better would be:

on isCJK(c)

if c's id is greater than 131071 and c's id is less than 173783 then return true -- extension B (really rare)

if c's id is greater than 13311 and c's id is less than 19894 then return true -- 3400-4DB5 - extension A (rare)

c's id is greater than 19967 and c's id is less than 40909 -- 4E00-9FCC - common CJK

end isCJK

In normal usage, could comment out the two 'if' statements.

mingsai Author

Level 1

30 points

Sep 14, 2014 12:05 PM in response to VikingOSX

That link was a really good find.

Given the regex I was able to put several tests into regex101.com:

\s+[^\s]+(?![\p{InCJKUnifiedIdeographs}\p{InCJKUnifiedIdeographsExtensionA}\p{InCJKCompatibilityIdeographs}\x{30FB}\x{FF0C}\x{3007}\x{FF21}\x{FF3A}[0-9]])

This is the logic we are translating into AppleScript.

Here is an example where the expression breaks down: It would be much better to get a count of the number of Chinese words at the beginning of the input and use that as a variable in the regex to iterate the correct number of non-CJK characters.

Here we can see that when the non-CJK text follows immediately, the regex works well.

It even works when the pinyin falls at the end of the line.
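
One way to sidestep counting the leading Chinese characters is to let the pattern consume them itself. The Ruby sketch below is an alternative to the lookahead expression above, not a translation of it: it uses Ruby's \p{Han} script property in place of the InCJK... block names, and the function name first_latin_run is hypothetical. It assumes the entry starts with one or more Han characters followed by whitespace.

```ruby
# Capture the first run of non-Han, non-space characters that follows
# the leading Han headword. Returns nil if the text does not start with Han.
def first_latin_run(text)
  text[/\A\p{Han}+\s+([^\p{Han}\s]+)/, 1]
end

puts first_latin_run("我 wǒ 代 ①")          # prints "wǒ"
puts first_latin_run("老虎 lǎohǔ tiger")     # prints "lǎohǔ"
```

Because the headword is matched with \p{Han}+, the same pattern handles one-character and multi-character entries.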

mingsai Author

Level 1

30 points

Sep 14, 2014 12:07 PM in response to SGIII

If the last statement were made into:

if c... then return true

else

return false

The code paths would all return a valid boolean.

SGIII

Level 8

36,134 points

Sep 14, 2014 4:59 PM in response to mingsai

They each return a valid boolean as written, including:

c's id is greater than 19967 and c's id is less than 40909

You need 'return' in the first two tests to exit the handler if the boolean is true.

But you don't need it in the last one. Whether it is true or false, the handler will exit and return that value.

SGIII

Level 8

36,134 points

Sep 14, 2014 5:28 PM in response to mingsai

This is an AppleScript way to count the Chinese characters at the beginning of input:

set tt to input's paragraphs's contents as text -- merge paragraphs
set ww to tt's words -- split the text into words
set CJKctr to 0
repeat with i from 1 to count ww
    set w to ww's item i
    set cc to w's characters -- split a word into characters
    repeat with c in cc
        if isCJK(c) then
            set CJKctr to CJKctr + 1
        else
            return CJKctr
        end if
    end repeat
end repeat
return CJKctr -- reached only if every character was CJK

on isCJK(c)
    --if c's id is greater than 131071 and c's id is less than 173783 then return true -- extension B (really rare)
    --if c's id is greater than 13311 and c's id is less than 19894 then return true -- 3400-4DB5 - extension A (rare)
    c's id is greater than 19967 and c's id is less than 40909 -- 4E00-9FCC - common CJK
end isCJK
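
The counting step can be sketched in Ruby as well. The name leading_cjk_count is invented for this example and only the common range 4E00-9FCC is checked, as in the handler above. Whitespace is removed first, on the assumption that spaces between leading characters should not end the count (the AppleScript gets the same effect by splitting into words).

```ruby
# Hypothetical helper: count CJK characters at the start of the text,
# stopping at the first character outside the common CJK range.
COMMON_CJK = (0x4E00..0x9FCC).freeze

def leading_cjk_count(text)
  text.gsub(/\s+/, "")                                   # ignore spacing between characters
      .each_char
      .take_while { |ch| COMMON_CJK.cover?(ch.ord) }
      .length
end

puts leading_cjk_count("老虎 lǎohǔ tiger")   # prints 2
```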

Hiroto

Level 5

7,461 points

Sep 16, 2014 1:38 PM in response to mingsai

Hello

I have managed to invoke some of the undocumented functions of DictionaryServices.framework. The shell script listed below, which wraps a RubyCocoa script, transliterates Hanzi to pinyin by looking words up in a specified dictionary.

The key functions are DCSCopyRecordsForSearchString() and DCSRecordCopyData(), which retrieve the structured representation (XHTML) of the found entry. We can parse that to extract the title and pronunciation cleanly.

A usage example follows, provided that you have saved the script as /usr/local/bin/hanzi2pinyin (and run chmod a+x on it). Please see the source code for details of the options. The result depends on the dictionary used. The script tries to find the longest substring, starting at the beginning of the current query, that matches a term in the dictionary. If such a substring exists, the script looks it up in the dictionary, outputs the result, and updates the current query to the remaining substring; otherwise it gives up the first character of the current query, outputs the not-found result (specified by the -e option) for that character, and updates the current query to the remaining substring. Note that some words (characters) have multiple readings, in which case the script outputs the primary match followed by the additional matches in parentheses. Specify the maximum record count to retrieve with the -c option. The additional matches are noise when the primary match is correct, but the primary match is not necessarily correct. If you wish, you may specify -c1 to suppress additional matches.
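
The longest-match loop described above can be illustrated without DictionaryServices. In this Ruby sketch a small hash (TOY_DICT, purely illustrative) stands in for the dictionary, and segment is a hypothetical name. It shows only the control flow, not the real script's lookup or output options.

```ruby
# Toy stand-in for a dictionary; the entries are examples only.
TOY_DICT = {
  "悟空"   => "wùkōng",
  "捣鬼"   => "dǎoguǐ",
  "花果山" => "huāguǒshān",
  "花"     => "huā",
}.freeze

def segment(query, dict = TOY_DICT, not_found = "?")
  out = []
  until query.empty?
    # Try the longest leading substring first, shrinking until a term matches.
    len = query.length.downto(1).find { |n| dict.key?(query[0, n]) }
    if len
      out << [query[0, len], dict[query[0, len]]]
    else
      len = 1                                # give up one character
      out << [query[0, 1], not_found]
    end
    query = query[len..-1]                   # continue with the remainder
  end
  out
end

puts segment("悟空捣鬼花果山").map { |hanzi, pinyin| "#{hanzi}[#{pinyin}]" }.join(" ")
# prints: 悟空[wùkōng] 捣鬼[dǎoguǐ] 花果山[huāguǒshān]
```

Note that "花果山" wins over the shorter entry "花" because longer prefixes are tried first.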

Usage e.g.

#!/bin/bash

# dictf='/Library/Dictionaries/Simplified Chinese - English.dictionary'
# dictf='/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
# dictf='/Library/Dictionaries/小词典.dictionary'
# dictf='/Library/Dictionaries/小词典－繁体字.dictionary'
dictf='/Library/Dictionaries/CC-CEDICT.dictionary'

/usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o0 -e -- '悟空捣鬼花果山'
# => 悟空[wùkōng] 捣鬼[dǎoguǐ] 花果山[huāguǒshān]

/usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o1 -e -- '悟空捣鬼花果山'
# => 悟空捣鬼花果山[wùkōng dǎoguǐ huāguǒshān]

/usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o2 -e -- '悟空捣鬼花果山'
# => wùkōng dǎoguǐ huāguǒshān

hanzi2pinyin

#!/bin/bash

hanzi2pinyin() {
    #
    #   $@ = options query [query ...]
    #       -d, --dictionary DICTIONARY   Dictionary file.
    #       -c, --count COUNT             Max record count to retrieve (=10).
    #       -o, --output FORMAT           Output format (=0).
    #                                       0 = interleaved : H[p] H[p]...
    #                                       1 = separate : H H...[p p...]
    #                                       2 = pinyin only : p p...
    #       -e, --echo [CHARACTER]        Character(s) to be echoed for no result.
    #                                     Given no CHARACTER, query is echoed.
    #       -h, --help                    Display this help.
    #
    #   v0.33
    #   written by Hiroto, 2014-09
    #
    /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - <(cat <<'BRIDGESUPPORT'
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE signatures SYSTEM "file://localhost/System/Library/DTDs/BridgeSupport.dtd">
<signatures version="0.9">
<function name="DCSCopyTextDefinition">
    <arg type="^{__DCSDictionary=}"></arg>
    <arg type="^{__CFString=}"></arg>
    <arg type64="{_CFRange=qq}" type="{_CFRange=ii}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSGetTermRangeInString">
    <arg type="^{__DCSDictionary=}"></arg>
    <arg type="^{__CFString=}"></arg>
    <arg type64="q" type="l"></arg>
    <retval type64="{_CFRange=qq}" type="{_CFRange=ii}"></retval>
</function>
<function name="DCSDictionaryCreate">
    <arg type="^{__CFURL=}"></arg>
    <retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSGetActiveDictionaries">
    <retval type="^{__CFArray=}"></retval>
</function>
<function name="DCSGetDefaultDictionary">
    <retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSGetDefaultThesaurus">
    <retval type="^{__DCSDictionary=}"></retval>
</function>
<function name="DCSDictionaryGetURL">
    <arg type="^{__DCSDictionary=}"></arg>
    <retval type="^{__CFURL=}"></retval>
</function>
<function name="DCSDictionaryGetName">
    <arg type="^{__DCSDictionary=}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSDictionaryGetIdentifier">
    <arg type="^{__DCSDictionary=}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSCopyRecordsForSearchString">
    <arg type="^{__DCSDictionary=}"></arg>
    <arg type="^{__CFString=}"></arg>
    <arg type="l"></arg>
    <arg type="l"></arg>
    <retval type="^{__CFArray=}"></retval>
</function>
<function name="DCSRecordGetHeadword">
    <arg type="^{__DCSRecord=}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetString">
    <arg type="^{__DCSRecord=}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetRawHeadword">
    <arg type="^{__DCSRecord=}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetTitle">
    <arg type="^{__DCSRecord=}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetAnchor">
    <arg type="^{__DCSRecord=}"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
<function name="DCSRecordGetDataURL">
    <arg type="^{__DCSRecord=}"></arg>
    <retval type="^{__CFURL=}"></retval>
</function>
<function name="DCSRecordCopyData">
    <arg type="^{__DCSRecord=}"></arg>
    <arg type="l"></arg>
    <retval type="^{__CFString=}"></retval>
</function>
</signatures>
BRIDGESUPPORT
) "$@"
# -----------------------------------------------------
#   * some DictionaryServices functions
#
#   (undocumented)
#
#   extern CFArrayRef DCSGetActiveDictionaries (void)
#   extern DCSDictionaryRef DCSGetDefaultDictionary (void)
#   extern DCSDictionaryRef DCSGetDefaultThesaurus (void)
#   extern DCSDictionaryRef DCSDictionaryCreate (CFURLRef)
#   extern CFURLRef DCSDictionaryGetURL (DCSDictionaryRef)
#   extern CFStringRef DCSDictionaryGetName (DCSDictionaryRef)
#   extern CFStringRef DCSDictionaryGetIdentifier (DCSDictionaryRef)
#
#   extern CFArray DCSCopyRecordsForSearchString (DCSDictionaryRef, CFStringRef, unsigned long long, long long)
#       unsigned long long method
#           0 = exact match
#           1 = forward match (prefix match)
#           2 = partial query match (matching (leading) part of query; including ignoring diacritics, four tones in Chinese, etc)
#           >=3 = ? (exact match?)
#
#       long long max_record_count
#
#   extern CFStringRef DCSRecordGetString (DCSRecordRef)
#   extern CFStringRef DCSRecordGetHeadword (DCSRecordRef)
#   extern CFStringRef DCSRecordGetRawHeadword (DCSRecordRef)
#   extern CFStringRef DCSRecordGetTitle (DCSRecordRef)
#   extern CFStringRef DCSRecordGetAnchor (DCSRecordRef)
#   extern CFURLRef DCSRecordGetDataURL (DCSRecordRef)
#
#   extern CFStringRef DCSRecordCopyData (DCSRecordRef, long)
#       long output_style
#           0 = XML XHTML <html> string
#           1 = XML XHTML <html> string
#           2 = XML XHTML <html> string
#           3 = plain text
#           4 = XML XHTML <text> string (single element)
#       * corresponding to (?)
#           Transform.xsl
#           TransformApp.xsl
#           TransformPanel.xsl
#           TransformSimpleText.xsl
#           TransformText.xsl
#
#   (documented)
#
#   CFStringRef DCSCopyTextDefinition (DCSDictionaryRef, CFStringRef, CFRange)
#   CFRange DCSGetTermRangeInString (DCSDictionaryRef, CFStringRef, CFIndex)
#
# -----------------------------------------------------
#
#   ARGV[0] = fixed and extended bridge support file for DictionaryServices.framework
#   ARGV[1..N] = options and query word(s)
#
require 'osx/cocoa'
include OSX
# OSX.require_framework '/System/Library/Frameworks/CoreServices.framework/Frameworks/DictionaryServices.framework' # [1]
OSX.load_bridge_support_file ARGV.shift # [2]

def parse_options(argv)
    require 'optparse'

    args = {
        :dictf => nil,
        :count => 10,
        :output => 0,
        :echo => '',
    }
    op = OptionParser.new do |o|
        o.banner = "Usage: #{File.basename($0)} options query [query ...]"
        o.on('-d', '--dictionary DICTIONARY', String, "Dictionary file.") do |f|
            args[:dictf] = f
        end
        o.on('-c', '--count COUNT', Integer, "Max record count to retrieve (=10).") do |i|
            raise OptionParser::InvalidArgument, i unless i.to_i > 0
            args[:count] = i.to_i
        end
        o.on('-o', '--output FORMAT', Integer, "Output format (=0).",
            " 0 = interleaved : H[p] H[p]...",
            " 1 = separate : H H...[p p...]",
            " 2 = pinyin only : p p...") do |i|
            raise OptionParser::InvalidArgument, i unless [0, 1, 2].include?(i.to_i)
            args[:output] = i.to_i
        end
        o.on('-e', '--echo [CHARACTER]', String, "Character(s) to be echoed for no result.",
            "Given no CHARACTER, query is echoed.") do |s|
            args[:echo] = s || ''
        end
        o.on( '-h', '--help', 'Display this help.' ) do
            $stderr.puts o; exit 1
        end
    end
    begin
        op.parse!(argv)
    rescue => ex
        $stderr.puts "#{ex.class} : #{ex.message}"
        $stderr.puts op.help(); exit 1
    end
    if argv.length == 0
        $stderr.puts op.help(); exit 1
    end
    args
end

args = parse_options(ARGV)

if (dctf = args[:dictf])
    unless File.exists?(dctf)
        $stderr.puts "No such dictionary: %s" % dctf
        exit 1
    end
    url = NSURL.fileURLWithPath(dctf)
    dct = DCSDictionaryCreate(url)
    unless dct
        $stderr.puts "Failed to create dictionary object from: %s" % dctf
        exit 2
    end
else
    dct = DCSGetDefaultDictionary()
    unless dct
        $stderr.puts "Failed to obtain default dictionary"
        exit 2
    end
end

QUERY_METHOD = 0                  # exact match
MAX_RECORD_COUNT = args[:count]   # max record count to be retrieved
OUTPUT_FORMAT = args[:output]     # output format option
    # 0 = interleaved : H[p] H[p]...
    # 1 = separate : H H...[p p...]
    # 2 = pinyin only : p p...
    #
    # e.g., given query '我的母亲'
    # 0 => 我[wǒ] 的[de(dī,dí,dì)] 母亲[mǔqīn]
    # 1 => 我的母亲[wǒ de(dī,dí,dì) mǔqīn]
    # 2 => wǒ de(dī,dí,dì) mǔqīn
TRIM_CHARS = "\t\n |"             # characters to be trimmed at both ends of pronunciation string
ECHO_QUERY = ''                   # special character to let it echo query if result is not found
ECHO_CHAR = args[:echo]           # character(s) to be echoed if no result is found for query
                                  # if ECHO_QUERY is specified, query string is echoed for no result
TRIM_CHARS_SET = NSCharacterSet.characterSetWithCharactersInString(TRIM_CHARS)
ECHO_NS = ECHO_CHAR.to_ns

ARGV.map {|a| a.to_ns }.each do |q| # [3]
    dd = []
    while true do
        #
        #   Until given query string (q) is exhausted, repeat as follows -
        #   get longest leading substring (qu) of the query string matching a term in dictionary,
        #   look the substring up in dictionary and retrieve title and pronunciation of the matching entry.
        #
        u = DCSGetTermRangeInString(dct, q, 0) # try to find longest leading range matching a term in dictionary
        u = NSMakeRange(0, 1) if u.location == KCFNotFound # fallback [4]
        qu = q.substringWithRange(u)
        rr = DCSCopyRecordsForSearchString(dct, qu, QUERY_METHOD, MAX_RECORD_COUNT)
        unless rr
            c = q.substringWithRange(NSMakeRange(0, 1)) # give up one character at the beginning
            dd << [[c, ECHO_CHAR == ECHO_QUERY ? c : ECHO_NS]]
            break if q.length < 2
            q = q.substringFromIndex(1)
        else
            tt, pp = [], {}
            rr.each do |r| # r = DCSRecordRef
                #
                #   parse xml representation of record entry to get title and pronunciation
                #
                xml = DCSRecordCopyData(r, 0)
                err = OCObject.new
                doc = NSXMLDocument.alloc.objc_send(
                    :initWithXMLString, xml,
                    :options, 0,
                    :error, err)
                unless doc
                    $stderr.puts "Failed to obtain XML document for %s: %s" % [qu, err.description]
                    next
                end
                nn = doc.objc_send(
                    :nodesForXPath, '//d:entry/@d:title', # d:title attribute
                    :error, nil)
                title = nn && nn == [] ? ECHO_NS : nn.first.stringValue
                nn = doc.objc_send(
                    :nodesForXPath, '//d:entry//span[@d:pr]', # span element with d:pr attribute
                    :error, nil)
                pron = nn && nn == [] ? ECHO_NS : nn.first.stringValue
                pron = pron.stringByTrimmingCharactersInSet(TRIM_CHARS_SET).
                    stringByReplacingOccurrencesOfString_withString(' ', '').lowercaseString
                tt << title unless tt.include?(title)
                title_s = title.to_s # for use as hash key in ruby
                if not pp.key?(title_s)
                    pp[title_s] = [pron]
                elsif not pp[title_s].include?(pron)
                    pp[title_s] << pron
                end
            end
            #
            #   Let query_{k} denote sub-query for k-th substring defined by range u,
            #   title_{k,i} denote i-th found title for query_{k},
            #   pron_{k,i,j} denote j-th pronunciation for title_{k,i};
            #
            #   array cc_k holds each collection of pronunciations per tile_{k,i} found for query_{k}:
            #       cc_k = [c_{k,1}, c_{k,2}, ...]
            #       c_{k,i} = [ title_{k,i}, pron_{k,i,1} *1( '(' pron_{k,i,2} ',' pron_{k,i,3} ',' ... ')' ) ]
            #
            #   array dd holds list of cc_k for every sub-query_{k}
            #       dd = [cc_1, cc_2, ...]
            #
            cc_k = tt.map do |t|
                a = pp[t.to_s]
                [t, a.shift + (a == [] ? '' : "(%s)" % a.join(','))]
            end
            dd << cc_k
            k = u.location + u.length
            break unless k < q.length
            q = q.substringFromIndex(k)
        end
    end

    case OUTPUT_FORMAT
        # 0 = interleaved : H[p] H[p]...
        # 1 = separate : H H...[p p...]
        # 2 = pinyin only : p p...
    when 0
        ee = dd.map do |cc|
            next '' if cc == []
            ("%s[%s]" % cc.shift) + (cc == [] ? '' : "(%s)" % cc.map {|c| "%s[%s]" % c}.join(','))
        end
        puts ee.join(' ')
    when 1
        aa = dd.map do |cc|
            a, b = cc.transpose
            next '' unless a
            (a.shift) + (a == [] ? '' : "(%s)" % a.join(','))
        end
        bb = dd.map do |cc|
            a, b = cc.transpose
            next '' unless b
            (b.shift) + (b == [] ? '' : "(%s)" % b.join(','))
        end
        puts "%s[%s]" % [aa.join(' '), bb.join(' ')]
    when 2
        bb = dd.map do |cc|
            a, b = cc.transpose
            next '' unless b
            (b.shift) + (b == [] ? '' : "(%s)" % b.join(','))
        end
        puts bb.join(' ')
    end
end

#
# [1] DictionaryServices.framework/Resources/BridgeSupport/DictionaryServices.bridgesupport has problem to be fixed.
#     I.e., in signatures of DCSCopyTextDefinition(), DCSGetTermRangeInString() function etc,
#         {??=qq} should have been {_CFRange=qq}
#         {??=ii} should have been {_CFRange=ii}
# [2] Fixed and extended bridgesupport file is loaded by OSX.load_bridge_support_file.
#     It now includes signatures for several undocumented functions as well.
# [3] argv.to_ns is required to handle unicode characters correctly (in ruby 1.8).
# [4] DCSGetTermRangeInString(dct, q, 0) returning range [KCFNotFound, 0] does not necessarily mean q's 1st character
#     as query may not match any term in dictionary. It is necessary to use DCSCopyRecordsForSearchString()
#     for the 1st character in order to know the (existence of) matching term(s).
#
EOF
}

hanzi2pinyin "$@"

Tested under 10.6.8. As I used undocumented functions, they may have changed in later versions, which will break the script.

Hope this may help,

Hiroto

Level 5

7,461 points

Sep 16, 2014 2:35 PM in response to mingsai

Hello

Here's another tool. It is much simpler and easier to use. If only it were more accurate in hanzi-to-pinyin transliteration...

The key function is CFStringTransform(), which implements ICU transliteration. The problem is that ICU Han-Latin transliteration works on a per-character basis and thus may yield badly wrong results when a character has multiple readings depending on context. It is too inaccurate to be used as a learning tool.

cf.

http://userguide.icu-project.org/transforms/general

However, Traditional-Simplified and Simplified-Traditional transliterators will be handy and useful.

#!/bin/bash

hanzi_trans() {
    #
    #   $@ = <ICU Transliterator ID> string [string ...]
    #
    /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - "$@"
require 'osx/cocoa'
include OSX
icu_tr = ARGV.shift;
ARGV.map {|a| a.to_ns}.each do |q|
    CFStringTransform(q, nil, icu_tr, false)
    puts q
end
EOF
}

hanzi_trans 'Han-Latin' '悟空搗鬼花果山'
# => wù kōng dǎo guǐ huā guǒ shān (CORRECT)

hanzi_trans 'Han-Latin' '大地的女儿'
# => dà de de nǚ r (WRONG)

hanzi_trans 'Simplified-Traditional' '简体字' '繁体字'
# => 簡體字 \n 繁體字

hanzi_trans 'Traditional-Simplified' '簡體字' '繁體字'
# => 简体字 \n 繁体字

#
# * Note that "Han-Latin" ICU transliteration seems to be done per Han character and thus may yield wrong result.
#
#   E.g.
#
#   CORRECT
#       天地 = tiāndì
#       地球 = dìqiú
#       大地的女儿 = dàdì de nǚ'ér
#
#   WRONG (ICU transliteration result)
#       天地 = tiān de
#       地球 = de qiú
#       大地的女儿 = dà de de nǚ r
#

Regards,

SGIII

Level 8

36,134 points

Sep 17, 2014 12:27 PM in response to Hiroto

Hi H,

Nice work! I would like to try your script in 10.9 (Mavericks) but am a little wobbly on installing scripts and getting them to work.

I think I've found a way to unhide /usr/local/bin/ (in Mavericks you can't just open Finder > Go > Go to Folder... and type it in).

Then what do I do?

Save the script just as hanzi2pinyin? Or do I need a specific file extension such as .rb?

How do I then chmod a+x it?

And how would I call it from AppleScript do shell script? Pipe in as stdin?

Thanks,

Hiroto

Level 5

7,461 points

Sep 18, 2014 6:30 AM in response to SGIII

Hello SG,

You may save the script anywhere you want, not necessarily in /usr/local/bin, and call it by specifying its path.

Anyway, you may do as follows to install it in /usr/local/bin, provided you saved the script as a UTF-8 plain text file named "hanzi2pinyin" on the desktop. A name extension is optional, but having none is better for a script used as a user command.

#!/bin/bash
#
#   install /usr/local/bin/hanzi2pinyin
#
#   1) save the script as utf8 plain text file in ~/desktop/hanzi2pinyin
#   2) run the following commands in Terminal.app
#
cd ~/desktop || exit
chmod a+x hanzi2pinyin
[[ -d /usr/local/bin ]] || sudo mkdir /usr/local/bin
sudo cp -pP hanzi2pinyin /usr/local/bin

You may manually type these in Terminal or you may save the above code as plain text file named "install.command" on desktop and double click it. Either way, script will ask you to enter your account password.

Once the command is installed, the AppleScript would be something like this.

--set dictf to "/Library/Dictionaries/Simplified Chinese - English.dictionary"
--set dictf to "/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary"
--set dictf to "/Library/Dictionaries/小词典.dictionary"
--set dictf to "/Library/Dictionaries/小词典－繁体字.dictionary"
set dictf to "/Library/Dictionaries/CC-CEDICT.dictionary"
set query to "悟空捣鬼花果山"

--hanzi2pinyin(dictf, 10, 0, query)
hanzi2pinyin(dictf, 10, 1, query)
--hanzi2pinyin(dictf, 10, 2, query)

on hanzi2pinyin(dictf, max_count, output_format, query)
    (*
        string dictf : POSIX path of dictionary file
        integer max_count : Max record count to retrieve
        integer output_format : Output format.
            0 = interleaved : H[p] H[p]...
            1 = separate : H H...[p p...]
            2 = pinyin only : p p...
        string query : query string
        return string : Hanzi[pinyin] in specified output format
    *)
    do shell script "d=" & dictf's quoted form & "; c=" & max_count & "; o=" & output_format & "; /usr/local/bin/hanzi2pinyin -d \"$d\" -c \"$c\" -o \"$o\" -e -- " & query's quoted form
end hanzi2pinyin

Good luck,
