
How can I use Automator to extract substring of text based on pattern?

I have inbound text in a workflow and I want to extract a substring from the text.


(inbound text)


我 wǒ 代 ① (指一人) (用作主语) I (用作宾语) me (表所属关系) my 告诉我 tell me 我为人人,人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行! I think I can manage it. ② (指两人或以上) (用作主语) we (用作宾语) us (表所属关系) our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ (表泛指) [used together with 你 in parallel structures] anyone 大家你一言,我一语,献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ (指自我) self → 忘我, 自我


I basically want only the non-double-byte characters that sit between the first character and the first occurrence of ① (see sample):


代 ①


Using regex101.com, I have determined that this regex pattern should produce the required results, but I need help getting these results into Automator:


/ ([a-z].) /

MacBook Air (13-inch Mid 2012), Mac OS X (10.7.5), Love the mac (OSX 10.9+)

Posted on Sep 9, 2014 2:59 PM

Question marked as Top-ranking reply

Posted on Sep 9, 2014 7:24 PM

If your pinyin substring never has spaces in it, then you can use AppleScript to extract it like this:





This is the script (with the sample input). Copy/paste into AppleScript Editor and click the green triangle 'Run' button:



set input to "我 wǒ 代 ① (指一人) (用作主语) I (用作宾语) me (表所属关系) my 告诉我 tell me 我为人人,人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行! I think I can manage it. ② (指两人或以上) (用作主语) we (用作宾语) us (表所属关系) our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ (表泛指) [used together with 你 in parallel structures] anyone 大家你一言,我一语,献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ (指自我) self → 忘我, 自我"


set pyCharSet to {"a", "ā", "á", "ǎ", "à", "b", "c", "d", "e", "ē", "é", "ě", "è", "f", "g", "h", "i", "ī", "í", "ǐ", "ì", "j", "k", "l", "m", "n", "o", "ō", "ó", "ǒ", "ò", "p", "q", "r", "s", "t", "u", "ū", "ú", "ǔ", "ù", "ǖ", "ǚ", "ǜ", "w", "x", "y", "z"}

set {oTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "①"}

set cc to (input as string)'s text items's item 1's characters 2 thru -1

set py to ""

repeat with c in cc

if c is in pyCharSet then set py to py & c

end repeat

set AppleScript's text item delimiters to oTID

return py



The above view is AppleScript Editor, not Automator. What do you plan to do with the results in Automator?
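For anyone who would rather do this outside AppleScript, here is a minimal Ruby sketch of the same idea: take the text before the first ①, drop the leading headword character, and keep only pinyin letters. The names extract_pinyin and PINYIN_CHARS are made up for this illustration and are not part of the script above. The sketch assumes precomposed tone-marked vowels and, like the AppleScript version, drops any spaces inside the pinyin.

```ruby
# Hypothetical helper, not from the thread: pull the pinyin that sits between
# the headword character and the first circled numeral.
# Assumption: tone-marked vowels are precomposed characters (e.g. "ǒ").
PINYIN_CHARS = "a-zāáǎàēéěèīíǐìōóǒòūúǔùüǖǘǚǜ"

def extract_pinyin(entry)
  head = entry.split("①", 2).first.to_s   # text before the first ①
  rest = head[1..-1].to_s                  # drop the leading headword character
  rest.scan(/[#{PINYIN_CHARS}]/).join      # keep pinyin letters only
end

puts extract_pinyin("我 wǒ 代 ① (指一人) (用作主语) I (用作宾语) me")
```

In an Automator workflow, something like this could go into a Run Shell Script action that reads the inbound text, but that wiring is not shown here.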


SG

51 replies

Sep 25, 2014 12:50 PM in response to mingsai

Great to hear it works in Yosemite too.


In Mavericks I've tried the different dictionaries.


Good results:


CC-CEDICT.dictionary --> wu4kong1 dao3gui3 hua1guo3shan1

小词典.dictionary --> wùkōng dǎoguǐ huāguǒshān
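These two results differ only in notation: CC-CEDICT gives numbered tones, while 小词典 gives tone marks. Below is a rough Ruby sketch of converting one into the other. It assumes lowercase syllables that each end in a tone digit from 1 to 5, and it uses the usual placement rule (a or e takes the mark, then the o of ou, otherwise the last vowel). The names TONE_MARKS, mark_syllable and numbered_to_marked are hypothetical, and spellings such as u: for ü are not handled.

```ruby
# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
  "a" => "āáǎà", "e" => "ēéěè", "i" => "īíǐì",
  "o" => "ōóǒò", "u" => "ūúǔù", "ü" => "ǖǘǚǜ"
}

# Put the tone mark on one syllable, e.g. ("guo", 3) becomes "guǒ".
def mark_syllable(letters, tone)
  return letters unless (1..4).cover?(tone)   # tone 5 (neutral) carries no mark
  idx = letters.index(/[ae]/) || letters.index("ou") || letters.rindex(/[iouü]/)
  return letters if idx.nil?                  # no vowel found: leave unchanged
  marked = TONE_MARKS[letters[idx]][tone - 1]
  letters[0...idx] + marked + letters[(idx + 1)..-1]
end

# Convert every "letters + digit" syllable in a string.
def numbered_to_marked(text)
  text.gsub(/([a-zü]+)([1-5])/) { mark_syllable($1, $2.to_i) }
end

puts numbered_to_marked("wu4kong1 dao3gui3 hua1guo3shan1")
```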


A little bit off:

小词典-繁体字.dictionary ---> wùkōng guǐ huāguǒshān


Almost correct, but the "wordification" is a little off, and the multiple pronunciations given for 多音字 (characters with more than one reading) may be messy for some purposes:


The Standard Dictionary of Contemporary Chinese.dictionary --> wù kōng(kòng) dǎoguǐ huāguǒshān

Simplified Chinese - English.dictionary --> wù kōng(kòng) dǎoguǐ huā guǒ shān


SG

Sep 25, 2014 1:29 PM in response to SGIII

SGIII and Hiroto,


FYI - I received the following results (in parentheses) using the various dictionaries on my system.


-- (wǒ shì měiguó rén) set dictf to "/Library/Dictionaries/Simplified Chinese - English.dictionary"

-- (wǒ shì měi guórén) set dictf to "/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary"

-- (failed, I may not have this dictionary) set dictf to "/Library/Dictionaries/小词典.dictionary"

-- (empty) set dictf to "/Library/Dictionaries/unihan.dictionary"

-- (wǒ shì měi rén) set dictf to "/Library/Dictionaries/小词典-繁体字.dictionary"

-- ( ) set dictf to "/Library/Dictionaries/小词典-英语.dictionary"

-- ( ) set dictf to "/Library/Dictionaries/kangxizidian.dictionary"

-- ( ) set dictf to "/Library/Dictionaries/BHSD.dictionary"

-- ( () << empty set) set dictf to "/Library/Dictionaries/CC-CEDICT.dictionary"

Sep 26, 2014 12:31 PM in response to SGIII

Hello SG,


小词典-繁体字.dictionary returns that result because 捣 is a simplified hanzi character. Its traditional equivalent is 搗.


Given the query = 悟空搗鬼花果山, 小词典-繁体字.dictionary will return 悟空 搗鬼 花果山[wùkōng dǎoguǐ huāguǒshān].


Also, if you want to suppress the alternative readings returned in parentheses, you may set the max_count parameter to 1 in the AppleScript code (or specify the -c1 option in the shell code). However, please note that the first match in the dictionary is not necessarily the correct one.
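If limiting the record count is not convenient, the alternatives could also be stripped from the result afterwards. Here is a minimal Ruby sketch, assuming the alternatives always appear as a parenthesized group directly after a syllable, as in the outputs quoted earlier in this thread. The name strip_alternatives is hypothetical. Like the count limit, this simply keeps the first reading, which may not be the right one.

```ruby
# Hypothetical helper: remove parenthesized alternative readings,
# e.g. "kōng(kòng)" becomes "kōng".
def strip_alternatives(pinyin)
  pinyin.gsub(/\s*\([^()]*\)/, "")
end

puts strip_alternatives("wù kōng(kòng) dǎoguǐ huāguǒshān")
```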


Regards,

H

Sep 26, 2014 12:35 PM in response to mingsai

Hello mingsai,


The result of the script depends on the dictionary in use, because different dictionaries can use different XSL stylesheets to generate the XML (XHTML) representation of a record entry. If the XML data does not have the elements and attributes the script expects, the parser will fail and return a Not-found result.



E.g., 小词典.dictionary will return the following XML data for query word = '我'


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
    <d:entry id="xcd09325cdd85ded7c7cab6332260857fbd" d:title="我">
        <span class="syntax"><span d:pr="US">wǒ</span></span>
        <h1>我</h1>
        <div>I; me; my</div>
        <h3>我</h3>
        <div class="editEntry">
            <a href="http://xiaocidian.com/a/index.php?word=%E6%88%91">Edit CC-CEDICT Entry</a>
        </div>
    </d:entry>
</body>
</html>



where the script retrieves the title as the value at XPath //d:entry/@d:title and the pronunciation at //d:entry//span[@d:pr].
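To make those two lookups concrete, here is a small Ruby sketch using REXML against a trimmed copy of the record quoted above. It is not the hanzi2pinyin script itself. The names ENTRY_XML and title_and_pinyin are invented for the example, and the d:pr check is done in Ruby rather than in an XPath predicate to keep the namespace handling simple.

```ruby
require "rexml/document"

# Trimmed copy of the 小词典.dictionary record for 我 (DOCTYPE and head omitted).
ENTRY_XML = <<~XML_DATA
  <html xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
    <body>
      <d:entry id="xcd09325cdd85ded7c7cab6332260857fbd" d:title="我">
        <span class="syntax"><span d:pr="US">wǒ</span></span>
        <h1>我</h1>
        <div>I; me; my</div>
      </d:entry>
    </body>
  </html>
XML_DATA

# Hypothetical helper: returns [title, pinyin], or nil when the expected
# d:entry element is missing (the "Not-found" case for other layouts).
def title_and_pinyin(xml)
  doc = REXML::Document.new(xml)
  entry = REXML::XPath.first(doc, "//d:entry")
  return nil unless entry
  title = entry.attributes["d:title"]
  # First span under the entry that carries a d:pr attribute.
  span = REXML::XPath.match(entry, ".//span").find { |s| s.attributes["d:pr"] }
  [title, span && span.text]
end

p title_and_pinyin(ENTRY_XML)
```

A dictionary that marks up its pronunciation differently would return nil or a missing pinyin here, which is why the parsing has to be adjusted per dictionary.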


In the end, you would have to adapt the parsing logic in the hanzi2pinyin script to the XML structure of the record entries of each dictionary you use. There is no universal XPath or regex pattern for this. If you wish, you can use the following script to obtain the XML data of an entry from a specified dictionary.



#!/bin/bash

# dictf='/Library/Dictionaries/Simplified Chinese - English.dictionary'
# dictf='/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
dictf='/Library/Dictionaries/小词典.dictionary'
# dictf='/Library/Dictionaries/小词典-繁体字.dictionary'
# dictf='/Library/Dictionaries/小词典-英语.dictionary'
# dictf='/Library/Dictionaries/CC-CEDICT.dictionary'

# CMD=/usr/local/bin/dictionary_record_data.rb
CMD=~/desktop/dictionary_record_data.rb

"$CMD" -d "$dictf" -c10 -o0 '我'



provided that you have saved the following Ruby script as dictionary_record_data.rb on your desktop.



dictionary_record_data.rb


#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w
# coding: utf-8

require 'optparse'
require 'osx/cocoa'
include OSX

# OSX.require_framework '/System/Library/Frameworks/CoreServices.framework/Frameworks/DictionaryServices.framework'    # [1]
while File.exist?(BSFILE = File.expand_path("~/desktop/DictionaryServices.#{rand(1e10)}.bridgesupport")) do end
Signal.trap("EXIT") { File.delete BSFILE if File.exist?(BSFILE) }
File.open(BSFILE, "w") { |f| f.print DATA.read }
OSX.load_bridge_support_file BSFILE    # [2]
File.delete BSFILE if File.exist?(BSFILE)

# -----------------------------------------------------
#   * some DictionaryServices functions (OS X 10.6.8)
#
#   extern CFArrayRef DCSCopyRecordsForSearchString (DCSDictionaryRef, CFStringRef, unsigned long long, long long)
#       unsigned long long method
#           0   = exact match
#           1   = forward match (prefix match)
#           2   = partial query match (matching (leading) part of query; including ignoring diacritics, four tones in Chinese, etc)
#           >=3 = ? (exact match?)
#
#       long long max_record_count
#
#   extern CFStringRef DCSRecordCopyData (DCSRecordRef, long)
#       long output_style
#           0 = XML XHTML <html> string
#           1 = XML XHTML <html> string
#           2 = XML XHTML <html> string
#           3 = plain text
#           4 = XML XHTML <text> string (single element)
#           * corresponding to (?)
#               Transform.xsl
#               TransformApp.xsl
#               TransformPanel.xsl
#               TransformSimpleText.xsl
#               TransformText.xsl
# -----------------------------------------------------

def dict(argv)
    #
    #   argv = options query [query ...]
    #       -d, --dictionary DICTIONARY      Dictionary file.
    #       -c, --count COUNT                Max record count to retrieve (=10).
    #       -o, --output FORMAT              Output format (=0).
    #                                          0 = XML (XHTML) <html> string
    #                                          1 = XML (XHTML) <html> string
    #                                          2 = XML (XHTML) <html> string
    #                                          3 = plain text
    #                                          4 = XML (XHTML) <text> string
    #       -h, --help                       Display this help.
    #
    args = { :dictf => nil, :count => 10, :output => 0, }

    op = OptionParser.new do |o|
        o.banner = "Usage: #{File.basename($0)} options query [query ...]"
        o.on('-d', '--dictionary DICTIONARY', String, "Dictionary file.") do |f|
            args[:dictf] = f
        end
        o.on('-c', '--count COUNT', Integer, "Max record count to retrieve (=10).") do |i|
            raise OptionParser::InvalidArgument, i unless i.to_i > 0
            args[:count] = i.to_i
        end
        o.on('-o', '--output FORMAT', Integer, "Output format (=0).",
            " 0 = XML (XHTML) <html> string",
            " 1 = XML (XHTML) <html> string",
            " 2 = XML (XHTML) <html> string",
            " 3 = plain text",
            " 4 = XML (XHTML) <text> string") do |i|
            raise OptionParser::InvalidArgument, i unless [0, 1, 2, 3, 4].include?(i.to_i)
            args[:output] = i.to_i
        end
        o.on( '-h', '--help', 'Display this help.' ) do
            $stderr.puts o; exit 1
        end
    end

    begin
        op.parse!(argv)
    rescue => ex
        $stderr.puts "#{ex.class} : #{ex.message}"
        $stderr.puts op.help(); exit 1
    end

    if argv.length == 0
        $stderr.puts op.help(); exit 1
    end

    if (dctf = args[:dictf])
        unless File.exists?(dctf)
            $stderr.puts "No such dictionary: %s" % dctf
            exit 1
        end
        url = NSURL.fileURLWithPath(dctf)
        dcts = dcts.allObjects if (dcts = DCSCopyAvailableDictionaries()).is_a? NSSet    # [5]
        dct, = dcts.select { |d| DCSDictionaryGetURL(d).path == url.path }
        unless dct
            $stderr.puts "Failed to get dictionary for: %s" % dctf
            exit 2
        end
    else
        dct, = DCSGetActiveDictionaries()
        unless dct
            $stderr.puts "Failed to get the 1st active dictionary"
            exit 2
        end
    end

    max_count = args[:count]
    output_format = args[:output]

    argv.map {|a| a.to_ns }.each do |q|    # [3]
        rr = DCSCopyRecordsForSearchString(dct, q, 0, max_count)
        unless rr
            puts "Not found: %s" % q
            next
        end
        rr.each do |r|    # r = DCSRecordRef
            data = DCSRecordCopyData(r, output_format)
            puts data
        end
    end

    #
    #   [1] DictionaryServices.framework/Resources/BridgeSupport/DictionaryServices.bridgesupport has problem to be fixed.
    #       I.e., in signatures of DCSCopyTextDefinition(), DCSGetTermRangeInString() function etc,
    #           {??=qq} should have been {_CFRange=qq}
    #           {??=ii} should have been {_CFRange=ii}
    #   [2] Fixed and extended bridgesupport file is loaded by OSX.load_bridge_support_file.
    #       It now includes signatures for several undocumented functions as well.
    #   [3] argv.to_ns is required to handle unicode characters correctly (in ruby 1.8).
    #
end

dict(ARGV)
exit

# ---- test code begins ----
# dictf = '/Library/Dictionaries/Simplified Chinese - English.dictionary'
# dictf = '/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
dictf = '/Library/Dictionaries/小词典.dictionary'
# dictf = '/Library/Dictionaries/小词典-繁体字.dictionary'
# dictf = '/Library/Dictionaries/小词典-英语.dictionary'
# dictf = '/Library/Dictionaries/CC-CEDICT.dictionary'
argv = ['-d', dictf, '我']
dict(argv)
# ---- test code ends ----

__END__
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE signatures SYSTEM "file://localhost/System/Library/DTDs/BridgeSupport.dtd">
<signatures version="0.9">
    <function name="DCSCopyTextDefinition">
        <arg type="^{__DCSDictionary=}"></arg>
        <arg type="^{__CFString=}"></arg>
        <arg type64="{_CFRange=qq}" type="{_CFRange=ii}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSGetTermRangeInString">
        <arg type="^{__DCSDictionary=}"></arg>
        <arg type="^{__CFString=}"></arg>
        <arg type64="q" type="l"></arg>
        <retval type64="{_CFRange=qq}" type="{_CFRange=ii}"></retval>
    </function>
    <function name="DCSDictionaryCreate">
        <arg type="^{__CFURL=}"></arg>
        <retval type="^{__DCSDictionary=}"></retval>
    </function>
    <function name="DCSGetActiveDictionaries">
        <retval type="^{__CFArray=}"></retval>
    </function>
    <function name="DCSCopyAvailableDictionaries">
        <retval type="^{__CFSet=}"></retval>
    </function>
    <function name="DCSGetDefaultDictionary">
        <retval type="^{__DCSDictionary=}"></retval>
    </function>
    <function name="DCSGetDefaultThesaurus">
        <retval type="^{__DCSDictionary=}"></retval>
    </function>
    <function name="DCSDictionaryGetURL">
        <arg type="^{__DCSDictionary=}"></arg>
        <retval type="^{__CFURL=}"></retval>
    </function>
    <function name="DCSDictionaryGetName">
        <arg type="^{__DCSDictionary=}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSDictionaryGetIdentifier">
        <arg type="^{__DCSDictionary=}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSCopyRecordsForSearchString">
        <arg type="^{__DCSDictionary=}"></arg>
        <arg type="^{__CFString=}"></arg>
        <arg type="l"></arg>
        <arg type="l"></arg>
        <retval type="^{__CFArray=}"></retval>
    </function>
    <function name="DCSRecordGetHeadword">
        <arg type="^{__DCSRecord=}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSRecordGetString">
        <arg type="^{__DCSRecord=}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSRecordGetRawHeadword">
        <arg type="^{__DCSRecord=}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSRecordGetTitle">
        <arg type="^{__DCSRecord=}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSRecordGetAnchor">
        <arg type="^{__DCSRecord=}"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
    <function name="DCSRecordGetDataURL">
        <arg type="^{__DCSRecord=}"></arg>
        <retval type="^{__CFURL=}"></retval>
    </function>
    <function name="DCSRecordCopyData">
        <arg type="^{__DCSRecord=}"></arg>
        <arg type="l"></arg>
        <retval type="^{__CFString=}"></retval>
    </function>
</signatures>



All the best,

H
