mingsai

Q: How can I use Automator to extract substring of text based on pattern?

I have inbound text in a workflow and I want to extract a substring from the text.

 

(inbound text)

 

我 wǒ 代 ① (指一人) (用作主语) I (用作宾语) me (表所属关系) my 告诉我 tell me 我为人人,人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行! I think I can manage it. ② (指两人或以上) (用作主语) we (用作宾语) us (表所属关系) our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ (表泛指) [used together with 你 in parallel structures] anyone 大家你一言,我一语,献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ (指自我) self → 忘我, 自我

 

I basically only want the non-double-byte characters between the first character and the first occurrence of ① (see sample):

 

代 ①

 

Using regex101.com, I determined that the following regex pattern should produce the required results, but I need help getting these results into Automator:

 

/ ([a-z].) /
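
To make the goal concrete, here is a minimal sketch of the extraction in Ruby, which could presumably be called from an Automator "Run Shell Script" action. It does not use the exact pattern above: it takes the text between the first word and the first ①, then drops Han characters. The method name extract_reading is made up for illustration.

```ruby
# encoding: utf-8

# Hypothetical helper: returns the non-Han text found between the first
# word of the entry and the first "①" marker, or nil if there is no marker.
def extract_reading(text)
  # Skip the headword (first run of non-space characters), then capture
  # everything up to, but not including, the first "①".
  head = text[/\A\s*\S+\s+(.*?)①/m, 1]
  return nil unless head

  # Keep only runs of characters that are neither Han nor whitespace.
  head.scan(/[^\p{Han}\s]+/).join(" ")
end

puts extract_reading("我 wǒ 代 ① (指一人) (用作主语) I (用作宾语) me")
```

The \p{Han} class is an assumption about what "double-byte" means here; other scripts would need other character classes.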

MacBook Air (13-inch Mid 2012), Mac OS X (10.7.5), Love the mac (OSX 10.9+)

Posted on Sep 9, 2014 3:13 PM

  • by mingsai,

    mingsai Sep 25, 2014 12:28 PM in response to Hiroto
    Level 1 (30 points)

    Thanks for the update. I haven't tried the new RubyCocoa yet, but I was able to work around the issue by copying the prior version of the Ruby frameworks onto my system and pointing to the original source in the script. This let me validate that the original script does work on Yosemite (beta 8). The other dictionaries did not return good results, but the first item produced the desired results.


    on run {input}

      set dictf to "/Library/Dictionaries/Simplified Chinese - English.dictionary"
      --set dictf to "/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary"
      --set dictf to "/Library/Dictionaries/小词典.dictionary"
      --set dictf to "/Library/Dictionaries/unihan.dictionary"
      --set dictf to "/Library/Dictionaries/小词典-繁体字.dictionary"
      --set dictf to "/Library/Dictionaries/CC-CEDICT.dictionary"

      set query to input as Unicode text
      --hanzi2pinyin(dictf, 10, 0, query)
      --hanzi2pinyin(dictf, 10, 1, query)

      set pinyinText to hanzi2pinyin(dictf, 10, 2, query)

      input & pinyinText

    end run

    on hanzi2pinyin(dictf, max_count, output_format, query)
      (*
          string dictf : POSIX path of dictionary file
          integer max_count : Max record count to retrieve
          integer output_format : Output format.
              0 = interleaved : H[p] H[p]...
              1 = separate    : H H...[p p...]
              2 = pinyin only : p p...
          string query : query string
          return string : Hanzi[pinyin] in specified output format
      *)
      do shell script "d=" & dictf's quoted form & "; c=" & max_count & "; o=" & output_format & "
    /usr/local/bin/h2p -d \"$d\" -c \"$c\" -o \"$o\" -e -- " & query's quoted form
    end hanzi2pinyin

  • by SGIII,

    SGIII Sep 25, 2014 12:50 PM in response to mingsai
    Level 6 (10,782 points)
    Mac OS X

    Great to hear it works in Yosemite too.

     

    In Mavericks I've tried the different dictionaries.

     

    Good results:

     

    CC-CEDICT.dictionary --> wu4kong1 dao3gui3 hua1guo3shan1

    小词典.dictionary --> wùkōng dǎoguǐ huāguǒshān

     

    A little bit off:


    小词典-繁体字.dictionary  ---> wùkōng guǐ huāguǒshān

     

    Almost correct, but the word segmentation ("wordification") is a little off, and the multiple pronunciations for 多音字 (characters with more than one reading) are perhaps messy for some purposes:

     

    The Standard Dictionary of Contemporary Chinese.dictionary  -->  wù kōng(kòng) dǎoguǐ huāguǒshān

    Simplified Chinese - English.dictionary  -->  wù kōng(kòng) dǎoguǐ huā guǒ shān

     

    SG

  • by mingsai,

    mingsai Sep 25, 2014 1:29 PM in response to SGIII
    Level 1 (30 points)

    SGIII and Hiroto,

     

    FYI - I received the following results (in parentheses) using the various dictionaries on my system.

     

      --(wǒ shì měiguó rén) set dictf to "/Library/Dictionaries/Simplified Chinese - English.dictionary"
      --(wǒ shì měi guórén) set dictf to "/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary"
      --(failed, I may not have this dictionary) set dictf to "/Library/Dictionaries/小词典.dictionary"
      --(empty) set dictf to "/Library/Dictionaries/unihan.dictionary"
      --(wǒ shì měi rén) set dictf to "/Library/Dictionaries/小词典-繁体字.dictionary"
      --( ) set dictf to "/Library/Dictionaries/小词典-英语.dictionary"
      --( ) set dictf to "/Library/Dictionaries/kangxizidian.dictionary"
      --( ) set dictf to "/Library/Dictionaries/BHSD.dictionary"
      --( () << empty set) set dictf to "/Library/Dictionaries/CC-CEDICT.dictionary"

  • by Hiroto,

    Hiroto Sep 26, 2014 12:31 PM in response to SGIII
    Level 5 (7,348 points)

    Hello SG,

     

    小词典-繁体字.dictionary returns that result because 捣 is a simplified hanzi character; its traditional equivalent is 搗.

     

    Given the query = 悟空搗鬼花果山, 小词典-繁体字.dictionary will return 悟空 搗鬼 花果山[wùkōng dǎoguǐ huāguǒshān].

     

    Also, if you want to suppress the alternative readings returned in parentheses, you can set the max_count parameter to 1 in the AppleScript code (or specify the -c1 option in the shell code). However, please note that the first match in the dictionary is not necessarily the correct one.
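
    If changing max_count is not convenient, the parenthesized alternatives could also be stripped from the output afterwards. The following is only a sketch in Ruby, assuming the alternatives always appear as "(...)" directly after a syllable, as in the results SG posted; the method name strip_alternatives is made up for illustration. The same caveat applies: the reading that remains is not necessarily the correct one.

```ruby
# encoding: utf-8

# Hypothetical post-processing helper: removes every parenthesized group
# (and any whitespace just before it) from a pinyin string.
def strip_alternatives(pinyin)
  pinyin.gsub(/\s*\([^()]*\)/, "")
end

puts strip_alternatives("wù kōng(kòng) dǎoguǐ huā guǒ shān")
```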

     

    Regards,

    H

  • by Hiroto,

    Hiroto Sep 26, 2014 12:35 PM in response to mingsai
    Level 5 (7,348 points)

    Hello mingsai,

     

    The result of the script depends on the dictionary in use, because different dictionaries can use different XSL stylesheets to generate the XML (XHTML) representation of their record entries. If the XML data does not have the elements and attributes the script expects, the parser will fail and return a Not-found result.

     

     

    For example, 小词典.dictionary returns the following XML data for the query word '我':

     

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>
      <body>
        <d:entry id="xcd09325cdd85ded7c7cab6332260857fbd" d:title="我">
        <span class="syntax"><span d:pr="US">wǒ</span></span>
        <h1>我</h1>
        <div>I; me; my</div>
        <h3>我</h3>
    <div class="editEntry">
        <a href="http://xiaocidian.com/a/index.php?word=%E6%88%91">Edit CC-CEDICT Entry</a>
    </div>
    </d:entry>
      </body>
    </html>
    

     

     

    Here the script retrieves the title as the value at XPath //d:entry/@d:title and the pronunciation at //d:entry//span[@d:pr].
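
    To illustrate that lookup outside the h2p script, here is a rough sketch using REXML from the Ruby standard library. It is not the code h2p actually uses; the method name title_and_readings and the trimmed sample XML are made up for illustration, and the d:pr filter is done in Ruby rather than in the XPath predicate.

```ruby
# encoding: utf-8
require "rexml/document"

# Namespace URI bound to the "d" prefix in the dictionary XML above.
DCS_NS = "http://www.apple.com/DTDs/DictionaryService-1.0.rng"

# Trimmed-down copy of the sample entry, for illustration only.
SAMPLE = <<~XML
  <html xmlns:d="#{DCS_NS}">
    <body>
      <d:entry id="x1" d:title="我">
        <span class="syntax"><span d:pr="US">wǒ</span></span>
        <h1>我</h1>
      </d:entry>
    </body>
  </html>
XML

# Hypothetical helper: returns [title, readings] for the first d:entry,
# or nil when the XML has no d:entry element.
def title_and_readings(xml)
  doc = REXML::Document.new(xml)
  entry = REXML::XPath.first(doc, "//d:entry", { "d" => DCS_NS })
  return nil unless entry

  title = entry.attributes["d:title"]
  # Collect the text of every descendant span that carries a d:pr attribute.
  readings = REXML::XPath.match(entry, ".//span")
                         .select { |span| span.attributes["d:pr"] }
                         .map { |span| span.text.to_s.strip }
  [title, readings]
end

title, readings = title_and_readings(SAMPLE)
puts "#{title}: #{readings.join(' ')}"
```

    A dictionary whose stylesheet emits different elements would make this return nil or an empty list of readings, which is the Not-found case described above.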

     

    In the end, you'd have to adapt the parsing logic in the hanzi2pinyin script to the XML structure of the record entries of each dictionary. There is no universal XPath or regex pattern for this. If you wish, you can use the following script to obtain the XML data of an entry from a specified dictionary.

     

     

    #!/bin/bash
    
    # dictf='/Library/Dictionaries/Simplified Chinese - English.dictionary'
    # dictf='/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
    dictf='/Library/Dictionaries/小词典.dictionary'
    # dictf='/Library/Dictionaries/小词典-繁体字.dictionary'
    # dictf='/Library/Dictionaries/小词典-英语.dictionary'
    # dictf='/Library/Dictionaries/CC-CEDICT.dictionary'
    
    # CMD=/usr/local/bin/dictionary_record_data.rb
    CMD=~/desktop/dictionary_record_data.rb
    
    "$CMD" -d "$dictf" -c10 -o0 '我'
    

     

     

    provided that you have saved the following Ruby script as dictionary_record_data.rb on the Desktop.

     

     

    dictionary_record_data.rb

     

    #!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w
    # coding: utf-8
    
    require 'optparse'
    require 'osx/cocoa'
    include OSX
    # OSX.require_framework '/System/Library/Frameworks/CoreServices.framework/Frameworks/DictionaryServices.framework'     # [1]
    
    while File.exist?(BSFILE = File.expand_path("~/desktop/DictionaryServices.#{rand(1e10)}.bridgesupport")) do end
    Signal.trap("EXIT") { File.delete BSFILE if File.exist?(BSFILE) }
    File.open(BSFILE, "w") { |f| f.print DATA.read }
    OSX.load_bridge_support_file BSFILE # [2]
    File.delete BSFILE if File.exist?(BSFILE)
    
    # -----------------------------------------------------
    #   * some DictionaryServices functions (OS X 10.6.8)
    #   
    #   extern CFArrayRef DCSCopyRecordsForSearchString (DCSDictionaryRef, CFStringRef, unsigned long long, long long)
    #       unsigned long long method
    #           0   = exact match
    #           1   = forward match (prefix match)
    #           2   = partial query match (matching (leading) part of query; including ignoring diacritics, four tones in Chinese, etc)
    #           >=3 = ? (exact match?)
    #       
    #       long long max_record_count
    # 
    #   extern CFStringRef DCSRecordCopyData (DCSRecordRef, long)
    #       long output_style
    #           0 = XML XHTML <html> string
    #           1 = XML XHTML <html> string
    #           2 = XML XHTML <html> string
    #           3 = plain text
    #           4 = XML XHTML <text> string (single element)
    #       * corresponding to (?)
    #           Transform.xsl
    #           TransformApp.xsl
    #           TransformPanel.xsl
    #           TransformSimpleText.xsl
    #           TransformText.xsl
    # -----------------------------------------------------
    def dict(argv)
        # 
        #   argv = options query [query ...]
        #       -d, --dictionary DICTIONARY      Dictionary file.
        #       -c, --count COUNT                Max record count to retrieve (=10).
        #       -o, --output FORMAT              Output format (=0).
        #                                          0 = XML (XHTML) <html> string
        #                                          1 = XML (XHTML) <html> string
        #                                          2 = XML (XHTML) <html> string
        #                                          3 = plain text
        #                                          4 = XML (XHTML) <text> string
        #       -h, --help                       Display this help.
        # 
        args = {
            :dictf  => nil,
            :count  => 10,
            :output => 0,
        }
        op = OptionParser.new do|o|
            o.banner = "Usage: #{File.basename($0)} options query [query ...]"      
            o.on('-d', '--dictionary DICTIONARY', String, "Dictionary file.") do |f|
                args[:dictf] = f
            end
            o.on('-c', '--count COUNT', Integer, "Max record count to retrieve (=10).") do |i|
                raise OptionParser::InvalidArgument, i unless i.to_i > 0
                args[:count] = i.to_i
            end
            o.on('-o', '--output FORMAT', Integer, "Output format (=0).", 
                "  0 = XML (XHTML) <html> string", 
                "  1 = XML (XHTML) <html> string", 
                "  2 = XML (XHTML) <html> string", 
                "  3 = plain text", 
                "  4 = XML (XHTML) <text> string") do |i|
                raise OptionParser::InvalidArgument, i unless [0, 1, 2, 3, 4].include?(i.to_i)
                args[:output] = i.to_i
            end
            o.on( '-h', '--help', 'Display this help.' ) do
                $stderr.puts o; exit 1
            end
        end
        begin
            op.parse!(argv)
        rescue => ex
            $stderr.puts "#{ex.class} : #{ex.message}"
            $stderr.puts op.help(); exit 1
        end
        if argv.length == 0
            $stderr.puts op.help(); exit 1
        end
    
        if (dctf = args[:dictf])
            unless File.exists?(dctf)
                $stderr.puts "No such dictionary: %s" % dctf
                exit 1
            end
            url = NSURL.fileURLWithPath(dctf)
            dcts = dcts.allObjects if (dcts = DCSCopyAvailableDictionaries()).is_a? NSSet   # [5]
            dct, = dcts.select { |d| DCSDictionaryGetURL(d).path == url.path }
            unless dct
                $stderr.puts "Failed to get dictionary for: %s" % dctf
                exit 2
            end
        else
            dct, = DCSGetActiveDictionaries()
            unless dct
                $stderr.puts "Failed to get the 1st active dictionary"
                exit 2
            end
        end
    
        max_count    = args[:count]
        output_format = args[:output]
        argv.map {|a| a.to_ns }.each do |q|     # [3]
            rr = DCSCopyRecordsForSearchString(dct, q, 0, max_count)
            unless rr
                puts "Not found: %s" % q
                next
            end
            rr.each do |r|  # r = DCSRecordRef
                data = DCSRecordCopyData(r, output_format)
                puts data
            end
        end
        # 
        #   [1] DictionaryServices.framework/Resources/BridgeSupport/DictionaryServices.bridgesupport has problem to be fixed.
        #       I.e., in signatures of DCSCopyTextDefinition(), DCSGetTermRangeInString() function etc,
        #           {??=qq} should have been {_CFRange=qq}
        #           {??=ii} should have been {_CFRange=ii}
        #   [2] Fixed and extended bridgesupport file is loaded by OSX.load_bridge_support_file.
        #       It now includes signatures for several undocumented functions as well.
        #   [3] argv.to_ns is required to handle unicode characters correctly (in ruby 1.8).
        # 
    end
    
    dict(ARGV)
    exit
    
    
    # ---- test code begins ----
    # dictf = '/Library/Dictionaries/Simplified Chinese - English.dictionary'
    # dictf = '/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
    dictf = '/Library/Dictionaries/小词典.dictionary'
    # dictf = '/Library/Dictionaries/小词典-繁体字.dictionary'
    # dictf = '/Library/Dictionaries/小词典-英语.dictionary'
    # dictf = '/Library/Dictionaries/CC-CEDICT.dictionary'
    
    argv = ['-d', dictf, '我']
    dict(argv)
    # ---- test code ends ----
    
    
    __END__
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE signatures SYSTEM "file://localhost/System/Library/DTDs/BridgeSupport.dtd">
    <signatures version="0.9">
        <function name="DCSCopyTextDefinition">
            <arg type="^{__DCSDictionary=}"></arg>
            <arg type="^{__CFString=}"></arg>
            <arg type64="{_CFRange=qq}" type="{_CFRange=ii}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSGetTermRangeInString">
            <arg type="^{__DCSDictionary=}"></arg>
            <arg type="^{__CFString=}"></arg>
            <arg type64="q" type="l"></arg>
            <retval type64="{_CFRange=qq}" type="{_CFRange=ii}"></retval>
        </function>
        <function name="DCSDictionaryCreate">
            <arg type="^{__CFURL=}"></arg>
            <retval type="^{__DCSDictionary=}"></retval>
        </function>
        <function name="DCSGetActiveDictionaries">
            <retval type="^{__CFArray=}"></retval>
        </function>
        <function name="DCSCopyAvailableDictionaries">
            <retval type="^{__CFSet=}"></retval>
        </function>
        <function name="DCSGetDefaultDictionary">
            <retval type="^{__DCSDictionary=}"></retval>
        </function>
        <function name="DCSGetDefaultThesaurus">
            <retval type="^{__DCSDictionary=}"></retval>
        </function>
        <function name="DCSDictionaryGetURL">
            <arg type="^{__DCSDictionary=}"></arg>
            <retval type="^{__CFURL=}"></retval>
        </function>
        <function name="DCSDictionaryGetName">
            <arg type="^{__DCSDictionary=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSDictionaryGetIdentifier">
            <arg type="^{__DCSDictionary=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSCopyRecordsForSearchString">
            <arg type="^{__DCSDictionary=}"></arg>
            <arg type="^{__CFString=}"></arg>
            <arg type="l"></arg>
            <arg type="l"></arg>
            <retval type="^{__CFArray=}"></retval>
        </function>
        <function name="DCSRecordGetHeadword">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetString">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetRawHeadword">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetTitle">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetAnchor">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetDataURL">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFURL=}"></retval>
        </function>
        <function name="DCSRecordCopyData">
            <arg type="^{__DCSRecord=}"></arg>
             <arg type="l"></arg>
           <retval type="^{__CFString=}"></retval>
        </function>
    </signatures>
    

     

     

    All the best,

    H

  • by SGIII,

    SGIII Sep 26, 2014 12:47 PM in response to Hiroto
    Level 6 (10,782 points)
    Mac OS X

    Got it.  Thanks!

     

    SG

  • by mingsai,

    mingsai Sep 26, 2014 7:18 PM in response to Hiroto
    Level 1 (30 points)

    This is great. I will have to take another look at how the script reads the XSL templates, but with this method one should be able to extend the script to any dictionary that includes pronunciation information.

     

    Thanks again!
