mingsai

Q: How can I use Automator to extract substring of text based on pattern?

I have inbound text in a workflow and I want to extract a substring from the text.

 

(inbound text)

 

我 wǒ 代 ① (指一人) (用作主语) I (用作宾语) me (表所属关系) my 告诉我 tell me 我为人人,人人为我 one for all and all for one 我爸/妈 my father/mother 我的祖国 my homeland 我现在没空。 I am busy at the moment. 我认为我行! I think I can manage it. ② (指两人或以上) (用作主语) we (用作宾语) us (表所属关系) our 我厂/国/校/军 our factory/country/school/army 敌军被我全歼。 The enemy was annihilated by us. → 我方, 敌我矛盾 ③ (表泛指) [used together with 你 in parallel structures] anyone 大家你一言,我一语,献计献策。 They had a brainstorming session with anyone and everyone joining in. 市场里你来我往非常热闹。 The market is bustling with people coming and going. → 尔虞我诈, 你死我活 ④ (指自我) self → 忘我, 自我

 

I basically only want the non-double-byte characters that come after the first character and before the first occurrence of ① (see sample):

 

代 ①

 

Using regex101.com I have been able to determine that this regex pattern should produce the required results, but I need help getting these results into Automator:

 

/ ([a-z].) /
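
To try the pattern outside regex101, here is a minimal Ruby sketch. It assumes a reasonably modern Ruby (2.2 or later, for `unicode_normalize`); the method name `first_short_pinyin` is made up for illustration. Note that the pattern only matches a two-character token between spaces, so it finds "wǒ" but not longer pinyin words.

```ruby
# Sketch only: apply the pattern from the post to one dictionary entry.
PATTERN = / ([a-z].) /

# Returns the first captured group, or nil when nothing matches.
# (first_short_pinyin is a hypothetical helper name.)
def first_short_pinyin(text)
  # Normalize so that a letter plus a combining tone mark counts as one character.
  text.unicode_normalize(:nfc)[PATTERN, 1]
end

entry = "我 wǒ 代 ① (指一人) (用作主语) I (用作宾语) me"
puts first_short_pinyin(entry).inspect
```

In Automator, something along these lines could probably go into a Run Shell Script action that reads the inbound text from standard input, provided the Ruby available to that action is recent enough.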

MacBook Air (13-inch Mid 2012), Mac OS X (10.7.5), Love the mac (OSX 10.9+)

Posted on Sep 9, 2014 3:13 PM

  • by mingsai,

    mingsai mingsai Sep 11, 2014 8:52 AM in response to SGIII
    Level 1 (30 points)

    This is very good but should probably incorporate a space between each pinyin word.

  • by SGIII,

    SGIII SGIII Sep 11, 2014 2:31 PM in response to mingsai
    Level 6 (10,782 points)
    Mac OS X

    This may be close to what you're looking for.  In my (limited) testing it returns the first occurrence of what looks to be pinyin, though it also returns false positives if the definition were to include an "English-looking" word before the pinyin.

     

    set pyCharSet to {"a", "ā", "á", "ǎ", "à", "b", "c", "d", "e", "ē", "é", "ě", "è", "f", "g", "h", "i", "ī", "í", "ǐ", "ì", "j", "k", "l", "m", "n", "o", "ō", "ó", "ǒ", "ò", "p", "q", "r", "s", "t", "u", "ū", "ú", "ǔ", "ù", "ǖ", "ǚ", "ǜ", "w", "x", "y", "z"}

     

    set ww to input's paragraphs's item 1's words

    repeat with i from 1 to count ww

      set w to ww's item i

      if w's every character's item 1 is in pyCharSet then return w

    end repeat


     

    Am experimenting with the 'considering diacriticals' AppleScript statement to see if I can get it to only match words with accented characters and then, if it doesn't, to check a list of common neutral tones ("de", "zhe", "le", "ba", "ma", "ne", "a", "lai", "ge", etc).  So far no joy, though I do seem to have better luck with AppleScript than with shell scripts in avoiding spurious results with Chinese.
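
For comparison, here is a rough Ruby sketch of the same idea. It is a variation, not a translation: it checks every character of each word rather than only the first, and it splits on whitespace, which is not the same as AppleScript's `words`. The name `first_pinyin_like_word` is made up for illustration. The false positive mentioned above still shows up, since an English word such as "book" consists only of letters from the set.

```ruby
# Same letters as pyCharSet in the AppleScript above.
PINYIN_LETTERS = "aāáǎàbcdeēéěèfghiīíǐìjklmnoōóǒòpqrstuūúǔùǖǚǜwxyz"
                 .unicode_normalize(:nfc).chars.freeze

# Returns the first whitespace-separated word made up only of pinyin letters,
# or nil if there is none. (Hypothetical helper name.)
def first_pinyin_like_word(text)
  text.unicode_normalize(:nfc).split.find do |word|
    word.downcase.chars.all? { |ch| PINYIN_LETTERS.include?(ch) }
  end
end

puts first_pinyin_like_word("我 wǒ 代 ① (指一人) I me")
puts first_pinyin_like_word("书 book shū")   # false positive: the English word comes first
```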

     

    SG

  • by SGIII,

    SGIII SGIII Sep 11, 2014 7:47 PM in response to mingsai
    Level 6 (10,782 points)
    Mac OS X

    This is closer.  It returns the first instance of a pinyin word containing a diacritical (tone mark), ignoring any romanized words before that, unless they are in a set of common neutral tone syllables.

     

    It could be modified to extract all instances of pinyin in a text.

     

    SG

     

    set pinyinDiacriticals to {"ā", "á", "ǎ", "à", "ē", "é", "ě", "è", "ī", "í", "ǐ", "ì", "ō", "ó", "ǒ", "ò", "ū", "ú", "ǔ", "ù", "ǖ", "ǚ", "ǜ"}

    set neutralToneWords to {"a", "ba", "bian", "de", "di", "ge", "jie", "lai", "le", "ma", "ne", "tou", "yi", "zhe"}

     

    set tt to input's paragraphs's contents as text -- merge paragraphs

    set ww to tt's words --split the text into words

    repeat with i from 1 to count ww

      set w to ww's item i

      set cc to w's characters -- split a word into characters

      repeat with c in cc

      if c is in pinyinDiacriticals then return w

      end repeat

      if w is in neutralToneWords then return w

    end repeat
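
The same logic can be sketched in Ruby, under a few assumptions: words are split on whitespace only (so punctuation stays attached, unlike AppleScript's `words`), matching is case-sensitive, and the method name `first_toned_pinyin` is made up for illustration.

```ruby
# Same two lists as in the AppleScript above.
TONE_MARKS = "āáǎàēéěèīíǐìōóǒòūúǔùǖǚǜ".unicode_normalize(:nfc).chars.freeze
NEUTRAL_TONE_WORDS = %w[a ba bian de di ge jie lai le ma ne tou yi zhe].freeze

# Returns the first word that contains a tone mark or is a listed
# neutral-tone syllable, or nil if no word qualifies. (Hypothetical helper name.)
def first_toned_pinyin(text)
  text.unicode_normalize(:nfc).split.find do |word|
    NEUTRAL_TONE_WORDS.include?(word) ||
      word.chars.any? { |ch| TONE_MARKS.include?(ch) }
  end
end

puts first_toned_pinyin("书 book shū")   # skips "book", which has no tone mark
puts first_toned_pinyin("的 de particle")
```

As with the AppleScript, an English "a" or "ma" ahead of the pinyin would still be picked up through the neutral-tone list.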


  • by mingsai,

    mingsai mingsai Sep 12, 2014 8:28 PM in response to SGIII
    Level 1 (30 points)

    SGIII,

     

    This is very close indeed. Here's an idea: How about reversing the logic related to the neutralToneWords so that a search is if w is not in chineseLanguageCharacterSet then return w. It would be great if AppleScript could exclude all Chinese characters in this way. I've been examining the character sets available in Mac OS X using the Character Viewer. I've noted that the system has the ability to categorize Hanzi very neatly.  One possible solution would be to exclude the entire set and take the first non-Chinese word containing letters.

     

    Screen Shot 2014-09-12 at 11.23.57 PM.png

     

    BTW - I was able to extend my Character Viewer character sets by clicking the settings gear and choosing Customize List...

    Screen Shot 2014-09-12 at 11.26.49 PM.png

  • by SGIII,

    SGIII SGIII Sep 13, 2014 4:38 PM in response to mingsai
    Level 6 (10,782 points)
    Mac OS X

    How about reversing the logic related to the neutralToneWords so that a search is if w is not in chineseLanguageCharacterSet then return w. ... One possible solution would be to exclude the entire set and take the first non-Chinese word containing letters.

     

    This approach turns out to be pretty easy in AppleScript. The isCJK() handler in the following script shows one way to test whether a character is Chinese (or Korean or Japanese).  In AppleScript a character's id property is the Unicode number expressed in decimal.  So I looked up the relevant blocks in the Unicode charts and converted the hex values there to decimal.


    set tt to input's paragraphs's contents as text -- merge paragraphs

    set ww to tt's words --split the text into words

    repeat with i from 1 to count ww

      set w to ww's item i

      set cc to w's characters -- split a word into characters

      repeat with c in cc

      if not isCJK(c) then return w

      end repeat

    end repeat

     

    on isCJK(c)

      -- if working with rare characters, first uncomment A. Unlikely to need B!

      -- c's id is greater than 131071 and c's id is less than 173783 -- extension B (really rare)

      -- c's id is greater than 13311 and c's id is less than 19894 -- 3400-4DB5 - extension A (rare)

      c's id is greater than 19967 and c's id is less than 40909 -- 4E00-9FCC - common CJK

    end isCJK



    The A and B extensions probably aren't necessary in most situations (except for historians) so I commented them out.  If needed, one would uncomment the A extension first, and finally B if needed.

     

    Unlike the previous script, this variation grabs the first thing that isn't CJK, whether it's pinyin or something else.
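
A rough Ruby equivalent of the code-point test, using the same three Unicode blocks written in hex. The names `cjk?` and `first_non_cjk_word` are made up for illustration, and words are split on whitespace rather than with AppleScript's `words`.

```ruby
# Same blocks as in the isCJK() handler, written as hex code-point ranges.
CJK_RANGES = [
  0x4E00..0x9FCC,    # common CJK
  0x3400..0x4DB5,    # extension A (rare)
  0x20000..0x2A6D6   # extension B (really rare)
].freeze

# True when the character's code point falls inside one of the blocks.
def cjk?(char)
  CJK_RANGES.any? { |range| range.cover?(char.ord) }
end

# Returns the first word containing at least one non-CJK character, or nil.
def first_non_cjk_word(text)
  text.split.find { |word| word.chars.any? { |ch| !cjk?(ch) } }
end

puts first_non_cjk_word("我 wǒ 代 ① (指一人)")
```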

     

    SG

  • by mingsai,

    mingsai mingsai Sep 13, 2014 11:12 PM in response to SGIII
    Level 1 (30 points)

    Masterful. I will test and advise on utility among the various different dictionaries.

  • by VikingOSX,

    VikingOSX VikingOSX Sep 14, 2014 4:54 AM in response to mingsai
    Level 7 (21,314 points)
    Mac OS X

    I realized that my regular expression was tuned to , but not to multi-tonal (if that is an applicable expression) words such as tiger 老虎 lǎohǔ. I then came across the following post that shows how to construct an elaborate regular expression for detecting hanzi. Would this be of value to you?

  • by SGIII,

    SGIII SGIII Sep 14, 2014 8:44 AM in response to VikingOSX
    Level 6 (10,782 points)
    Mac OS X

    Great link. Have saved that one.  Complicated stuff!

     

    Made a mistake in the AppleScript handler.  If testing for rare characters too, better would be:

     

    on isCJK(c)

      if c's id is greater than 131071 and c's id is less than 173783 then return true -- extension B (really rare)

      if c's id is greater than 13311 and c's id is less than 19894 then return true -- 3400-4DB5 - extension A (rare)

      c's id is greater than 19967 and c's id is less than 40909 -- 4E00-9FCC - common CJK

    end isCJK

     

    In normal usage, could comment out the two 'if' statements.

     

    SG

  • by mingsai,

    mingsai mingsai Sep 14, 2014 12:05 PM in response to VikingOSX
    Level 1 (30 points)

    That link was a really good find.

     

    Given the regex I was able to put several tests into regex101.com:

     

    \s+[^\s]+(?![p{InCJKUnifiedIdeographs}p{InCJKUnifiedIdeographsExtensionA}p{InCJK CompatibilityIdeographs}x{30FB}x{FF0C}x{3007}x{FF21}x{FF3A}[0-9]])

     

    This is the logic we are translating into AppleScript.

     

    Here is an example where the expression breaks down. It would be much better to get a count of the number of Chinese words at the beginning of the input and use that count as a variable in the regex to iterate over the correct number of non-CJK characters.

    Screen Shot 2014-09-14 at 11.09.17 AM.png

     

    Here we can see that when the non-CJK follows immediately, the regex works well.

    Screen Shot 2014-09-14 at 11.09.37 AM.png

     

    It even works when the pinyin falls at the end of the line.

    Screen Shot 2014-09-14 at 11.11.32 AM.png

  • by mingsai,

    mingsai mingsai Sep 14, 2014 12:07 PM in response to SGIII
    Level 1 (30 points)

    If the last statement were made into:

     

    if c... then return true

    else

    return false

     

    The code paths would all return a valid boolean.

  • by SGIII,

    SGIII SGIII Sep 14, 2014 4:59 PM in response to mingsai
    Level 6 (10,782 points)
    Mac OS X

    They each return a valid boolean as written, including:

     

    c's id is greater than 19967 and c's id is less than 40909

     

    You need 'return' in the first two tests to exit the handler if the boolean is true.

    But you don't need it in the last one. Whether it is true or false it will exit.

     

    SG

  • by SGIII,

    SGIII SGIII Sep 14, 2014 5:28 PM in response to mingsai
    Level 6 (10,782 points)
    Mac OS X

    This is an AppleScript way to count the Chinese characters at the beginning of input:

     

    set tt to input's paragraphs's contents as text -- merge paragraphs

    set ww to tt's words -- split the text into words

    set CJKctr to 0

    repeat with i from 1 to count ww

      set w to ww's item i

      set cc to w's characters -- split a word into characters

      repeat with c in cc

         if isCJK(c) then

            set CJKctr to CJKctr + 1

         else

           return CJKctr

         end if

      end repeat

    end repeat

     

     

    on isCJK(c)

      --if c's id is greater than 131071 and c's id is less than 173783 then return true -- extension B (really rare)

      --if c's id is greater than 13311 and c's id is less than 19894 then return true -- 3400-4DB5 - extension A (rare)

      c's id is greater than 19967 and c's id is less than 40909 -- 4E00-9FCC - common CJK

    end isCJK
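
A Ruby sketch of the same count, for comparison. It only tests the common CJK block, strips whitespace instead of splitting into words, and returns the total even when the whole input is CJK. The names `cjk?` and `leading_cjk_count` are made up for illustration.

```ruby
# Common CJK block only, matching the uncommented line of isCJK().
def cjk?(char)
  (0x4E00..0x9FCC).cover?(char.ord)
end

# Counts CJK characters from the start of the text, ignoring whitespace,
# and stops at the first character that is not CJK.
def leading_cjk_count(text)
  text.gsub(/[[:space:]]+/, "").each_char.take_while { |ch| cjk?(ch) }.length
end

puts leading_cjk_count("老虎 lǎohǔ tiger")
```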

     

     

    SG

  • by Hiroto,

    Hiroto Hiroto Sep 16, 2014 1:38 PM in response to mingsai
    Level 5 (7,348 points)

    Hello

     

    I have managed to invoke some of the undocumented functions of DictionaryServices.framework. The shell script listed below, which is a wrapper around a RubyCocoa script, will transliterate Hanzi to pinyin by looking the text up in a specified dictionary.

     

    The key functions are DCSCopyRecordsForSearchString() and DCSRecordCopyData(), which retrieve the structured representation (XHTML) of a found entry. We can then parse it and extract the title and pronunciation cleanly.

     

    A usage example is as follows, provided that you have saved the script as /usr/local/bin/hanzi2pinyin (and chmod a+x it). Please see the source code for the details of the options. The result will depend upon the dictionary used.

    The script tries to find the longest substring, starting from the beginning of the current query, that matches some term in the dictionary. If such a substring exists, the script looks it up in the dictionary, outputs the result and updates the current query to the remaining substring. Otherwise, the script gives up on the first character of the current query, outputs the not-found result (specified by the -e option) for that character and updates the current query to the remaining substring.

    Note that some words (characters) have multiple readings, in which case the script outputs the primary match followed by the additional matches in parentheses. Specify the max record count to retrieve with the -c option. The additional matches are noise when the primary match is correct, but the primary match is not necessarily correct. If you wish, you may specify -c1 to suppress additional matches.
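
The longest-match loop described above can be sketched in plain Ruby with a toy lookup table. This is only an illustration of the segmentation idea: `TOY_DICT` and `segment` are made-up names, the hash stands in for the real dictionary lookup, and multiple readings are not handled.

```ruby
# Toy stand-in for a real dictionary: headword => pinyin (hypothetical data).
TOY_DICT = {
  "悟空"   => "wùkōng",
  "捣鬼"   => "dǎoguǐ",
  "花果山" => "huāguǒshān",
  "花"     => "huā"
}.freeze

# Splits the query into [headword, pinyin] pairs, always taking the longest
# leading substring found in dict; unknown characters get the not_found marker.
def segment(query, dict, not_found: "?")
  results = []
  until query.empty?
    # Try the longest leading substring first, shrinking until a term matches.
    len = query.length.downto(1).find { |n| dict.key?(query[0, n]) }
    if len
      results << [query[0, len], dict[query[0, len]]]
    else
      len = 1   # give up one character at the beginning
      results << [query[0, 1], not_found]
    end
    query = query[len..-1]
  end
  results
end

puts segment("悟空捣鬼花果山", TOY_DICT).map { |head, pron| "#{head}[#{pron}]" }.join(" ")
# prints: 悟空[wùkōng] 捣鬼[dǎoguǐ] 花果山[huāguǒshān]
```

Because the longest candidate is tried first, 花果山 is matched as one term even though 花 alone is also in the table.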

     

     

    Usage e.g.

     

    #!/bin/bash
    
    # dictf='/Library/Dictionaries/Simplified Chinese - English.dictionary'
    # dictf='/Library/Dictionaries/The Standard Dictionary of Contemporary Chinese.dictionary'
    # dictf='/Library/Dictionaries/小词典.dictionary'
    # dictf='/Library/Dictionaries/小词典-繁体字.dictionary'
    dictf='/Library/Dictionaries/CC-CEDICT.dictionary'
    
    /usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o0 -e -- '悟空捣鬼花果山'     # => 悟空[wùkōng] 捣鬼[dǎoguǐ] 花果山[huāguǒshān]
    /usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o1 -e -- '悟空捣鬼花果山'     # => 悟空 捣鬼 花果山[wùkōng dǎoguǐ huāguǒshān]
    /usr/local/bin/hanzi2pinyin -d "$dictf" -c10 -o2 -e -- '悟空捣鬼花果山'     # => wùkōng dǎoguǐ huāguǒshān
    

     

     

    hanzi2pinyin

     

    #!/bin/bash
    
    hanzi2pinyin()
    {
        # 
        #   $@ = options query [query ...]
        #       -d, --dictionary DICTIONARY      Dictionary file.
        #       -c, --count COUNT                Max record count to retrieve (=10).
        #       -o, --output FORMAT              Output format (=0).
        #                                          0 = interleaved : H[p] H[p]...
        #                                          1 = separate    : H H...[p p...]
        #                                          2 = pinyin only : p p...
        #       -e, --echo [CHARACTER]           Character(s) to be echoed for no result.
        #                                        Given no CHARACTER, query is echoed.
        #       -h, --help                       Display this help.
        # 
        #   v0.33
        #   written by Hiroto, 2014-09
        # 
        /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - <(cat <<'BRIDGESUPPORT'
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE signatures SYSTEM "file://localhost/System/Library/DTDs/BridgeSupport.dtd">
    <signatures version="0.9">
        <function name="DCSCopyTextDefinition">
            <arg type="^{__DCSDictionary=}"></arg>
            <arg type="^{__CFString=}"></arg>
            <arg type64="{_CFRange=qq}" type="{_CFRange=ii}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSGetTermRangeInString">
            <arg type="^{__DCSDictionary=}"></arg>
            <arg type="^{__CFString=}"></arg>
            <arg type64="q" type="l"></arg>
            <retval type64="{_CFRange=qq}" type="{_CFRange=ii}"></retval>
        </function>
        <function name="DCSDictionaryCreate">
            <arg type="^{__CFURL=}"></arg>
            <retval type="^{__DCSDictionary=}"></retval>
        </function>
        <function name="DCSGetActiveDictionaries">
            <retval type="^{__CFArray=}"></retval>
        </function>
        <function name="DCSGetDefaultDictionary">
            <retval type="^{__DCSDictionary=}"></retval>
        </function>
        <function name="DCSGetDefaultThesaurus">
            <retval type="^{__DCSDictionary=}"></retval>
        </function>
        <function name="DCSDictionaryGetURL">
            <arg type="^{__DCSDictionary=}"></arg>
            <retval type="^{__CFURL=}"></retval>
        </function>
        <function name="DCSDictionaryGetName">
            <arg type="^{__DCSDictionary=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSDictionaryGetIdentifier">
            <arg type="^{__DCSDictionary=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSCopyRecordsForSearchString">
            <arg type="^{__DCSDictionary=}"></arg>
            <arg type="^{__CFString=}"></arg>
            <arg type="l"></arg>
            <arg type="l"></arg>
            <retval type="^{__CFArray=}"></retval>
        </function>
        <function name="DCSRecordGetHeadword">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetString">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetRawHeadword">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetTitle">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetAnchor">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
        <function name="DCSRecordGetDataURL">
            <arg type="^{__DCSRecord=}"></arg>
            <retval type="^{__CFURL=}"></retval>
        </function>
        <function name="DCSRecordCopyData">
            <arg type="^{__DCSRecord=}"></arg>
            <arg type="l"></arg>
            <retval type="^{__CFString=}"></retval>
        </function>
    </signatures>
    BRIDGESUPPORT) "$@"
    # -----------------------------------------------------
    #   * some DictionaryServices functions
    # 
    #   (undocumented)
    #   
    #   extern CFArrayRef DCSGetActiveDictionaries (void)
    #   extern DCSDictionaryRef DCSGetDefaultDictionary (void)
    #   extern DCSDictionaryRef DCSGetDefaultThesaurus (void)
    #   extern DCSDictionaryRef DCSDictionaryCreate (CFURLRef)
    #   extern CFURLRef DCSDictionaryGetURL (DCSDictionaryRef)
    #   extern CFStringRef DCSDictionaryGetName (DCSDictionaryRef)
    #   extern CFStringRef DCSDictionaryGetIdentifier (DCSDictionaryRef)
    #   
    #   extern CFArray DCSCopyRecordsForSearchString (DCSDictionaryRef, CFStringRef, unsigned long long, long long)
    #       unsigned long long method
    #           0   = exact match
    #           1   = forward match (prefix match)
    #           2   = partial query match (matching (leading) part of query; including ignoring diacritics, four tones in Chinese, etc)
    #           >=3 = ? (exact match?)
    #       
    #       long long max_record_count
    # 
    #   extern CFStringRef DCSRecordGetString (DCSRecordRef) 
    #   extern CFStringRef DCSRecordGetHeadword (DCSRecordRef) 
    #   extern CFStringRef DCSRecordGetRawHeadword (DCSRecordRef) 
    #   extern CFStringRef DCSRecordGetTitle (DCSRecordRef) 
    #   extern CFStringRef DCSRecordGetAnchor (DCSRecordRef) 
    #   extern CFURLRef DCSRecordGetDataURL (DCSRecordRef) 
    # 
    #   extern CFStringRef DCSRecordCopyData (DCSRecordRef, long)
    #       long output_style
    #           0 = XML XHTML <html> string
    #           1 = XML XHTML <html> string
    #           2 = XML XHTML <html> string
    #           3 = plain text
    #           4 = XML XHTML <text> string (single element)
    #       * corresponding to (?)
    #           Transform.xsl
    #           TransformApp.xsl
    #           TransformPanel.xsl
    #           TransformSimpleText.xsl
    #           TransformText.xsl
    # 
    #   (documented)
    #   
    #   CFStringRef DCSCopyTextDefinition (DCSDictionaryRef, CFStringRef, CFRange)
    #   CFRange DCSGetTermRangeInString (DCSDictionaryRef, CFStringRef, CFIndex)
    # 
    # -----------------------------------------------------
    
        # 
        #   ARGV[0] = fixed and extended bridge support file for DictionaryServices.framework
        #   ARGV[1..N] = options and query word(s)
        # 
        require 'osx/cocoa'
        include OSX
        # OSX.require_framework '/System/Library/Frameworks/CoreServices.framework/Frameworks/DictionaryServices.framework'     # [1]
        OSX.load_bridge_support_file ARGV.shift     # [2]
        
        def parse_options(argv)
            require 'optparse'
            args = {
                :dictf  => nil,
                :count  => 10,
                :output => 0,
                :echo   => '',
            }
            op = OptionParser.new do|o|
                o.banner = "Usage: #{File.basename($0)} options query [query ...]"      
                o.on('-d', '--dictionary DICTIONARY', String, "Dictionary file.") do |f|
                    args[:dictf] = f
                end
                o.on('-c', '--count COUNT', Integer, "Max record count to retrieve (=10).") do |i|
                    raise OptionParser::InvalidArgument, i unless i.to_i > 0
                    args[:count] = i.to_i
                end
                o.on('-o', '--output FORMAT', Integer, "Output format (=0).", 
                    "  0 = interleaved : H[p] H[p]...", 
                    "  1 = separate    : H H...[p p...]", 
                    "  2 = pinyin only : p p...") do |i|
                    raise OptionParser::InvalidArgument, i unless [0, 1, 2].include?(i.to_i)
                    args[:output] = i.to_i
                end
                o.on('-e', '--echo [CHARACTER]', String, "Character(s) to be echoed for no result.",
                    "Given no CHARACTER, query is echoed.") do |s|
                    args[:echo] = s || ''
                end
                o.on( '-h', '--help', 'Display this help.' ) do
                    $stderr.puts o; exit 1
                end
            end
            begin
                op.parse!(argv)
            rescue => ex
                $stderr.puts "#{ex.class} : #{ex.message}"
                $stderr.puts op.help(); exit 1
            end
            if argv.length == 0
                $stderr.puts op.help(); exit 1
            end
            args
        end
    
        args = parse_options(ARGV)
        if (dctf = args[:dictf])
            unless File.exists?(dctf)
                $stderr.puts "No such dictionary: %s" % dctf
                exit 1
            end
            url = NSURL.fileURLWithPath(dctf)
            dct = DCSDictionaryCreate(url)
            unless dct
                $stderr.puts "Failed to create dictionary object from: %s" % dctf
                exit 2
            end
        else
            dct = DCSGetDefaultDictionary()
            unless dct
                $stderr.puts "Failed to obtain default dictionary"
                exit 2
            end
        end
    
        QUERY_METHOD = 0                # exact match
        MAX_RECORD_COUNT = args[:count] # max record count to be retrieved
        OUTPUT_FORMAT = args[:output]   # output format option
                                        #   0 = interleaved : H[p] H[p]...
                                        #   1 = separate    : H H...[p p...]
                                        #   2 = pinyin only : p p...
                                        #
                                        # e.g., given query '我的母亲'
                                        #   0 => 我[wǒ] 的[de(dī,dí,dì)] 母亲[mǔqīn]
                                        #   1 => 我 的 母亲[wǒ de(dī,dí,dì) mǔqīn]
                                        #   2 => wǒ de(dī,dí,dì) mǔqīn
    
        TRIM_CHARS = "\t\n |"           # characters to be trimmed at both ends of pronunciation string
        ECHO_QUERY = ''                 # special character to let it echo query if result is not found
        ECHO_CHAR = args[:echo]         # character(s) to be echoed if no result is found for query
                                        # if ECHO_QUERY is specified, query string is echoed for no result
        
        TRIM_CHARS_SET = NSCharacterSet.characterSetWithCharactersInString(TRIM_CHARS)
        ECHO_NS = ECHO_CHAR.to_ns
    
        ARGV.map {|a| a.to_ns }.each do |q|     # [3]
            dd = []
            while true do
                # 
                #   Until given query string (q) is exhausted, repeat as follows -
                #     get longest leading substring (qu) of the query string matching a term in dictionary,
                #     look the substring up in dictionary and retrieve title and pronunciation of the matching entry.
                # 
                u = DCSGetTermRangeInString(dct, q, 0)              # try to find longest leading range matching a term in dictionary
                u = NSMakeRange(0, 1) if u.location == KCFNotFound  # fallback [4]
                qu = q.substringWithRange(u)
                rr = DCSCopyRecordsForSearchString(dct, qu, QUERY_METHOD, MAX_RECORD_COUNT)
                unless rr
                    c = q.substringWithRange(NSMakeRange(0, 1))     # give up one character at the beginning
                    dd << [[c, ECHO_CHAR == ECHO_QUERY ? c : ECHO_NS]]
                    break if q.length < 2
                    q = q.substringFromIndex(1)
                else
                    tt, pp = [], {}
                    rr.each do |r|  # r = DCSRecordRef
                        # 
                        #   parse xml representation of record entry to get title and pronunciation
                        # 
                        xml = DCSRecordCopyData(r, 0)
                        err = OCObject.new
                        doc = NSXMLDocument.alloc.objc_send(
                            :initWithXMLString, xml,
                            :options, 0,
                            :error, err)
                        unless doc
                            $stderr.puts "Failed to obtain XML document for %s: %s" % [qu, err.description]
                            next
                        end
                        nn = doc.objc_send(
                            :nodesForXPath, '//d:entry/@d:title',   # d:title attribute
                            :error, nil)
                        title = (nn.nil? || nn == []) ? ECHO_NS : nn.first.stringValue
                        nn = doc.objc_send(
                            :nodesForXPath, '//d:entry//span[@d:pr]',   # span element with d:pr attribute
                            :error, nil)
                        pron = (nn.nil? || nn == []) ? ECHO_NS : nn.first.stringValue
                        pron = pron.stringByTrimmingCharactersInSet(TRIM_CHARS_SET).
                            stringByReplacingOccurrencesOfString_withString(' ', '').lowercaseString
                        
                        tt << title unless tt.include?(title)
                        title_s = title.to_s    # for use as hash key in ruby
                        if not pp.key?(title_s)
                            pp[title_s] = [pron]
                        elsif not pp[title_s].include?(pron)
                            pp[title_s] << pron
                        end
                    end
                    # 
                    #   Let query_{k} denote sub-query for k-th substring defined by range u,
                    #       title_{k,i} denote i-th found title for query_{k},
                    #       pron_{k,i,j} denote j-th pronunciation for title_{k,i};
                    # 
                    #   array cc_k holds each collection of pronunciations per title_{k,i} found for query_{k}:
                    #       cc_k    = [c_{k,1}, c_{k,2}, ...]
                    #       c_{k,i} = [ title_{k,i},  pron_{k,i,1} *1( '(' pron_{k,i,2} ',' pron_{k,i,3} ',' ... ')' ) ]
                    # 
                    #   array dd holds list of cc_k for every sub-query_{k}
                    #       dd  = [cc_1, cc_2, ...]
                    # 
                    cc_k = tt.map do |t|
                        a = pp[t.to_s]
                        [t,  a.shift + (a == [] ? '' : "(%s)" % a.join(','))]
                    end
                    dd << cc_k
    
                    k = u.location + u.length
                    break unless k < q.length
                    q = q.substringFromIndex(k)
                end
            end
    
            case OUTPUT_FORMAT
            #   0 = interleaved : H[p] H[p]...
            #   1 = separate    : H H...[p p...]
            #   2 = pinyin only : p p...
            when 0
                ee = dd.map do |cc|
                    next '' if cc == []
                    ("%s[%s]" % cc.shift) + (cc == [] ? '' : "(%s)" % cc.map {|c| "%s[%s]" % c}.join(','))  
                end
                puts ee.join(' ')
            when 1
                aa = dd.map do |cc|
                    a, b = cc.transpose
                    next '' unless a
                    (a.shift) + (a == [] ? '' : "(%s)" % a.join(','))   
                end
                bb = dd.map do |cc|
                    a, b = cc.transpose
                    next '' unless b
                    (b.shift) + (b == [] ? '' : "(%s)" % b.join(','))   
                end
                puts "%s[%s]" % [aa.join(' '), bb.join(' ')]
            when 2
                bb = dd.map do |cc|
                    a, b = cc.transpose
                    next '' unless b
                    (b.shift) + (b == [] ? '' : "(%s)" % b.join(','))   
                end
                puts bb.join(' ')
            end
        end
        # 
        #   [1] DictionaryServices.framework/Resources/BridgeSupport/DictionaryServices.bridgesupport has problem to be fixed.
        #       I.e., in signatures of DCSCopyTextDefinition(), DCSGetTermRangeInString() function etc,
        #           {??=qq} should have been {_CFRange=qq}
        #           {??=ii} should have been {_CFRange=ii}
        #   [2] Fixed and extended bridgesupport file is loaded by OSX.load_bridge_support_file.
        #       It now includes signatures for several undocumented functions as well.
        #   [3] argv.to_ns is required to handle unicode characters correctly (in ruby 1.8).
        #   [4] DCSGetTermRangeInString(dct, q, 0) returning range [KCFNotFound, 0] does not necessarily mean q's 1st character
        #       as query may not match any term in dictionary.  It is necessary to use DCSCopyRecordsForSearchString() 
        #       for the 1st character in order to know the (existence of) matching term(s).
        # 
    EOF
    }
    
    hanzi2pinyin "$@"
    

     

     

     

    Tested under 10.6.8. As I used undocumented functions, they may have changed in later versions, which will break the script.

     

    Hope this may help,

    H

  • by Hiroto,

    Hiroto Hiroto Sep 16, 2014 2:35 PM in response to mingsai
    Level 5 (7,348 points)

    Hello

     

    Here's another tool. It is much simpler and easier. BUT... if only it were more accurate in hanzi-to-pinyin transliteration.

     

    The key function is CFStringTransform(), which implements ICU transliteration. The problem is that ICU Han-Latin transliteration works on a per-character basis and thus may yield a really wrong result when a character has multiple readings depending on context. It is too inaccurate to be used as a learning tool.
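
To illustrate why a per-character mapping goes wrong, here is a toy Ruby sketch. It does not call ICU or CFStringTransform(); the two tables and the method names are hypothetical stand-ins, with one fixed reading per character versus word-level entries that carry the reading in context.

```ruby
# Hypothetical per-character readings: one fixed reading per character.
CHAR_READINGS = {
  "大" => "dà", "地" => "de", "的" => "de", "天" => "tiān", "球" => "qiú"
}.freeze

# Hypothetical word-level entries that carry the contextual reading.
WORD_READINGS = { "天地" => "tiāndì", "地球" => "dìqiú", "大地" => "dàdì" }.freeze

# Transliterates character by character, ignoring context.
def per_character(text)
  text.each_char.map { |ch| CHAR_READINGS.fetch(ch, ch) }.join(" ")
end

# Prefers a whole-word entry and falls back to per-character readings.
def word_first(text)
  WORD_READINGS.fetch(text) { per_character(text) }
end

puts per_character("地球")   # 地 gets its fixed reading "de"
puts word_first("地球")      # the word entry supplies "dì"
```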

     

    cf.

    http://userguide.icu-project.org/transforms/general

     

     

    However, Traditional-Simplified and Simplified-Traditional transliterators will be handy and useful.

     

     

    #!/bin/bash
    
    hanzi_trans()
    {
        # 
        #   $@ = <ICU Transliterator ID> string [string ...]
        #   
        /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - "$@"
        require 'osx/cocoa'
        include OSX
        icu_tr = ARGV.shift;
        ARGV.map {|a| a.to_ns}.each do |q|
            CFStringTransform(q, nil, icu_tr, false)
            puts q
        end
    EOF
    }
    
    hanzi_trans 'Han-Latin' '悟空搗鬼花果山'       # => wù kōng dǎo guǐ huā guǒ shān  (CORRECT)
    hanzi_trans 'Han-Latin' '大地的女儿'          # => dà de de nǚ r  (WRONG)
    
    hanzi_trans 'Simplified-Traditional' '简体字' '繁体字'    # => 簡體字 \n 繁體字
    hanzi_trans 'Traditional-Simplified' '簡體字' '繁體字'    # => 简体字 \n 繁体字
    
    # 
    #   * Note that "Han-Latin" ICU transliteration seems to be done per Han character and thus may yield wrong result.
    #   
    #   E.g.
    # 
    #   CORRECT
    #       天地 = tiāndì
    #       地球 = dìqiú
    #       大地的女儿 = dàdì de nǚ'ér
    # 
    #   WRONG (ICU transliteration result)
    #       天地 = tiān de
    #       地球 = de qiú
    #       大地的女儿 = dà de de nǚ r
    # 
    

     

     

    Regards,

    H

  • by SGIII,

    SGIII SGIII Sep 17, 2014 12:27 PM in response to Hiroto
    Level 6 (10,782 points)
    Mac OS X

    Hi H,

     

    Nice work!  I would like to try your script in 10.9 (Mavericks) but am a little wobbly on installing scripts and getting them to work.

     

    I think I've found a way to unhide /usr/local/bin/ (in Mavericks you can't just go to Finder > Go > Go to Folder... and type it in).

     

    Then what do I do?

     

    Save the script just as hanzi2pinyin?  Or do I need a specific file extension such as .rb?

     

    How do I then chmod a+x it?

     

    And how would I call it from AppleScript do shell script?  Pipe in as stdin?

     

    Thanks,

     

    SG
