8 Replies Latest reply: Mar 9, 2013 2:20 AM by Hiroto
Queerly Level 1 (0 points)

Hi,

 

I've been dabbling in AppleScript for a few months to automate the mundane process of compiling learning materials for my Chinese studies. Now I'm trying to write a script that automatically looks up example sentences for Chinese vocabulary on a free Internet site.

 

What I've managed to do so far is to make AppleScript read a list of words from a text file. Then I use a repeat loop that, for each word, opens the dictionary's page and uses JavaScript to enter the word into the search field and press the search button. That part works fine, but where I'm stuck is getting the results the site returns. These are not in the page's HTML source itself, but are added afterwards through the DOM. I only have a very vague idea of how the DOM works, and my attempts so far have not done any good.

 

This is how I tried to drive the JavaScript:

 

 

set DelayTimeLong to 1

on LoadURL(theURL, DelayTimeLong)
    global SafariID
    tell application "Safari"
        set URL of document 1 of window id SafariID to theURL
        delay DelayTimeLong
        repeat ----use Safari's 'do JavaScript' to check a page's status
            if (do JavaScript "document.readyState" in document 1 of window id SafariID) is "complete" then exit repeat
            delay 1
        end repeat
    end tell
end LoadURL

-- initiate Safari
tell application "Safari"
    activate
    make new document with properties {URL:"http://google.com"}
    set SafariID to id of window 1 -- store the window ID
end tell

set NewWord to "汉语"
set theURL to "http://www.trainchinese.com/v1/a_user/index.php"
LoadURL(theURL, DelayTimeLong)
delay 3
tell application "Safari"
    do JavaScript "document.getElementById('searchWord').value='" & NewWord & "'" in document 1 of window id SafariID
    delay 3
    do JavaScript "document.getElementById('srchEnglishBtn').click()" in document 1 of window id SafariID
    set SearchResults to do JavaScript "MySearchResults = document.getElementById('searchresultWindow').getDOM" in document 1 of window id SafariID
    set SearchResults to do JavaScript "MySearchResults = document.documentElement.innerHTML" in document 1 of window id SafariID
    return SearchResults
end tell

 

The results are contained hierarchically within the element with the ID searchresultWindow, so that's where the first of the two "set SearchResults" lines comes from. The second was an attempt to get the entire DOM structure, in the hope of being able to search through it, but that failed too. What I got was just the source code of the webpage itself.

 

Solutions to the problem would be appreciated. So would any prods in the right direction.

 

Cheers!


MacBook Pro, Mac OS X (10.7.4)
  • Hiroto Level 5 (5,675 points)

    Hello

     

    You may try something like the following script.

    It will query the given word and return the contents of the search result table as plain text: one row per line, with tab-delimited fields. Extraction is done with JavaScript and the DOM.

     

    --applescript
    set query to "汉语"
    set r to query_word(query)
    
    --set the clipboard to r
    return r
    
    
    on query_word(query)
        script o
            property _url : "http://www.trainchinese.com/v1/a_user/index.php"
            
            property js0 : "// document ready?
    document.readyState == 'complete';
    "
            property js1 : "// query word
    var query = '" & query & "';
    document.getElementById('searchWord').value = query;
    document.getElementById('srchEnglishBtn').click();
    "
            property js2 : "// table ready?
    var div = document.getElementById('searchresultWindow');
    div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
    "
            property js3 : "// collect text from result table
    function main() 
    {
        var div = document.getElementById('searchresultWindow');
        var table = div.firstChild;                 // table element
        var trs = table.rows;                       // tr elements collection
        
        var rr = [];                                // tr text data array
        for (var i = 0; i < trs.length; ++i)
        {
            var tr = trs.item(i);                   // tr element
            var tds = tr.childNodes;                // td elements collection
            var dd = [];                            // td text data array
            for (var j = 0; j < tds.length; ++j)
            {
                var td = tds.item(j);               // td element
                dd.push(collectText(td));
            }
            rr.push( dd.join('\\t') );              // tds text delimited by tab
        }
        return rr.join('\\n');                      // trs text delimited by linefeed
    }
    function collectText(td)
    //    Node td : HTML td element
    //    return string : text data extracted from td element
    {
        if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
        
        var tt = [];
        for (var i = 0; i < td.childNodes.length; ++i)
        {
            tt.push( collectText( td.childNodes.item(i) ) );
        }
        return tt.join('');
    }
    main();
    "
            tell application "Safari"
                -- (0) get reference of search window (tab)
                -- get existing window (tab) if present
                set _tabs to tab 1 of windows whose URL = _url
                set _tab to missing value
                repeat with t in _tabs
                    set t to t's contents
                    if t is not missing value then
                        set _tab to t
                        exit repeat
                    end if
                end repeat
                -- create new window (tab) if not present
                if _tab = missing value then
                    make new document with properties {URL:_url}
                    delay 1 -- for some reason, some delay may be required here.
                    set _tab to window 1's current tab
                    -- wait till document ready
                    repeat until (do JavaScript js0 in _tab)
                        delay 0.5
                    end repeat
                end if
                --return _tab
                
                -- (1) query word
                do JavaScript js1 in _tab
                
                -- (2) wait till result table ready
                repeat until (do JavaScript js2 in _tab)
                    delay 0.5
                end repeat
                
                -- (3) collect text from result window
                return do JavaScript js3 in _tab
            end tell
        end script
        tell o to run
    end query_word
    --end of applescript
    

     

    Please be nice to the site and do not abuse it with intensive automated searches.

     

    Regards,

    H

  • Queerly Level 1 (0 points)

    Hi Hiroto,

     

    First of all, let me assure you that you do not have to worry about abuse, since the capabilities of the human brain set a very clear search limit. There are only so many words a week you can study. I might want to make a couple of dozen searches at first to catch up on my uni workload, but on the whole, there shouldn't be more than 10-15 search words a week.

     

    Secondly, let me say WOW! I had expected a hint here and there, a prod in the general direction, maybe even a complete answer to my question. I did not expect to find a complete solution! And this script runs so smoothly and so fast. And it delivers the results delimited by tabs, so that I can neatly format the information into a flashcard field. I can't really tell you how grateful I am.

     

    The only drawback to this approach is that I can't make head or tail of what the JavaScript commands all do; or rather, I know what they do (thank you for the comments), but not how they do it. I'm terribly ignorant of JavaScript.

     

    There are, however, two questions I'd like to ask:

     

    1) Is there a way to make the script make sure that all results are displayed before retrieving the table? By default the table shows 20 results, but if more sentences are in the database, a "Get more results" button is displayed.

     

    2) Secondly, the first cell in each row is a sign that lets you open the entry separately. Is there any way to get the id of this object, so that I could open each card in turn, extract the mp3 file with the pronunciation, and save it under the entry's name? I was considering adding each sentence as a separate card in my flashcard programme to practise pronunciation and word usage at the same time. Chinese is an awfully difficult language for native speakers of European languages (not that you would know; in my experience the Japanese are brilliant at Chinese). Having the audio file in my flashcard programme would be a great help. In fact, the site has its own free flashcard programme / iPhone app, but unfortunately its spaced repetition algorithm is terrible and not very productive, so in the end I stopped using it. Hence this attempt.

     

    Thanks again,

     

    Cheers!

  • Hiroto Level 5 (5,675 points)

    Hello

     

    Glad to know it helped somehow.

    As far as I can tell from browsing the relevant parts of the site's HTML source, both of your requests can be done. However, it'll take me some time to revise the script, for I have things to do. In a few days, maybe.

     

    And yes, we Japanese are familiar with (traditional) Chinese characters because we use (a variant of) them in everyday life. But most of us cannot speak Chinese or understand spoken Chinese, because the pronunciation of the characters is completely different from that in Japanese. So I'm interested in collecting the pronounced phrases as well.

     

    Later,

    H

  • Queerly Level 1 (0 points)

    Hi,

     

    Take your time, there's honestly no hurry. I have plenty of other vocabulary and material to study, and a BA thesis on my plate, too. That's why I was thinking of turning the time-consuming task of preparing flashcards into an automated process. Your solution already gives me what I need to easily find example sentences and import them into my flashcard programme. As for getting the audio files and importing the phrases/sentences as separate cards, that'd be great, but it can definitely wait a few days or even a few weeks if need be. Your time, your call. But I really hope you will find some time some day to write the script.

     

    I'm not sure how you want to use the material. For my purposes, in addition to downloading the MP3s, I planned to save the data for each sentence/phrase in a tab-delimited .txt file that I can then import into the flashcard programme. I thought of getting all the info available on the site (plus noting where the sentence/phrase comes from; I noticed this helps me remember the word better), i.e., saving it to a text file like this:

     

    Simplified_Chinese & tab & Traditional_Chinese & tab & Pinyin & tab & English_Translation & tab & "TrainChinese" & return

     

    Not sure if that would work for you.

     

    On a different note, how long have you been learning Chinese? What level are you at? Any resources you could recommend? I know this is not the right forum for it, so feel free to PM me.

     

    Cheers!

     

    Queerly

  • Hiroto Level 5 (5,675 points)

    Hello

     

    Sorry for the late reply.

    The script listed below is the first revised version; it collects all results by pressing the 'get more results' button.

    Retrieving the mp3 files is yet to be done. Clicking every entry in the result table seems to be the only way to get the URL of the corresponding mp3 file. I'm going to examine it more closely.

     

    set q to "汉语"
    set r to query_word(q)
    --set the clipboard to r
    return r
    
    on query_word(query)
        script o
            property _url : "http://www.trainchinese.com/v1/a_user/index.php"
            property delta : 0.2
            
            property js0 : "// document ready?
    document.readyState == 'complete';
    "
            property js1 : "// query word
    var query = '" & query & "';
    document.getElementById('searchWord').value = query;
    document.getElementById('srchEnglishBtn').click();
    "
            property js2 : "// table ready?
    var div = document.getElementById('searchresultWindow');
    div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
    "
            
            property js3 : "// collect text from result table
    function main() 
    {
        var div = document.getElementById('searchresultWindow');
        var table = div.firstChild;                 // table element
        var trs = table.rows;                       // tr elements collection
        
        var rr = [];                                // tr text data array
        for (var i = 0; i < trs.length; ++i)
        {
            var tr = trs.item(i);                   // tr element
            var tds = tr.childNodes;                // td elements collection
            var dd = [];                            // td text data array
            for (var j = 0; j < tds.length; ++j)
            {
                var td = tds.item(j);               // td element
                dd.push(collectText(td));
            }
            rr.push( dd.join('\\t') );              // tds text delimited by tab
        }
        return rr.join('\\n');                      // trs text delimited by linefeed
    }
    function collectText(td)
    //    Node td : HTML td element
    //    return string : text data extracted from td element
    {
        if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
        
        var tt = [];
        for (var i = 0; i < td.childNodes.length; ++i)
        {
            tt.push( collectText( td.childNodes.item(i) ) );
        }
        return tt.join('');
    }
    main();
    "
            on _js2a(x)
                "// click 'get more results' button with given id x if it exists
    var tr = document.getElementById('" & x & "');
    if (tr != null) { tr.firstChild.firstChild.click(); }
    tr != null;
    "
            end _js2a
            
            on _js2a1(x)
                "// check if 'get more results' button with given id x no longer exists
    var tr = document.getElementById('" & x & "');
    tr == null;
    "
            end _js2a1
            
            tell application "Safari"
                
                -- (0) get reference of search window (tab)
                -- get existing window (tab) if present
                set _tabs to tab 1 of windows whose URL = _url
                set _tab to missing value
                repeat with t in _tabs
                    set t to t's contents
                    if t is not missing value then
                        set _tab to t
                        exit repeat
                    end if
                end repeat
                -- create new window (tab) if not present
                if _tab = missing value then
                    make new document with properties {URL:_url}
                    delay 1 -- for some reason, some delay is required here in some cases
                    set _tab to window 1's current tab
                    -- wait till document ready
                    repeat until (do JavaScript js0 in _tab)
                        delay delta
                    end repeat
                end if
                --return _tab
                
                -- (1) query word
                do JavaScript js1 in _tab
                
                -- (2) wait till result table ready
                repeat until (do JavaScript js2 in _tab)
                    delay delta
                end repeat
                
                -- (2a) get more results
                (*
                    'get more results' button is in a tr element with id such as -
                        sr_more0, sr_more20, sr_more40, sr_more60, ...
                *)
                repeat with i from 0 to 400 by 20 -- 400 should be large enough
                    set id_ to "sr_more" & i
                    if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click more button with id_
                        -- wait till result table ready
                        repeat until (do JavaScript (my _js2a1(id_)) in _tab)
                            delay delta
                        end repeat
                    else -- more button with this id_ not exist
                        exit repeat
                    end if
                end repeat
                
                -- (3) collect text from result window
                return do JavaScript js3 in _tab
            end tell
        end script
        tell o to run
    end query_word
    

     

     

    Well, my level of spoken Chinese is virtually zero so far. However, I understand written Chinese rather well. And indeed I often read very old Chinese translations of Sanskrit Buddhist texts, which were mainly produced from the 3rd through the 8th centuries. You might think it strange that one who cannot catch spoken words can comprehend written text. But that is the magic of ideographs.

     

    That's all for now.

    Hiroto

  • Hiroto Level 5 (5,675 points)

    Hello

     

    My homework is almost done.

    Close examination of the site's HTML source revealed that we can get the sound URLs without clicking each phrase and opening its sub-window in the result table. The URLs are passed as an argument to the function that is called by the onclick handler of each row in the table, so we can parse and retrieve the URLs directly from the table. This really helps.
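
To make the idea concrete, here is a minimal JavaScript sketch of pulling an mp3 path out of the text of an onclick attribute. The handler name showEntry and the sample path are made up for illustration, so the real site's handler text may look different. The pattern also assumes the URL is passed as a single-quoted argument ending in .mp3.

```javascript
// Sketch only: extract the first single-quoted argument ending in ".mp3"
// from the text of an onclick attribute. Returns '' when nothing matches.
function extractMp3Url(onclickText) {
    if (typeof onclickText !== 'string') { return ''; }
    var m = onclickText.match(/'([^']+\.mp3)'/i);
    return m ? m[1] : '';
}

// Hypothetical onclick text (not copied from the site).
var sample = "showEntry('123', '../audio/sample.mp3')";
console.log(extractMp3Url(sample));
```

Rows without audio simply yield an empty string, which matches how a missing URL is represented in the tab-delimited output.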

     

    The script below will retrieve data from the result table in the form:

     

    [] \t [simplified Chinese] \t [traditional Chinese] \t [pinyin] \t [English translation] \t [pronunciation url]
    

     

    The first empty string [] comes from the empty-text td element at the start of each row, which holds the [+] mark. Some entries do not have a [pronunciation url], in which case that field will be an empty string in the result. URLs are relative to the search page.
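
For anyone post-processing the output outside AppleScript, the row layout described above could be split into named fields roughly as follows. This is only a sketch: the function name parseRow, the property names, and the sample line are illustrative and not part of the script, and header or 'get more results' rows would need to be filtered out separately.

```javascript
// Sketch only: split one tab-delimited result line into named fields.
// Field 0 is the empty [+] cell, so the real data starts at index 1.
function parseRow(line) {
    var f = line.split('\t');
    return {
        simplified: f[1] || '',
        traditional: f[2] || '',
        pinyin: f[3] || '',
        english: f[4] || '',
        mp3Url: f[5] || ''       // empty when the entry has no audio
    };
}

// Made-up sample line for illustration.
var row = parseRow('\t汉语\t漢語\tHànyǔ\tChinese language\t../audio/sample.mp3');
console.log(row.simplified, row.mp3Url);
```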

     

    The current script does not download the mp3 files, because there are several ways to organise them locally and I have yet to decide.

    E.g.

    a) name the file as [Chinese phrase].mp3 and put it in a [pronunciation] directory; or,

    b) reproduce the original directory structure locally and there put the file with the original name.

     

    I'm inclined to choose a), for it is simple to use and maintain. One problem with a) is that the file name can be a long Chinese string.

     

    You may decide on the scheme yourself and download the mp3 files. All the necessary information is collected by this script.

    I'm going to implement my downloading script later, probably next week.

     

     

    set q to "汉语"
    set r to query_word(q)
    --set the clipboard to r
    return r
    
    on query_word(query)
        script o
            property _url : "http://www.trainchinese.com/v1/a_user/index.php"
            property delta : 0.2
            
            property js0 : "// document ready?
    document.readyState == 'complete';
    "
            property js1 : "// query word
    var query = '" & query & "';
    document.getElementById('searchWord').value = query;
    document.getElementById('srchEnglishBtn').click();
    "
            property js2 : "// table ready?
    var div = document.getElementById('searchresultWindow');
    div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
    "
            on _js2a(x)
                "// click 'get more results' button with given id x if it exists
    var tr = document.getElementById('" & x & "');
    if (tr != null) { tr.firstChild.firstChild.click(); }
    tr != null;
    "
            end _js2a
            
            on _js2a1(x)
                "// check if 'get more results' button with given id x no longer exists
    var tr = document.getElementById('" & x & "');
    tr == null;
    "
            end _js2a1
            
            property js3 : "// collect text and mp3 urls from result table
    function main() 
    {
        var div = document.getElementById('searchresultWindow');
        var table = div.firstChild;                 // table element
        var trs = table.rows;                       // tr elements collection
        
        var re = /([^']+\\.mp3)/gi;                 // regexp pattern to match mp3 url
        var rr = [];                                // tr text data array
        for (var i = 0; i < trs.length; ++i)
        {
            var tr = trs.item(i);                   // tr element
            var f = tr.getAttribute('onclick');     // f = onclick function string
            var m;                                  // m = regexp match result array
            var url =                               // mp3 url
                (f != null && (m = f.match(re)) != null)
                    ? m[0] 
                    : '';
        
            var tds = tr.childNodes;                // td elements collection
            var dd = [];                            // td text data array
            for (var j = 0; j < tds.length; ++j)
            {
                var td = tds.item(j);               // td element
                dd.push(collectText(td));
            }
            dd.push(url);                           // append url to td text data array
            rr.push( dd.join('\\t') );              // tds data delimited by tab
        }
        return rr.join('\\n');                      // trs text delimited by linefeed
    }
    function collectText(td)
    //    Node td : HTML td element
    //    return string : text data extracted from td element
    {
        if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
        
        var tt = [];
        for (var i = 0; i < td.childNodes.length; ++i)
        {
            tt.push( collectText( td.childNodes.item(i) ) );
        }
        return tt.join('');
    }
    main();
    "
            tell application "Safari"
                
                -- (0) get reference of search window (tab)
                -- get existing window (tab) if present
                set _tabs to tab 1 of windows whose URL = _url
                set _tab to missing value
                repeat with t in _tabs
                    set t to t's contents
                    if t is not missing value then
                        set _tab to t
                        exit repeat
                    end if
                end repeat
                -- create new window (tab) if not present
                if _tab = missing value then
                    make new document with properties {URL:_url}
                    delay 1 -- for some reason, some delay is required here in some cases
                    set _tab to window 1's current tab
                    -- wait till document ready
                    repeat until (do JavaScript js0 in _tab)
                        delay delta
                    end repeat
                end if
                --return _tab
                
                -- (1) query word
                do JavaScript js1 in _tab
                
                -- (2) wait till result table ready
                repeat until (do JavaScript js2 in _tab)
                    delay delta
                end repeat
                
                -- (2a) get more results
                (*
                    'get more results' button is in a tr element with id such as -
                        sr_more0, sr_more20, sr_more40, sr_more60, ...
                *)
                repeat with i from 0 to 400 by 20 -- 400 should be large enough
                    set id_ to "sr_more" & i
                    if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click more button with id_
                        -- wait till result table ready
                        repeat until (do JavaScript (my _js2a1(id_)) in _tab)
                            delay delta
                        end repeat
                    else -- more button with this id_ not exist
                        exit repeat
                    end if
                end repeat
                
                -- (3) collect text and mp3 urls from result window
                return do JavaScript js3 in _tab
            end tell
        end script
        tell o to run
    end query_word
    

     

    Kind regards,

    H

  • Queerly Level 1 (0 points)

    Hi Hiroto,

     

    This is mighty. I'd never have come up with that; I simply don't know enough JavaScript. I found, however, one problem with the script in its current form. It works fine provided the search term is in the dictionary. If it isn't, the script gets locked in an infinite loop, waiting for results that do not exist to load. It'd be great if you could find the time to revise the script so that it checks whether any results are returned. I can't do it myself, because the script is too complex for my understanding.
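
One common way to avoid an endless wait is to cap the number of polls and give up once the cap is reached. The JavaScript sketch below shows only that pattern. The names pollUntil, maxTries and pause are hypothetical, and in the actual script the same idea would presumably go into the AppleScript "repeat until" loops, for example with a counter and an "exit repeat".

```javascript
// Sketch only: call check() up to maxTries times, pausing between attempts.
// Returns true as soon as check() succeeds, false if every attempt fails.
function pollUntil(check, maxTries, pause) {
    for (var attempt = 1; attempt <= maxTries; attempt++) {
        if (check()) { return true; }
        if (pause) { pause(); }    // optional delay supplied by the caller
    }
    return false;
}

// Usage: a check that succeeds on the third call, and one that never does.
var calls = 0;
var found = pollUntil(function () { calls += 1; return calls >= 3; }, 10);
var missing = pollUntil(function () { return false; }, 10);
console.log(found, missing);
```

When the function returns false, the caller can treat the word as "no results" and move on to the next one instead of hanging.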

     

    However, where I could help is getting the files downloaded. I wrote a script that saves the mp3s under their Chinese names in a separate Audio folder (the script expects an Audio subfolder in the output directory; I wasn't sure how to create it reliably from a shell script, so I made mine manually). It also allows you to specify a suffix for the file name (you may leave it blank if you don't want one, but I use it to distinguish between files pronounced by male and female speakers; the database is all female).

     

    The script also performs a search for a number of new words read from a text file with one word per line. It saves the results to two tab-delimited text files, one for entries with audio and a second for entries without audio. It also extracts the part-of-speech information into a separate field and saves that as well. Additionally, the text file for entries with audio has an extra column containing "[sound:" & Filename.mp3 & "]", because that's the tag my flashcard programme uses to incorporate audio files. There is also an extra check column with the status of the download, e.g. success or failed. If you find "failed" in the text file for entries without audio, it means the script failed to extract the proper URL for the mp3. That has not happened to me so far.

     

    Oh yeah, there's a dialog box asking you to specify the delay between separate mp3 downloads. It's there to simulate more human-like behaviour so that the server does not feel under attack. The default is 5 seconds.

     


    Also, more importantly, the script reads the list of files already in the audio folder and checks it before downloading each mp3, so that files which were already downloaded are skipped. That's to protect the server as well.
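
The duplicate check boils down to comparing each planned file name against the names already on disk. Here is a minimal JavaScript sketch of that logic. The function name pendingDownloads and the sample words are made up for illustration, and the real script does the equivalent in AppleScript with a list.

```javascript
// Sketch only: decide which entries still need a download.
// existingFiles: file names already in the audio folder.
// suffix: the optional file-name suffix (e.g. '_f').
function pendingDownloads(entries, existingFiles, suffix) {
    var have = new Set(existingFiles);
    var pending = [];
    entries.forEach(function (name) {
        var fileName = name + suffix + '.mp3';
        if (!have.has(fileName)) {
            pending.push(fileName);
            have.add(fileName);      // also skip repeats within this run
        }
    });
    return pending;
}

var todo = pendingDownloads(['汉语', '你好', '汉语'], ['你好_f.mp3'], '_f');
console.log(todo);
```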

     

    There's also an extra file being generated, called !!!Progress_Counter.txt. It's there to indicate the progress of the current task and to help you find out which item was the last one successfully processed in case the script crashes or the internet connection drops.

     

    Sorry if the script seems a bit messy. I'm not very experienced with programming. I'm sure you could make a much neater job of it. But I hated the fact that you end up doing all the work while I'm just sitting here waiting for the end result like a spoilt brat. I wanted to help out a bit too.

     

    So, if you can correct the infinite loop problem and maybe add automatic creation of the "Audio" subfolder in the output folder if it does not already exist there, I think we'd be done with the script. Though, of course, feel free to improve the script even further.

     

     

    set NewWordList to paragraphs of (read (choose file with prompt "Open List of new files") as «class utf8»)
    set OutputFolderPath to (choose folder with prompt "Open Output Folder")
    set OutputAudioPath to OutputFolderPath & "Audio:" as string
    set OutputLogFilePath to OutputFolderPath & "!Search_Results_with_Audio.txt" as string
    set SkippedLogFilePath to OutputFolderPath & "!Search_Results_NO_Audio.txt" as string
    set CounterLogFilePath to OutputFolderPath & "!!!Progress_Counter.txt" as string
    set PosixOutputAudioPath to POSIX path of OutputAudioPath
    --return PosixOutputAudioPath
    -- create the Audio subfolder if it is missing ("mkdir -p" does nothing when it already exists)
    do shell script "mkdir -p " & quoted form of PosixOutputAudioPath
    set StartItem to (text returned of (display dialog "Set start item to:" default answer "1")) as integer
    
    set DelayTime to (text returned of (display dialog "In order to prevent misuse of the online dictionary and make this search seem more natural to the administrator of the domain, please choose the length of the delay (in seconds) between downloading each audio file." default answer "5")) as number
    
    set FileNameSuffix to text returned of (display dialog "Each audio file will be automatically named after the Chinese entry, but you may choose to add a suffix at the end of each file name. What suffix should be added? (Leave blank if you do not want to modify the name at all.)" default answer "_f")
    
    set DownloadedCache to (list folder OutputAudioPath without invisibles)
    repeat with i from 1 to count of DownloadedCache
        -- strip the chosen suffix and extension so the cache holds bare entry names
        set item i of DownloadedCache to replaceString(FileNameSuffix & ".mp3", "", item i of DownloadedCache)
    end repeat
    set startTime to do shell script "date +%s"
    set CounterLimit to count of NewWordList
    
    repeat with i from StartItem to CounterLimit
              set NewWord to item i of NewWordList
              set query_results to query_word(NewWord)
      
      download_results(query_results)
              set CounterEntry to CompileCounterEntry(i)
      WriteLogEntry(CounterEntry, CounterLogFilePath)
    end repeat
    
    on download_results(query_results)
              global FileNameSuffix, OutputFolderPath, OutputAudioPath, FileNameSuffix, DelayTime, OutputLogFilePath, SkippedLogFilePath, DownloadedCache
              set SearchList to paragraphs of replaceString("           ", "", query_results)
              repeat with i from 1 to count of SearchList
                        set currentItem to item i of SearchList
                        if currentItem is not "" then
                                  if currentItem is not "          Chinese          Pinyin          Translation          " then
                                            set SimplifiedChinese to ""
                                            set TraditionalChinese to ""
                                            set PinyinEntry to ""
                                            set AudioDownloadURL to ""
                                            set TranslationEntry to ""
                                            set PartofSpeach to ""
                                            set DownloadStatus to ""
                                            set FileName to ""
                                            set tid to AppleScript's text item delimiters -- save them for later
                                            set AppleScript's text item delimiters to tab
                                            set currentItemCont to text items of currentItem
                                            set SimplifiedChinese to item 1 of currentItemCont
                                            set TraditionalChinese to item 2 of currentItemCont
                                            set PinyinEntry to item 3 of currentItemCont
                                            set TranslationEntry to item 4 of currentItemCont
                                            set AudioDownloadURL to last item of currentItemCont
                                            set AppleScript's text item delimiters to tid -- back to original values.
                                            set PartofSpeach to extractBetween("[", "]", TranslationEntry)
                                            set TranslationEntry to replaceString("[" & PartofSpeach & "]", "", TranslationEntry)
                                            if DownloadedCache does not contain SimplifiedChinese then
                                                      if AudioDownloadURL is not "" then
                                                                set AudioDownloadURL to "http://www.trainchinese.com/v1/" & replaceString("../", "", AudioDownloadURL)
                                                                set FileName to SimplifiedChinese & FileNameSuffix as string
                                                                set DownloadStatus to DownloadAudio(AudioDownloadURL, FileName)
                                                                set LogEntry to CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
                                                                WriteLogEntry(LogEntry, OutputLogFilePath)
                                                                set DownloadedCache to DownloadedCache & SimplifiedChinese
                                                                delay DelayTime
                                                      else -- no audio
                                                                set LogEntry to CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
                                                                WriteLogEntry(LogEntry, SkippedLogFilePath)
                                                                -- skipped
                                                      end if
                                            end if
                                  end if
                        end if
      
              end repeat
    end download_results
    
    
    
    on query_word(query)
              script o
                        property _url : "http://www.trainchinese.com/v1/a_user/index.php"
                        property delta : 0.2
      
                        property js0 : "// document ready?
    document.readyState == 'complete';
    "
                        property js1 : "// query word
    var query = '" & query & "';
    document.getElementById('searchWord').value = query;
    document.getElementById('srchEnglishBtn').click();
    "
                        property js2 : "// table ready?
    var div = document.getElementById('searchresultWindow');
    div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
    "
                        on _js2a(x)
                                  "// click 'get more results' button with given id x if it exists
    var tr = document.getElementById('" & x & "');
    if (tr != null) { tr.firstChild.firstChild.click(); }
    tr != null;
    "
                        end _js2a
      
                        on _js2a1(x)
                                  "// check if 'get more results' button with given id x no longer exists
    var tr = document.getElementById('" & x & "');
    tr == null;
    "
                        end _js2a1
      
                        property js3 : "// collect text and mp3 urls from result table
    function main() 
    {
        var div = document.getElementById('searchresultWindow');
        var table = div.firstChild;                 // table element
        var trs = table.rows;                       // tr elements collection
        
        var re = /([^']+\\.mp3)/gi;                 // regexp pattern to match mp3 url
        var rr = [];                                // tr text data array
        for (var i = 0; i < trs.length; ++i)
        {
            var tr = trs.item(i);                   // tr element
            var f = tr.getAttribute('onclick');     // f = onclick function string
            var m;                                  // m = regexp match result array
            var url =                               // mp3 url
                (f != null && (m = f.match(re)) != null)
                    ? m[0] 
                    : '';
        
            var tds = tr.childNodes;                // td elements collection
            var dd = [];                            // td text data array
            for (var j = 0; j < tds.length; ++j)
            {
                var td = tds.item(j);               // td element
                dd.push(collectText(td));
            }
            dd.push(url);                           // append url to td text data array
            rr.push( dd.join('\\t') );              // tds data delimited by tab
        }
        return rr.join('\\n');                      // trs text delimited by linefeed
    }
    function collectText(td)
    //    Node td : HTML td element
    //    return string : text data extracted from td element
    {
        if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
        
        var tt = [];
        for (var i = 0; i < td.childNodes.length; ++i)
        {
            tt.push( collectText( td.childNodes.item(i) ) );
        }
        return tt.join('');
    }
    main();
    "
                        tell application "Safari"
      
      -- (0) get reference of search window (tab)
      -- get existing window (tab) if present
                                  set _tabs to tab 1 of windows whose URL = _url
                                  set _tab to missing value
                                  repeat with t in _tabs
                                            set t to t's contents
                                            if t is not missing value then
                                                      set _tab to t
                                                      exit repeat
                                            end if
                                  end repeat
      -- create new window (tab) if not present
                                  if _tab = missing value then
      make new document with properties {URL:_url}
      delay 1 -- for some reason, some delay is required here in some cases
                                            set _tab to window 1's current tab
      -- wait till document ready
                                            repeat until (do JavaScript js0 in _tab)
                                                      delay delta
                                            end repeat
                                  end if
      --return _tab
      
      -- (1) query word
      do JavaScript js1 in _tab
      
      -- (2) wait till result table ready
                                  repeat until (do JavaScript js2 in _tab)
                                            delay delta
                                  end repeat
      
      -- (2a) get more results
                                  (*
                    'get more results' button is in a tr element with id such as -
                        sr_more0, sr_more20, sr_more40, sr_more60, ...
                *)
                                  repeat with i from 0 to 400 by 20 -- 400 should be large enough
                                            set id_ to "sr_more" & i
                                            if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click more button with id_
      -- wait till result table ready
                                                      repeat until (do JavaScript (my _js2a1(id_)) in _tab)
                                                                delay delta
                                                      end repeat
                                            else -- no 'more' button with this id_ exists
                                                      exit repeat
                                            end if
                                  end repeat
      
      -- (3) collect text and mp3 urls from result window
                                  return do JavaScript js3 in _tab
                        end tell
              end script
              tell o to run
    end query_word
    
    
    on replaceString(oldString, newString, TextInput)
              local ASTID, oldString, newString, lst
              set ASTID to AppleScript's text item delimiters
              try
                        considering case
                                  set AppleScript's text item delimiters to oldString
                                  set lst to every text item of TextInput
                                  set AppleScript's text item delimiters to newString
                                  set TextOutput to lst as string
                        end considering
                        set AppleScript's text item delimiters to ASTID
                        return TextOutput
              on error eMsg number eNum
                        set AppleScript's text item delimiters to ASTID
                        error "Can't replaceString: " & eMsg number eNum
              end try
    end replaceString
    
    on extractBetween(startText, endText, TextInput)
              -- nothing to extract when the start marker is absent (this also covers empty input)
              if TextInput does not contain startText then return ""
              set tid to AppleScript's text item delimiters -- save them for later.
              set AppleScript's text item delimiters to startText -- split at the start marker.
              set endItems to text item -1 of TextInput -- everything after the last start marker.
              set AppleScript's text item delimiters to endText -- split at the end marker.
              set TextOutput to text item 1 of endItems -- everything before the first end marker.
              set AppleScript's text item delimiters to tid -- back to original values.
              return TextOutput
    end extractBetween
    
    on DownloadAudio(AudioDownloadURL, FileName)
              global OutputAudioPath
              -- use a local copy so the global is not converted again on every call
              set audioDir to POSIX path of OutputAudioPath
              try
                        -- quote both arguments so spaces and shell metacharacters in the URL or path are safe
                        do shell script "curl " & quoted form of AudioDownloadURL & " -o " & quoted form of (audioDir & FileName & ".mp3")
                        return "success"
              on error
                        return "failed"
              end try
    end DownloadAudio
    
    
    on WriteLogEntry(LogEntry, LogFilePath)
              set LogFile to open for access LogFilePath with write permission
      write LogEntry to LogFile starting at eof as «class utf8»
      close access LogFile
    end WriteLogEntry
    
    on CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
              if DownloadStatus is "success" then
                        return SimplifiedChinese & tab & TraditionalChinese & tab & PinyinEntry & tab & TranslationEntry & tab & PartofSpeach & tab & "[sound:" & FileName & ".mp3]" & tab & DownloadStatus & return
              else
                        return SimplifiedChinese & tab & TraditionalChinese & tab & PinyinEntry & tab & TranslationEntry & tab & PartofSpeach & tab & DownloadStatus & return
              end if
    end CompileLogEntry
    
    
    on CompileCounterEntry(i)
              global startTime, StartItem, CounterLimit, NewWord
              set currentTime to do shell script "date +%s"
              set timeItTook to (currentTime - startTime)
              set timeItTookInSec to timeItTook mod 3600 mod 60 as integer
              set timeItTookInMin to timeItTook mod 3600 div 60 as integer
              set timeItTookInHours to timeItTook div 3600 as integer
      
              if timeItTookInSec < 10 then
                        set timeItTookInSec to "0" & timeItTookInSec
              end if
              if timeItTookInMin < 10 then
                        set timeItTookInMin to "0" & timeItTookInMin
              end if
              set timeCounter to "" & timeItTookInHours & ":" & timeItTookInMin & ":" & timeItTookInSec
              if timeItTook is 0 then set timeItTook to 1
              set ProgressSpeed to ((i - (StartItem - 1)) / timeItTook)
              set RemainingTime to (CounterLimit - i) / ProgressSpeed
              set RemainingTime_Sec to RemainingTime mod 3600 mod 60 as integer
              set RemainingTime_Min to RemainingTime mod 3600 div 60 as integer
              set RemainingTime_Hours to RemainingTime div 3600
              set RemainingTime_Days to ((RemainingTime div 3600) div 24)
      
              if RemainingTime_Sec < 10 then
                        set RemainingTime_Sec to "0" & RemainingTime_Sec
              end if
              if RemainingTime_Min < 10 then
                        set RemainingTime_Min to "0" & RemainingTime_Min
              end if
      
              if RemainingTime_Days > 1 then
                        set RemainingTime_Days to RemainingTime_Days & " days "
                        set RemainingTimeCounter to RemainingTime_Days
              else
                        if RemainingTime_Days = 1 then
                                  set RemainingTime_Days to RemainingTime_Days & " day " & (RemainingTime_Hours - 24) & ":" & RemainingTime_Min & ":" & RemainingTime_Sec
                                  set RemainingTimeCounter to RemainingTime_Days
      
                        else
                                  set RemainingTimeCounter to "" & RemainingTime_Hours & ":" & RemainingTime_Min & ":" & RemainingTime_Sec
      
                        end if
              end if
              set PercentDisplay to "" & (i / CounterLimit * 100) * 100 div 100 & "," & (i / CounterLimit * 100) * 100 mod 100 * 100 div 100 & "%"
      
              set ProgressSpeedDisplay to ((((i - (StartItem - 1)) / (timeItTook / 60))) as string) & " words/min"
      
              set CounterEntry to ("Processing item " & i & " of " & CounterLimit & "   (" & PercentDisplay & ")" & return & return & NewWord & return & "Processed:" & tab & (i - (StartItem - 1)) & return & "Remaining:" & tab & (CounterLimit - i) & return & return & return & "Runtime: " & timeCounter & return & "Remaining: " & RemainingTimeCounter & return & "Speed: " & ProgressSpeedDisplay) as «class utf8»
    end CompileCounterEntry
    
    on WriteCounterEntry(CounterEntry, LogFilePath)
              set LogFile to open for access LogFilePath with write permission
              set eof of LogFile to 0 -- empty the file so it only holds the latest counter entry
              write CounterEntry to LogFile starting at eof as «class utf8»
              close access LogFile
    end WriteCounterEntry
    
    

     

     

    Hope this works for you

     

    Best regards,

     

    Queerly

  • Hiroto Level 5 (5,675 points)

    Hello Queerly,

     

    I completely overlooked the possibility of the search returning a "not found" result.

    Here is a revised script that handles that case correctly. It also adds a time_out property that sets the maximum time to wait for each operation to complete. time_out is currently set to 10 seconds, so the script throws an error if the data is not ready within 10 seconds of sending the request. The time-out is only there to keep the script from getting stuck in a loop in unexpected cases. When the search returns a "not found" result within the time_out period, the revised query_word() handler returns an empty string.

     

    So the usage would be something like this:

     

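    set query_results to missing value -- stays missing value if query_word() throws
    try
        set query_results to query_word(q)
    on error errs number errn
        -- log error
    end try
    if query_results is missing value then
        -- query failed; the error was logged above
    else if query_results = "" then
        -- log no-result
    else
        -- process query_results
    end if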
    

     

     

    Here's the script.

    --------------------------------

    set q to "汉语"
    --set q to "仏" -- not found in database
    --set q to "佛"
    set r to query_word(q)
    --set the clipboard to r
    return r
    
    on query_word(query)
        script o
            property _url : "http://www.trainchinese.com/v1/a_user/index.php"
            property delta : 0.2 -- interval to check if data is ready [sec]
            property time_out : 10 -- time to wait for data to be ready [sec]
    
            property js0 : "// document ready status
    // 0 = not ready; 1 = ready
    document.readyState == 'complete' ? 1 : 0;
    "
            property js1 : "// query word
    var query = '" & query & "';
    document.getElementById('searchWord').value = query;
    document.getElementById('srchEnglishBtn').click();
    "
            property js2 : "// table ready status
    // 0 = result is not ready; 1 = table is ready; 2 = not found
    var div = document.getElementById('searchresultWindow');
    div.hasChildNodes() && (div.firstChild.tagName == 'TABLE') ? 1 :
    div.hasChildNodes() && (div.firstChild.tagName == 'DIV')   ? 2 : 0;
    "
            on _js2a(x)
                "// click 'get more results' button with given id x if it exists
    var tr = document.getElementById('" & x & "');
    if (tr != null) { tr.firstChild.firstChild.click(); }
    tr != null;
    "
            end _js2a
    
            on _js2a1(x)
                "// existence status of 'get more results' button with given id x
    // 0 = the button exists; 1 = the button does not exist
    var tr = document.getElementById('" & x & "');
    tr == null ? 1 : 0;
    "
            end _js2a1
    
            property js3 : "// collect text and mp3 urls from result table
    function main() 
    {
        var div = document.getElementById('searchresultWindow');
        var table = div.firstChild;                 // table element
        var trs = table.rows;                       // tr elements collection
    
        var re = /([^']+\\.mp3)/gi;                 // regexp pattern to match mp3 url
        var rr = [];                                // tr text data array
        for (var i = 0; i < trs.length; ++i)
        {
            var tr = trs.item(i);                   // tr element
            var f = tr.getAttribute('onclick');     // f = onclick function string
            var m;                                  // m = regexp match result array
            var url =                               // mp3 url
                (f != null && (m = f.match(re)) != null)
                    ? m[0] 
                    : '';
    
            var tds = tr.childNodes;                // td elements collection
            var dd = [];                            // td text data array
            for (var j = 0; j < tds.length; ++j)
            {
                var td = tds.item(j);               // td element
                dd.push(collectText(td));
            }
            dd.push(url);                           // append url to td text data array
            rr.push( dd.join('\\t') );              // tds data delimited by tab
        }
        return rr.join('\\n');                      // trs text delimited by linefeed
    }
    function collectText(td)
    //    Node td : HTML td element
    //    return string : text data extracted from td element
    {
        if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
    
        var tt = [];
        for (var i = 0; i < td.childNodes.length; ++i)
        {
            tt.push( collectText( td.childNodes.item(i) ) );
        }
        return tt.join('');
    }
    main();
    "
            tell application "Safari"
    
                -- (0) get reference of search window (tab)
                -- get existing window (tab) if present
                set _tabs to tab 1 of windows whose URL = _url
                set _tab to missing value
                repeat with t in _tabs
                    set t to t's contents
                    if t is not missing value then
                        set _tab to t
                        exit repeat
                    end if
                end repeat
                -- create new window (tab) if not present
                if _tab = missing value then
                    make new document with properties {URL:_url}
                    delay 1 -- for some reason, some delay is required here in some cases
                    set _tab to window 1's current tab
                    -- wait till document ready
                    set r to 0
                    repeat (time_out div delta) times
                        set r to do JavaScript js0 in _tab
                        if r = 1 then exit repeat
                        delay delta
                    end repeat
                    if r = 0 then error "query_word(): document not ready in " & time_out & " [sec]" number 8000
                end if
                --return _tab
    
                -- (1) query word
                do JavaScript js1 in _tab
    
                -- (2) wait till result is ready
                set r to 0
                repeat (time_out div delta) times
                    set r to do JavaScript js2 in _tab
                    if r = 2 then return "" -- not found
                    if r = 1 then exit repeat -- table ready
                    delay delta
                end repeat
                if r = 0 then error "query_word(): result not ready in " & time_out & " [sec]" number 8001
    
                -- (2a) get more results
                (*
                    'get more results' button is in a tr element with id such as -
                        sr_more0, sr_more20, sr_more40, sr_more60, ...
                *)
                repeat with i from 0 to 400 by 20 -- 400 should be large enough
                    set id_ to "sr_more" & i
                    if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click button with id_
                        -- wait till result table ready
                        set r to 0
                        repeat (time_out div delta) times
                            set r to do JavaScript (my _js2a1(id_)) in _tab
                            if r = 1 then exit repeat
                            delay delta
                        end repeat
                        if r = 0 then error "query_word(): table not ready in " & time_out & " [sec]" number 8002
                    else -- no button with this id_ exists
                        exit repeat
                    end if
                end repeat
    
                -- (3) collect text and mp3 urls from result window
                return do JavaScript js3 in _tab
            end tell
        end script
        tell o to run
    end query_word
    

    --------------------------------
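
    In case it helps to see the status logic on its own, here is a rough sketch that restates the three-state check from js2 as a plain JavaScript function, so it can be tried outside Safari. The names resultStatus and fakeDiv are made up for this illustration and are not part of the script above. fakeDiv is only a stand-in for the real searchresultWindow element, and the idea that a "not found" page puts a DIV first is simply what js2 assumes about the site.

```javascript
// Hypothetical helper mirroring js2:
// 0 = result is not ready; 1 = table is ready; 2 = not found
function resultStatus(div) {
    if (!div || !div.hasChildNodes()) { return 0; }
    var tag = div.firstChild.tagName;
    return tag === 'TABLE' ? 1 : tag === 'DIV' ? 2 : 0;
}

// Stand-in for a DOM element (assumption: only the two members
// that resultStatus touches are modelled here)
function fakeDiv(firstTag) {
    return {
        firstChild: firstTag ? { tagName: firstTag } : null,
        hasChildNodes: function () { return this.firstChild !== null; }
    };
}

console.log(resultStatus(fakeDiv(null)));    // still waiting
console.log(resultStatus(fakeDiv('TABLE'))); // results table present
console.log(resultStatus(fakeDiv('DIV')));   // "not found" message present
```

    The AppleScript side only ever sees the number, which is why the polling loop can tell "keep waiting" (0) apart from "stop and return an empty string" (2).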

     

    And thank you for posting your entire script. It is very nice of you to be considerate of the proper use of the site.

    I'll peruse it later. I'm not using any flashcard programme at the moment, but I'll look into some.

     

    Oh, and as for creating directories, it's simple. You may use "mkdir -p" like this.

     

    set dir_path to "/Users/chinskycraze/Desktop/!NCIKU_Dictionaries/Output/Audio"
    do shell script "mkdir -p " & dir_path's quoted form
    

     

    The "-p" option makes mkdir create any missing intermediate directories in the path.

     

    Best wishes from Japan,

    Hiroto