Returning the content of DOM elements using AppleScript

Question

Level 1

0 points

Hi,

I've been dabbling in AppleScript for a few months to automate the mundane process of compiling learning materials for my Chinese studies. Now, I'm trying to write a script that would enable me to automatically search for example sentences of Chinese vocabs on a free Internet site.

What I've managed to do so far is to make AppleScript read a list of vocabs from a text file. Then I use a repeat loop over each vocab to open the dictionary's page, use JavaScript to enter the vocab into the search field, and press the search button. That part works fine, but I'm stuck on getting the results the site returns. These are not part of the page's HTML source; they only show up in the DOM structure. I only have a very vague idea of how the DOM works, and my attempts so far have not done any good.

This is how I tried to drive the page with JavaScript:

set DelayTimeLong to 1

on LoadURL(theURL, DelayTimeLong)
    global SafariID
    tell application "Safari"
        set URL of document 1 of window id SafariID to theURL
        delay DelayTimeLong
        repeat
            -- use Safari's 'do JavaScript' to check a page's status
            if (do JavaScript "document.readyState" in document 1 of window id SafariID) is "complete" then exit repeat
            delay 1
        end repeat
    end tell
end LoadURL

-- initiate Safari
tell application "Safari"
    activate
    make new document with properties {URL:"http://google.com"}
    set SafariID to id of window 1 -- store the window ID
end tell

set NewWord to "汉语"
set theURL to "http://www.trainchinese.com/v1/a_user/index.php"
LoadURL(theURL, DelayTimeLong)
delay 3

tell application "Safari"
    do JavaScript "document.getElementById('searchWord').value='" & NewWord & "'" in document 1 of window id SafariID
    delay 3
    do JavaScript "document.getElementById('srchEnglishBtn').click()" in document 1 of window id SafariID
    set SearchResults to do JavaScript "MySearchResults = document.getElementById('searchresultWindow').getDOM" in document 1 of window id SafariID
    set SearchResults to do JavaScript "MySearchResults = document.documentElement.innerHTML" in document 1 of window id SafariID
    return SearchResults
end tell

The results are contained hierarchically within the element with the ID searchresultWindow, so that's where the first 'set SearchResults' line comes from. The second 'set SearchResults' line was an attempt to get the entire DOM structure, in the hope of being able to search through it, but that failed too. What I got was just the source code of the webpage itself.

Solutions to the problem would be appreciated. So would any prods in the right direction.

Cheers!

MacBook Pro, Mac OS X (10.7.4)

Posted on Feb 25, 2013 7:56 AM

Answer 1

Best reply

Hiroto

Level 5

7,461 points

Feb 25, 2013 7:46 PM in response to Queerly

Hello

You may try something like the following script.

It will query the given word and return the contents of the search result table as plain text: one row per line, with tab-delimited fields. Extraction is done with JavaScript and the DOM.

--applescript
set query to "汉语"
set r to query_word(query)

--set the clipboard to r
return r


on query_word(query)
    script o
        property _url : "http://www.trainchinese.com/v1/a_user/index.php"
        
        property js0 : "// document ready?
document.readyState == 'complete';
"
        property js1 : "// query word
var query = '" & query & "';
document.getElementById('searchWord').value = query;
document.getElementById('srchEnglishBtn').click();
"
        property js2 : "// table ready?
var div = document.getElementById('searchresultWindow');
div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
"
        property js3 : "// collect text from result table
function main() 
{
    var div = document.getElementById('searchresultWindow');
    var table = div.firstChild;                 // table element
    var trs = table.rows;                       // tr elements collection
    
    var rr = [];                                // tr text data array
    for (var i = 0; i < trs.length; ++i)
    {
        var tr = trs.item(i);                   // tr element
        var tds = tr.childNodes;                // td elements collection
        var dd = [];                            // td text data array
        for (var j = 0; j < tds.length; ++j)
        {
            var td = tds.item(j);               // td element
            dd.push(collectText(td));
        }
        rr.push( dd.join('\\t') );              // tds text delimited by tab
    }
    return rr.join('\\n');                      // trs text delimited by linefeed
}
function collectText(td)
//    Node td : HTML td element
//    return string : text data extracted from td element
{
    if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
    
    var tt = [];
    for (var i = 0; i < td.childNodes.length; ++i)
    {
        tt.push( collectText( td.childNodes.item(i) ) );
    }
    return tt.join('');
}
main();
"
        tell application "Safari"
            -- (0) get reference of search window (tab)
            -- get existing window (tab) if present
            set _tabs to tab 1 of windows whose URL = _url
            set _tab to missing value
            repeat with t in _tabs
                set t to t's contents
                if t is not missing value then
                    set _tab to t
                    exit repeat
                end if
            end repeat
            -- create new window (tab) if not present
            if _tab = missing value then
                make new document with properties {URL:_url}
                delay 1 -- for some reason, some delay may be required here.
                set _tab to window 1's current tab
                -- wait till document ready
                repeat until (do JavaScript js0 in _tab)
                    delay 0.5
                end repeat
            end if
            --return _tab
            
            -- (1) query word
            do JavaScript js1 in _tab
            
            -- (2) wait till result table ready
            repeat until (do JavaScript js2 in _tab)
                delay 0.5
            end repeat
            
            -- (3) collect text from result window
            return do JavaScript js3 in _tab
        end tell
    end script
    tell o to run
end query_word
--end of applescript
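
For readers unfamiliar with the DOM, here is a minimal stand-alone sketch of how the text collection in js3 works. It is not part of the script above: the helpers text, element and tableToText are hypothetical, and plain objects stand in for real DOM nodes so the snippet can run outside Safari.

```javascript
// Plain objects standing in for DOM nodes (hypothetical sample data).
// As in the DOM, nodeType 3 is a text node and nodeType 1 is an element.
function text(value) {
  return { nodeType: 3, nodeValue: value, childNodes: [] };
}
function element(tagName, children) {
  return { nodeType: 1, tagName: tagName, childNodes: children };
}

// Same idea as collectText in the script: a text node returns its own
// text; any other node returns the concatenated text of its children.
function collectText(node) {
  if (node.nodeType === 3) { return node.nodeValue; }
  var parts = [];
  for (var i = 0; i < node.childNodes.length; ++i) {
    parts.push(collectText(node.childNodes[i]));
  }
  return parts.join('');
}

// Same idea as main(): one line per row, cells separated by tabs.
function tableToText(rows) {
  return rows.map(function (tr) {
    return tr.childNodes.map(collectText).join('\t');
  }).join('\n');
}

// A made-up row: an empty cell, then two cells with nested markup.
var row = element('TR', [
  element('TD', []),
  element('TD', [element('SPAN', [text('汉'), text('语')])]),
  element('TD', [text('Hàn'), element('B', [text('yǔ')])])
]);
console.log(JSON.stringify(tableToText([row])));
```

The recursion is what flattens nested markup inside a cell into a single string; the empty first cell is why each output line starts with a tab.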

Please be nice to the site and do not abuse it with intensive automated searches.

Regards,

H

Answer 2

Queerly Author

Level 1

0 points

Feb 27, 2013 2:31 PM in response to Hiroto

Hi Hiroto,

First of all, let me assure you that you do not have to worry about abuse, since the capabilities of the human brain set a very clear search limit. There are only so many words a week you can study. I might want to make a couple of dozen searches at first to catch up on my uni workload, but on the whole, there shouldn't be more than 10-15 search words a week.

Secondly, let me say WOW! I had expected a hint here and there, a prod in the general direction, maybe even a complete answer to my question. I did not expect to find a complete solution! And this script runs so smoothly and so fast. And it delivers the results delimited by tabs, so that I can neatly format the information into a flashcard field. I can't really tell you how grateful I am.

The only drawback to this approach is that I can't make head or tail of what the JavaScript commands all do, or rather I know what they do (thank you for the comments), but not how they do it. I'm terribly ignorant of JavaScript.

There are, however, two questions I'd like to ask:

1) Is there a way to make the script ensure that all results are displayed before retrieving the table? By default the table shows 20 results, but if more sentences are in the database, a "Get more results" button is displayed.

2) Secondly, the first position in each line is a ➕ sign that lets you open each entry separately. Is there any way to get the id for this object so that I could open each card in turn and extract the mp3 file with the pronunciation, saving it under the entry's name? I was considering adding each sentence as a separate card in my flashcard programme to practice pronunciation and word usage at the same time. Chinese is an awfully difficult language for native speakers of European languages 😟 (not that you would know 😀, from my experience the Japanese are brilliant at Chinese). Having the audio file in my flashcard programme would be a great help. In fact, the site has its own free flashcard programme / iPhone app, but unfortunately the spaced repetition algorithm is terrible and not very productive, so in the end I stopped using it. Hence this attempt.

Thanks again,

Cheers!

Answer 3

Hiroto

Level 5

7,461 points

Feb 27, 2013 11:54 PM in response to Queerly

Hello

Glad to know it helped somehow.

As far as I have browsed the relevant parts of the html source of the site, I think both of your requests can be done. However, it'll take me some time to revise the script, for I have things to do. In a few days, maybe.

And yes, we Japanese are familiar with (traditional) Chinese characters because we use (a variant of) them in everyday life. But most of us cannot speak Chinese or understand spoken Chinese because the pronunciation of the characters is completely different from that in Japanese. So I'm interested in collecting the pronounced phrases as well.

Later,

H

Answer 4

Queerly Author

Level 1

0 points

Feb 28, 2013 2:46 PM in response to Hiroto

Hi,

Take your time, there's honestly no hurry. I have plenty of other vocabs/material to study, and a BA thesis on my plate, too. That's why I was thinking of turning the time-consuming task of preparing flashcards into an automated process. Your solution already gives me what I need to easily find example sentences and import them into my flashcard programme. As for getting the audio files and importing the phrases/sentences as separate cards, that'd be great, but it can definitely wait a few days or even a few weeks if need be. Your time, your call. But I really hope you will find some time some day to write the script.

I'm not sure how you want to use the material. For my purposes, in addition to downloading the MP3s, I planned to save the data for each sentence/phrase in a tab-delimited .txt file that I can then import into the flashcard programme. I thought of getting all the info available on the site (plus noting where the sentence/phrase comes from; I noticed this helps me remember the word better), i.e., saving it to a text file like this:

Simplified_Chinese & tab & Traditional_Chinese & tab & Pinyin & tab & English_Translation & tab & "TrainChinese" & return

Not sure if that would work for you.
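
As a rough sketch only, the record layout above could be produced like this in JavaScript. The function name formatCard and the property names are hypothetical, the sample values are made up, and a linefeed is used as the line ending where the AppleScript line above uses return.

```javascript
// Build one tab-delimited flashcard line in the order given above:
// simplified, traditional, pinyin, English translation, source name.
function formatCard(card) {
  var fields = [
    card.simplified,
    card.traditional,
    card.pinyin,
    card.english,
    'TrainChinese' // where the sentence/phrase comes from
  ];
  // A tab or line break inside a field would shift the columns on
  // import, so each run of them is replaced by a single space.
  var clean = fields.map(function (f) {
    return String(f == null ? '' : f).replace(/[\t\r\n]+/g, ' ').trim();
  });
  return clean.join('\t') + '\n';
}

console.log(JSON.stringify(formatCard({
  simplified: '汉语',
  traditional: '漢語',
  pinyin: 'Hànyǔ',
  english: 'Chinese language'
})));
```

Cleaning the fields before joining them is a design choice of this sketch; whether the site's data ever contains stray tabs is not something the thread confirms.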

On a different note, how long have you been learning Chinese? What level are you at? Any resources you could recommend? I know this is not the right forum for it, so feel free to PM me.

Cheers!

Queerly

Answer 5

Hiroto

Level 5

7,461 points

Mar 4, 2013 8:10 AM in response to Queerly

Hello

Sorry for late reply.

The script listed below is the first revised version; it collects all results by pressing the 'get more results' button.

Retrieving the mp3 files is yet to be done. Clicking every entry in the result table seems to be the only way to get the URL of the corresponding mp3 file. I'm going to examine it more closely.

set q to "汉语"
set r to query_word(q)
--set the clipboard to r
return r

on query_word(query)
    script o
        property _url : "http://www.trainchinese.com/v1/a_user/index.php"
        property delta : 0.2
        
        property js0 : "// document ready?
document.readyState == 'complete';
"
        property js1 : "// query word
var query = '" & query & "';
document.getElementById('searchWord').value = query;
document.getElementById('srchEnglishBtn').click();
"
        property js2 : "// table ready?
var div = document.getElementById('searchresultWindow');
div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
"
        
        property js3 : "// collect text from result table
function main() 
{
    var div = document.getElementById('searchresultWindow');
    var table = div.firstChild;                 // table element
    var trs = table.rows;                       // tr elements collection
    
    var rr = [];                                // tr text data array
    for (var i = 0; i < trs.length; ++i)
    {
        var tr = trs.item(i);                   // tr element
        var tds = tr.childNodes;                // td elements collection
        var dd = [];                            // td text data array
        for (var j = 0; j < tds.length; ++j)
        {
            var td = tds.item(j);               // td element
            dd.push(collectText(td));
        }
        rr.push( dd.join('\\t') );              // tds text delimited by tab
    }
    return rr.join('\\n');                      // trs text delimited by linefeed
}
function collectText(td)
//    Node td : HTML td element
//    return string : text data extracted from td element
{
    if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
    
    var tt = [];
    for (var i = 0; i < td.childNodes.length; ++i)
    {
        tt.push( collectText( td.childNodes.item(i) ) );
    }
    return tt.join('');
}
main();
"
        on _js2a(x)
            "// click 'get more results' button with given id x if it exists
var tr = document.getElementById('" & x & "');
if (tr != null) { tr.firstChild.firstChild.click(); }
tr != null;
"
        end _js2a
        
        on _js2a1(x)
            "// check if 'get more results' button with given id x no longer exists
var tr = document.getElementById('" & x & "');
tr == null;
"
        end _js2a1
        
        tell application "Safari"
            
            -- (0) get reference of search window (tab)
            -- get existing window (tab) if present
            set _tabs to tab 1 of windows whose URL = _url
            set _tab to missing value
            repeat with t in _tabs
                set t to t's contents
                if t is not missing value then
                    set _tab to t
                    exit repeat
                end if
            end repeat
            -- create new window (tab) if not present
            if _tab = missing value then
                make new document with properties {URL:_url}
                delay 1 -- for some reason, some delay is required here in some cases
                set _tab to window 1's current tab
                -- wait till document ready
                repeat until (do JavaScript js0 in _tab)
                    delay delta
                end repeat
            end if
            --return _tab
            
            -- (1) query word
            do JavaScript js1 in _tab
            
            -- (2) wait till result table ready
            repeat until (do JavaScript js2 in _tab)
                delay delta
            end repeat
            
            -- (2a) get more results
            (*
                'get more results' button is in a tr element with id such as -
                    sr_more0, sr_more20, sr_more40, sr_more60, ...
            *)
            repeat with i from 0 to 400 by 20 -- 400 should be large enough
                set id_ to "sr_more" & i
                if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click more button with id_
                    -- wait till result table ready
                    repeat until (do JavaScript (my _js2a1(id_)) in _tab)
                        delay delta
                    end repeat
                else -- more button with this id_ does not exist
                    exit repeat
                end if
            end repeat
            
            -- (3) collect text from result window
            return do JavaScript js3 in _tab
        end tell
    end script
    tell o to run
end query_word

Well, my level of spoken Chinese is virtually zero so far. However, I understand written Chinese rather well. Indeed, I often read very old Chinese translations of Sanskrit Buddhist texts, which were mainly produced from the 3rd through the 8th centuries. You might think it strange that someone who cannot catch spoken words can comprehend written text. But that is the magic of ideographs. 😉

That's all for now.

Hiroto

Answer 6

Hiroto

Level 5

7,461 points

Mar 8, 2013 5:55 AM in response to Queerly

Hello

My homework is almost done. 😉

Close examination of the HTML source of the site revealed that we can get the sound URLs without clicking each phrase and opening its sub-window in the result table. Each URL is passed as an argument to the function called by the onclick handler of its table row, so we can parse the URLs directly out of the table. This really helps.
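
To illustrate the idea in isolation, here is a small sketch of pulling an mp3 URL out of an onclick string with a regular expression. The function name extractMp3Url and the sample onclick text are made up for the example; the site's real handler text is not reproduced here. The pattern is the same one used in the script, where the backslash is doubled only because it sits inside an AppleScript string.

```javascript
// Return the first run of non-quote characters ending in ".mp3",
// or '' when the attribute is missing or contains no mp3 URL.
function extractMp3Url(onclick) {
  if (onclick == null) { return ''; }
  var m = onclick.match(/([^']+\.mp3)/i);
  return m ? m[0] : '';
}

// Hypothetical onclick value with the URL as a single-quoted argument.
var sample = "showDetails('12345', '../audio/abc/hanyu.mp3', 0)";
console.log(extractMp3Url(sample));
```

Because the character class excludes the single quote, the match cannot run across argument boundaries, which is what lets the URL be picked out without parsing the whole call.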

The script below will retrieve data from the result table in this form:

[] \t [simplified Chinese] \t [traditional Chinese] \t [pinyin] \t [English translation] \t [pronunciation url]

The first empty string [] comes from the null-text td element of each row in the table, which holds the [+] mark. Some entries have no [pronunciation url], in which case it will be an empty string in the result. URLs are relative to the search page.
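
A minimal sketch of reading one such line back into named fields follows. It assumes exactly the six-field layout described above; the function name parseRow, the property names and the sample lines are hypothetical.

```javascript
// Split one result line into named fields.
// Field 0 is the empty string from the [+] cell and is ignored.
function parseRow(line) {
  var f = line.split('\t');
  return {
    simplified: f[1] || '',
    traditional: f[2] || '',
    pinyin: f[3] || '',
    english: f[4] || '',
    audioUrl: f[5] || '' // empty when the entry has no pronunciation
  };
}

var withAudio = parseRow('\t汉语\t漢語\tHànyǔ\tChinese language\t../audio/x.mp3');
var noAudio = parseRow('\t你好\t你好\tnǐ hǎo\thello\t');
console.log(withAudio.audioUrl, JSON.stringify(noAudio.audioUrl));
```

If the real table contains extra cells or header rows, the indexes would need adjusting; the sketch makes no attempt to detect that.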

The current script does not download the mp3 files because there are several ways to organise them locally and I have yet to decide on one.

E.g.

a) name the file as [Chinese phrase].mp3 and put it in a [pronunciation] directory; or,

b) reproduce the original directory structure locally and there put the file with the original name.

I'm inclined to choose a), for it is simple to use and maintain. One problem with a) is that the file name can be a long Chinese string.
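
One possible way to handle scheme a), sketched under assumptions: the function name audioFileName, the optional suffix and the length limit are all hypothetical choices, not something decided in this thread.

```javascript
// Build "[Chinese phrase][suffix].mp3" for scheme a).
// Slashes and colons act as path separators on the Mac, so they are
// replaced, and long phrases are cut to maxChars characters.
function audioFileName(phrase, suffix, maxChars) {
  var safe = phrase.replace(/[\/:\u0000]/g, '_').trim();
  // Array.from splits by code point, so a character outside the
  // Basic Multilingual Plane is never cut in half.
  var chars = Array.from(safe);
  if (chars.length > maxChars) { chars = chars.slice(0, maxChars); }
  return chars.join('') + (suffix || '') + '.mp3';
}

console.log(audioFileName('汉语', '_f', 40));
```

Truncation can make two long sentences with the same beginning collide on one name, so a real implementation would probably need a tie-breaker such as a counter.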

You may decide on the scheme yourself and download the mp3 files. All the necessary information is collected by this script.

I'm going to implement my downloading script later, probably next week.

set q to "汉语"
set r to query_word(q)
--set the clipboard to r
return r

on query_word(query)
    script o
        property _url : "http://www.trainchinese.com/v1/a_user/index.php"
        property delta : 0.2
        
        property js0 : "// document ready?
document.readyState == 'complete';
"
        property js1 : "// query word
var query = '" & query & "';
document.getElementById('searchWord').value = query;
document.getElementById('srchEnglishBtn').click();
"
        property js2 : "// table ready?
var div = document.getElementById('searchresultWindow');
div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
"
        on _js2a(x)
            "// click 'get more results' button with given id x if it exists
var tr = document.getElementById('" & x & "');
if (tr != null) { tr.firstChild.firstChild.click(); }
tr != null;
"
        end _js2a
        
        on _js2a1(x)
            "// check if 'get more results' button with given id x no longer exists
var tr = document.getElementById('" & x & "');
tr == null;
"
        end _js2a1
        
        property js3 : "// collect text and mp3 urls from result table
function main() 
{
    var div = document.getElementById('searchresultWindow');
    var table = div.firstChild;                 // table element
    var trs = table.rows;                       // tr elements collection
    
    var re = /([^']+\\.mp3)/gi;                 // regexp pattern to match mp3 url
    var rr = [];                                // tr text data array
    for (var i = 0; i < trs.length; ++i)
    {
        var tr = trs.item(i);                   // tr element
        var f = tr.getAttribute('onclick');     // f = onclick function string
        var m;                                  // m = regexp match result array
        var url =                               // mp3 url
            (f != null && (m = f.match(re)) != null)
                ? m[0] 
                : '';
    
        var tds = tr.childNodes;                // td elements collection
        var dd = [];                            // td text data array
        for (var j = 0; j < tds.length; ++j)
        {
            var td = tds.item(j);               // td element
            dd.push(collectText(td));
        }
        dd.push(url);                           // append url to td text data array
        rr.push( dd.join('\\t') );              // tds data delimited by tab
    }
    return rr.join('\\n');                      // trs text delimited by linefeed
}
function collectText(td)
//    Node td : HTML td element
//    return string : text data extracted from td element
{
    if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
    
    var tt = [];
    for (var i = 0; i < td.childNodes.length; ++i)
    {
        tt.push( collectText( td.childNodes.item(i) ) );
    }
    return tt.join('');
}
main();
"
        tell application "Safari"
            
            -- (0) get reference of search window (tab)
            -- get existing window (tab) if present
            set _tabs to tab 1 of windows whose URL = _url
            set _tab to missing value
            repeat with t in _tabs
                set t to t's contents
                if t is not missing value then
                    set _tab to t
                    exit repeat
                end if
            end repeat
            -- create new window (tab) if not present
            if _tab = missing value then
                make new document with properties {URL:_url}
                delay 1 -- for some reason, some delay is required here in some cases
                set _tab to window 1's current tab
                -- wait till document ready
                repeat until (do JavaScript js0 in _tab)
                    delay delta
                end repeat
            end if
            --return _tab
            
            -- (1) query word
            do JavaScript js1 in _tab
            
            -- (2) wait till result table ready
            repeat until (do JavaScript js2 in _tab)
                delay delta
            end repeat
            
            -- (2a) get more results
            (*
                'get more results' button is in a tr element with id such as -
                    sr_more0, sr_more20, sr_more40, sr_more60, ...
            *)
            repeat with i from 0 to 400 by 20 -- 400 should be large enough
                set id_ to "sr_more" & i
                if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click more button with id_
                    -- wait till result table ready
                    repeat until (do JavaScript (my _js2a1(id_)) in _tab)
                        delay delta
                    end repeat
                else -- more button with this id_ does not exist
                    exit repeat
                end if
            end repeat
            
            -- (3) collect text and mp3 urls from result window
            return do JavaScript js3 in _tab
        end tell
    end script
    tell o to run
end query_word

Kind regards,

H

Answer 7

Queerly Author

Level 1

0 points

Mar 8, 2013 6:19 PM in response to Hiroto

Hi Hiroto,

This is mighty. I'd never have come up with that; I simply don't know enough JavaScript. I did, however, find one problem with the script in its current form. It works fine provided the search term is in the dictionary. If not, it gets locked in an infinite loop, waiting for results that do not exist to load. It'd be great if you could find the time to revise the script to check whether any results are returned. I can't do it myself, cos the script is too complex for my understanding.
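
One conceivable direction, as a sketch only: instead of a yes/no "table ready?" test, the page-side check could report three states, so that the waiting loop on the AppleScript side can stop on anything other than 'pending' (and could also give up after a fixed number of tries). What the site actually puts into searchresultWindow when a word is not found is not shown in this thread, so the 'other' state is an assumption, and resultState and fakeDiv are hypothetical names. Plain objects stand in for the DOM element so the snippet runs outside Safari.

```javascript
// Classify the result container:
//   'pending' - nothing has been inserted yet
//   'table'   - the first child is the result table
//   'other'   - something else arrived (assumed: a "no results" message)
function resultState(div) {
  if (!div || !div.hasChildNodes()) { return 'pending'; }
  return div.firstChild.tagName === 'TABLE' ? 'table' : 'other';
}

// Stand-in for document.getElementById('searchresultWindow').
function fakeDiv(children) {
  return {
    childNodes: children,
    firstChild: children.length ? children[0] : null,
    hasChildNodes: function () { return children.length > 0; }
  };
}

console.log(resultState(fakeDiv([])));
console.log(resultState(fakeDiv([{ tagName: 'TABLE' }])));
console.log(resultState(fakeDiv([{ nodeType: 3, nodeValue: 'no match' }])));
```

In Safari the last line of a 'do JavaScript' string would be the call itself, for example resultState(document.getElementById('searchresultWindow')), since the value of the last expression is what comes back to AppleScript.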

However, where I could help is getting the files downloaded. I wrote a script that saves the mp3s under their Chinese names in a separate Audio folder (the subfolder must exist in the output directory; I didn't know how to create it from a shell script, so I just did it manually). It also lets you specify a suffix for the file name (you may leave it blank if you don't want one, but I use it to distinguish between files pronounced by male and female speakers; the database is all female).

The script also performs a search on a number of new words from a text file with one word per line. It saves the results to two tab-delimited text files, one with entries with audio and a second one with entries without audio. It also extracts the part-of-speech information into a separate field and saves that as well. Additionally, the text file for entries with audio has an extra column with the entry "[sound:" & Filename.mp3 & "]", cos that's the tag my flashcard programme uses to incorporate audio files. There is also an extra check column with the status of the download, e.g. success or failed. If you find "failed" in the text file for entries without audio, it means the script failed to extract the proper URL for the mp3. That has not happened to me so far.
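
For illustration, the part-of-speech split can be sketched in JavaScript as below. The AppleScript further down does this with the extractBetween and replaceString handlers; here the function name splitPartOfSpeech and the sample strings are hypothetical, and the tag is assumed to be the first square-bracketed piece of the translation.

```javascript
// Separate a leading "[...]" tag from the rest of the translation.
function splitPartOfSpeech(translation) {
  var m = translation.match(/\[([^\]]*)\]/);
  if (!m) {
    return { partOfSpeech: '', translation: translation.trim() };
  }
  return {
    partOfSpeech: m[1],
    // replace() with a string pattern removes only the first occurrence
    translation: translation.replace(m[0], '').trim()
  };
}

console.log(splitPartOfSpeech('[noun] Chinese language'));
```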

Oh yeah, there's a dialog box asking you to specify the delay time between separate mp3 downloads. It's to simulate more human-like behaviour so that the server does not feel under attack. The default is 5 sec.

Also, more importantly, the script checks the audio folder for existing files before downloading an mp3, so that files already downloaded are skipped. This is to protect the server as well.
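
The skip-if-already-downloaded idea can be sketched like this. It is a JavaScript illustration rather than the AppleScript used below; the name makeDownloadPlanner and the sample file names are hypothetical.

```javascript
// Remember which phrases already have an audio file, based on a folder
// listing, and answer whether a phrase still needs to be downloaded.
function makeDownloadPlanner(existingFileNames, suffix) {
  var ext = suffix + '.mp3';
  var seen = new Set();
  existingFileNames.forEach(function (name) {
    // Only files named "<phrase><suffix>.mp3" count as cached audio.
    if (name.endsWith(ext)) { seen.add(name.slice(0, -ext.length)); }
  });
  return function shouldDownload(phrase) {
    if (seen.has(phrase)) { return false; }
    seen.add(phrase); // a repeat within the same run is skipped too
    return true;
  };
}

var shouldDownload = makeDownloadPlanner(['汉语_f.mp3', 'notes.txt'], '_f');
console.log(shouldDownload('汉语'), shouldDownload('你好'), shouldDownload('你好'));
```

Keeping the cache in a set means each check costs the same however many files are already in the folder.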

There's also an extra file being generated, called !Counter.txt. It's there to indicate the progress of the current task and to help you find out what the last successfully processed item was in case the script crashes or the internet connection drops.

Sorry if the script seems a bit messy. I'm not very experienced with programming. I'm sure you could make a much neater job of it. But I hated the fact that you end up doing all the work while I'm just sitting here waiting for the end result like a spoilt brat. I wanted to help out a bit too.

So, if you can correct the infinite loop problem and maybe add the automatic generation of the "Audio" subfolder in the output folder if it does not already exist there, I think we'd be done with the script. Though, of course, feel free to improve the script even further.

set NewWordList to paragraphs of (read (choose file with prompt "Open List of new files") as «class utf8»)
set OutputFolderPath to (choose folder with prompt "Open Output Folder")
set OutputAudioPath to OutputFolderPath & "Audio:" as string
set OutputLogFilePath to OutputFolderPath & "!Search_Results_with_Audio.txt" as string
set SkippedLogFilePath to OutputFolderPath & "!Search_Results_NO_Audio.txt" as string
set CounterLogFilePath to OutputFolderPath & "!!!Progress_Counter.txt" as string
set PosixOutputAudioPath to POSIX path of OutputAudioPath
--return PosixOutputAudioPath
-- create the Audio subfolder in the chosen output folder; -p makes this a no-op if it already exists
do shell script "mkdir -p " & quoted form of PosixOutputAudioPath
set StartItem to text returned of (display dialog "Set start item to:" default answer 1)

set DelayTime to text returned of (display dialog "In order to prevent misuse of the Online Dictionary and make this search seem more natural to the administrator of the domain, please choose the length of delay (in sec) between downloading each audio file." default answer 5)

set FileNameSuffix to text returned of (display dialog "Each audio file will be automatically named after the Chinese entry, but you may choose to add as suffix at the end of the filename of each file? What suffix should be added? (Leave out blank, if you do not want to modify the name at all.)" default answer "_f")

set DownloadedCache to (list folder OutputAudioPath without invisibles)
repeat with i from 1 to count of DownloadedCache
          set item i of DownloadedCache to replaceString("_f.mp3", "", item i of DownloadedCache)
end repeat
set startTime to do shell script "date +%s"
set CounterLimit to count of NewWordList

repeat with i from StartItem to CounterLimit
          set NewWord to item i of NewWordList
          set query_results to query_word(NewWord)
  
  download_results(query_results)
          set CounterEntry to CompileCounterEntry(i)
  WriteLogEntry(CounterEntry, CounterLogFilePath)
end repeat

on download_results(query_results)
          global FileNameSuffix, OutputFolderPath, OutputAudioPath, FileNameSuffix, DelayTime, OutputLogFilePath, SkippedLogFilePath, DownloadedCache
          set SearchList to paragraphs of replaceString("           ", "", query_results)
          repeat with i from 1 to count of SearchList
                    set currentItem to item i of SearchList
                    if currentItem is not "" then
                              if currentItem is not "          Chinese          Pinyin          Translation          " then
                                        set SimplifiedChinese to ""
                                        set TraditionalChinese to ""
                                        set PinyinEntry to ""
                                        set AudioDownloadURL to ""
                                        set TranslationEntry to ""
                                        set PartofSpeach to ""
                                        set DownloadStatus to ""
                                        set FileName to ""
                                        set tid to AppleScript's text item delimiters -- save them for later
                                        set AppleScript's text item delimiters to tab
                                        set currentItemCont to text items of currentItem
                                        set SimplifiedChinese to item 1 of currentItemCont
                                        set TraditionalChinese to item 2 of currentItemCont
                                        set PinyinEntry to item 3 of currentItemCont
                                        set TranslationEntry to item 4 of currentItemCont
                                        set AudioDownloadURL to last item of currentItemCont
                                        set AppleScript's text item delimiters to tid -- back to original values.
                                        set PartofSpeach to extractBetween("[", "]", TranslationEntry)
                                        set TranslationEntry to replaceString("[" & PartofSpeach & "]", "", TranslationEntry)
                                        if DownloadedCache does not contain SimplifiedChinese then
                                                  if AudioDownloadURL is not "" then
                                                            set AudioDownloadURL to "http://www.trainchinese.com/v1/" & replaceString("../", "", AudioDownloadURL)
                                                            set FileName to SimplifiedChinese & FileNameSuffix as string
                                                            set DownloadStatus to DownloadAudio(AudioDownloadURL, FileName)
                                                            set LogEntry to CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
  WriteLogEntry(LogEntry, OutputLogFilePath)
                                                            set DownloadedCache to DownloadedCache & SimplifiedChinese
  delay DelayTime
                                                  else -- no audio
                                                            set LogEntry to CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
  WriteLogEntry(LogEntry, SkippedLogFilePath)
  -- skipped
                                                  end if
                                        end if
                              end if
                    end if
  
          end repeat
end download_results



on query_word(query)
          script o
                    property _url : "http://www.trainchinese.com/v1/a_user/index.php"
                    property delta : 0.2
  
                    property js0 : "// document ready?
document.readyState == 'complete';
"
                    property js1 : "// query word
var query = '" & query & "';
document.getElementById('searchWord').value = query;
document.getElementById('srchEnglishBtn').click();
"
                    property js2 : "// table ready?
var div = document.getElementById('searchresultWindow');
div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
"
                    on _js2a(x)
                              "// click 'get more results' button with given id x if it exists
var tr = document.getElementById('" & x & "');
if (tr != null) { tr.firstChild.firstChild.click(); }
tr != null;
"
                    end _js2a
  
                    on _js2a1(x)
                              "// check if 'get more results' button with given id x no longer exists
var tr = document.getElementById('" & x & "');
tr == null;
"
                    end _js2a1
  
                    property js3 : "// collect text and mp3 urls from result table
function main() 
{
    var div = document.getElementById('searchresultWindow');
    var table = div.firstChild;                 // table element
    var trs = table.rows;                       // tr elements collection
    
    var re = /([^']+\\.mp3)/gi;                 // regexp pattern to match mp3 url
    var rr = [];                                // tr text data array
    for (var i = 0; i < trs.length; ++i)
    {
        var tr = trs.item(i);                   // tr element
        var f = tr.getAttribute('onclick');     // f = onclick function string
        var m;                                  // m = regexp match result array
        var url =                               // mp3 url
            (f != null && (m = f.match(re)) != null)
                ? m[0] 
                : '';
    
        var tds = tr.childNodes;                // td elements collection
        var dd = [];                            // td text data array
        for (var j = 0; j < tds.length; ++j)
        {
            var td = tds.item(j);               // td element
            dd.push(collectText(td));
        }
        dd.push(url);                           // append url to td text data array
        rr.push( dd.join('\\t') );              // tds data delimited by tab
    }
    return rr.join('\\n');                      // trs text delimited by linefeed
}
function collectText(td)
//    Node td : HTML td element
//    return string : text data extracted from td element
{
    if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node
    
    var tt = [];
    for (var i = 0; i < td.childNodes.length; ++i)
    {
        tt.push( collectText( td.childNodes.item(i) ) );
    }
    return tt.join('');
}
main();
"
                    tell application "Safari"
  
                              -- (0) get reference of search window (tab)
                              -- get existing window (tab) if present
                              set _tabs to tab 1 of windows whose URL = _url
                              set _tab to missing value
                              repeat with t in _tabs
                                        set t to t's contents
                                        if t is not missing value then
                                                  set _tab to t
                                                  exit repeat
                                        end if
                              end repeat
                              -- create new window (tab) if not present
                              if _tab = missing value then
                                        make new document with properties {URL:_url}
                                        delay 1 -- for some reason, some delay is required here in some cases
                                        set _tab to window 1's current tab
                                        -- wait till document ready
                                        repeat until (do JavaScript js0 in _tab)
                                                  delay delta
                                        end repeat
                              end if
                              --return _tab

                              -- (1) query word
                              do JavaScript js1 in _tab

                              -- (2) wait till result table ready
                              repeat until (do JavaScript js2 in _tab)
                                        delay delta
                              end repeat
  
                              -- (2a) get more results
                              (*
                'get more results' button is in a tr element with id such as -
                    sr_more0, sr_more20, sr_more40, sr_more60, ...
            *)
                              repeat with i from 0 to 400 by 20 -- 400 should be large enough
                                        set id_ to "sr_more" & i
                                        if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click more button with id_
                                                  -- wait till result table ready
                                                  repeat until (do JavaScript (my _js2a1(id_)) in _tab)
                                                            delay delta
                                                  end repeat
                                        else -- no more button with this id_ exists
                                                  exit repeat
                                        end if
                              end repeat
  
                              -- (3) collect text and mp3 urls from result window
                              return do JavaScript js3 in _tab
                    end tell
          end script
          tell o to run
end query_word


on replaceString(oldString, newString, TextInput)
          local ASTID, oldString, newString, lst
          set ASTID to AppleScript's text item delimiters
          try
                    considering case
                              set AppleScript's text item delimiters to oldString
                              set lst to every text item of TextInput
                              set AppleScript's text item delimiters to newString
                              set TextOutput to lst as string
                    end considering
                    set AppleScript's text item delimiters to ASTID
                    return TextOutput
          on error eMsg number eNum
                    set AppleScript's text item delimiters to ASTID
                    error "Can't replaceString: " & eMsg number eNum
          end try
end replaceString

on extractBetween(startText, endText, TextInput)
          if TextInput does not contain startText then return "" -- nothing to extract (also covers empty input)
          set tid to AppleScript's text item delimiters -- save them for later.
          set AppleScript's text item delimiters to startText -- split at the start marker.
          set endItems to text of text item -1 of TextInput -- everything after the last start marker.
          set AppleScript's text item delimiters to endText -- split at the end marker.
          set TextOutput to text of text item 1 of endItems -- everything before the first end marker.
          set AppleScript's text item delimiters to tid -- back to original values.
          return TextOutput
end extractBetween

on DownloadAudio(AudioDownloadURL, FileName)
          global OutputAudioPath
          -- use a local copy so that the global is not converted again on every call
          set audioDir to POSIX path of OutputAudioPath
          try
                    -- quoted form protects spaces, "!" and Chinese characters in the shell; -f makes curl fail on HTTP errors
                    do shell script "curl -f " & quoted form of AudioDownloadURL & " -o " & quoted form of (audioDir & FileName & ".mp3")
                    return "success"
          on error
                    return "failed"
          end try
end DownloadAudio


on WriteLogEntry(LogEntry, LogFilePath)
          set LogFile to open for access LogFilePath with write permission
          write LogEntry to LogFile starting at eof as «class utf8»
          close access LogFile
end WriteLogEntry

on CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
          if DownloadStatus is "success" then
                    return SimplifiedChinese & tab & TraditionalChinese & tab & PinyinEntry & tab & TranslationEntry & tab & PartofSpeach & tab & "[sound:" & FileName & ".mp3]" & tab & DownloadStatus & return
          else
                    return SimplifiedChinese & tab & TraditionalChinese & tab & PinyinEntry & tab & TranslationEntry & tab & PartofSpeach & tab & DownloadStatus & return
          end if
end CompileLogEntry


on CompileCounterEntry(i)
          global startTime, StartItem, CounterLimit, NewWord
          set currentTime to do shell script "date +%s"
          set timeItTook to (currentTime - startTime)
          set timeItTookInSec to timeItTook mod 3600 mod 60 as integer
          set timeItTookInMin to timeItTook mod 3600 div 60 as integer
          set timeItTookInHours to timeItTook div 3600 as integer
  
          if timeItTookInSec < 10 then
                    set timeItTookInSec to "0" & timeItTookInSec
          end if
          if timeItTookInMin < 10 then
                    set timeItTookInMin to "0" & timeItTookInMin
          end if
          set timeCounter to "" & timeItTookInHours & ":" & timeItTookInMin & ":" & timeItTookInSec
          if timeItTook is 0 then set timeItTook to 1
          set ProgressSpeed to ((i - (StartItem - 1)) / timeItTook)
          set RemainingTime to (CounterLimit - i) / ProgressSpeed
          set RemainingTime_Sec to RemainingTime mod 3600 mod 60 as integer
          set RemainingTime_Min to RemainingTime mod 3600 div 60 as integer
          set RemainingTime_Hours to RemainingTime div 3600
          set RemainingTime_Days to ((RemainingTime div 3600) div 24)
  
          if RemainingTime_Sec < 10 then
                    set RemainingTime_Sec to "0" & RemainingTime_Sec
          end if
          if RemainingTime_Min < 10 then
                    set RemainingTime_Min to "0" & RemainingTime_Min
          end if
  
          if RemainingTime_Days > 1 then
                    -- coerce the number to text first; number & text would build a list instead of a string
                    set RemainingTime_Days to (RemainingTime_Days as string) & " days "
                    set RemainingTimeCounter to RemainingTime_Days
          else
                    if RemainingTime_Days = 1 then
                              set RemainingTime_Days to (RemainingTime_Days as string) & " day " & (RemainingTime_Hours - 24) & ":" & RemainingTime_Min & ":" & RemainingTime_Sec
                              set RemainingTimeCounter to RemainingTime_Days
  
                    else
                              set RemainingTimeCounter to "" & RemainingTime_Hours & ":" & RemainingTime_Min & ":" & RemainingTime_Sec
  
                    end if
          end if
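          set PercentTimes100 to (i / CounterLimit * 10000) div 1 -- percentage in hundredths, truncated
          set PercentDisplay to "" & (PercentTimes100 div 100) & "," & text -2 thru -1 of ("0" & (PercentTimes100 mod 100)) & "%" -- zero-pad the decimals, e.g. 5,03%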
  
          set ProgressSpeedDisplay to ((((i - (StartItem - 1)) / (timeItTook / 60))) as string) & " words/min"
  
          set CounterEntry to ("Processing item " & i & " of " & CounterLimit & "   (" & PercentDisplay & ")" & return & return & NewWord & return & "Processed:" & tab & (i - (StartItem - 1)) & return & "Remaining:" & tab & (CounterLimit - i) & return & return & return & "Runtime: " & timeCounter & return & "Remaining: " & RemainingTimeCounter & return & "Speed: " & ProgressSpeedDisplay) as «class utf8»
end CompileCounterEntry

on WriteCounterEntry(CounterEntry, LogFilePath)
          set LogFile to open for access LogFilePath with write permission
          set eof of LogFile to 0 -- empty the file so that it only holds the latest counter
          write CounterEntry to LogFile starting at eof as «class utf8»
          close access LogFile
end WriteCounterEntry

Hope this works for you

Best regards,

Queerly


Answer 8

Hiroto

Level 5

7,461 points

Mar 9, 2013 2:20 AM in response to Queerly

Hello Queerly,

I had completely overlooked the possibility of a search returning a "not found" result...

Here's revised code that handles that case correctly. It also adds a time_out property that sets the maximum time to wait for each operation to complete. Currently time_out is set to 10 [sec], which means the handler throws an error when the data is not ready within 10 sec after the request is sent. The time_out feature is only there to keep the script from getting stuck in a loop in unexpected cases. When a search returns a "not found" result within the time_out period, the revised query_word() handler returns an empty string.

So the usage would be something like this:

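set query_results to missing value
try
    set query_results to query_word(q)
on error errs number errn
    -- log error
end try
if query_results is missing value then
    -- query failed; nothing to process
else if query_results = "" then
    -- log no-result
else
    -- process query_results
end if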

Here's the script.

--------------------------------

set q to "汉语"
--set q to "仏" -- not found in database
--set q to "佛"
set r to query_word(q)
--set the clipboard to r
return r

on query_word(query)
    script o
        property _url : "http://www.trainchinese.com/v1/a_user/index.php"
        property delta : 0.2 -- interval to check if data is ready [sec]
        property time_out : 10 -- time to wait for data to be ready [sec]

        property js0 : "// document ready status
// 0 = not ready; 1 = ready
document.readyState == 'complete' ? 1 : 0;
"
        property js1 : "// query word
var query = '" & query & "';
document.getElementById('searchWord').value = query;
document.getElementById('srchEnglishBtn').click();
"
        property js2 : "// table ready status
// 0 = result is not ready; 1 = table is ready; 2 = not found
var div = document.getElementById('searchresultWindow');
div.hasChildNodes() && (div.firstChild.tagName == 'TABLE') ? 1 :
div.hasChildNodes() && (div.firstChild.tagName == 'DIV')   ? 2 : 0;
"
        on _js2a(x)
            "// click 'get more results' button with given id x if it exists
var tr = document.getElementById('" & x & "');
if (tr != null) { tr.firstChild.firstChild.click(); }
tr != null;
"
        end _js2a

        on _js2a1(x)
            "// existence status of 'get more results' button with given id x
// 0 = the button exists; 1 = the button does not exist
var tr = document.getElementById('" & x & "');
tr == null ? 1 : 0;
"
        end _js2a1

        property js3 : "// collect text and mp3 urls from result table
function main() 
{
    var div = document.getElementById('searchresultWindow');
    var table = div.firstChild;                 // table element
    var trs = table.rows;                       // tr elements collection

    var re = /([^']+\\.mp3)/gi;                 // regexp pattern to match mp3 url
    var rr = [];                                // tr text data array
    for (var i = 0; i < trs.length; ++i)
    {
        var tr = trs.item(i);                   // tr element
        var f = tr.getAttribute('onclick');     // f = onclick function string
        var m;                                  // m = regexp match result array
        var url =                               // mp3 url
            (f != null && (m = f.match(re)) != null)
                ? m[0] 
                : '';

        var tds = tr.childNodes;                // td elements collection
        var dd = [];                            // td text data array
        for (var j = 0; j < tds.length; ++j)
        {
            var td = tds.item(j);               // td element
            dd.push(collectText(td));
        }
        dd.push(url);                           // append url to td text data array
        rr.push( dd.join('\\t') );              // tds data delimited by tab
    }
    return rr.join('\\n');                      // trs text delimited by linefeed
}
function collectText(td)
//    Node td : HTML td element
//    return string : text data extracted from td element
{
    if (td.nodeType == 3) { return td.nodeValue; }    // nodeType == 3 : Text Node

    var tt = [];
    for (var i = 0; i < td.childNodes.length; ++i)
    {
        tt.push( collectText( td.childNodes.item(i) ) );
    }
    return tt.join('');
}
main();
"
        tell application "Safari"

            -- (0) get reference of search window (tab)
            -- get existing window (tab) if present
            set _tabs to tab 1 of windows whose URL = _url
            set _tab to missing value
            repeat with t in _tabs
                set t to t's contents
                if t is not missing value then
                    set _tab to t
                    exit repeat
                end if
            end repeat
            -- create new window (tab) if not present
            if _tab = missing value then
                make new document with properties {URL:_url}
                delay 1 -- for some reason, some delay is required here in some cases
                set _tab to window 1's current tab
                -- wait till document ready
                set r to 0
                repeat (time_out div delta) times
                    set r to do JavaScript js0 in _tab
                    if r = 1 then exit repeat
                    delay delta
                end repeat
                if r = 0 then error "query_word(): document not ready in " & time_out & " [sec]" number 8000
            end if
            --return _tab

            -- (1) query word
            do JavaScript js1 in _tab

            -- (2) wait till result is ready
            set r to 0
            repeat (time_out div delta) times
                set r to do JavaScript js2 in _tab
                if r = 2 then return "" -- not found
                if r = 1 then exit repeat -- table ready
                delay delta
            end repeat
            if r = 0 then error "query_word(): result not ready in " & time_out & " [sec]" number 8001

            -- (2a) get more results
            (*
                'get more results' button is in a tr element with id such as -
                    sr_more0, sr_more20, sr_more40, sr_more60, ...
            *)
            repeat with i from 0 to 400 by 20 -- 400 should be large enough
                set id_ to "sr_more" & i
                if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click button with id_
                    -- wait till result table ready
                    set r to 0
                    repeat (time_out div delta) times
                        set r to do JavaScript (my _js2a1(id_)) in _tab
                        if r = 1 then exit repeat
                        delay delta
                    end repeat
                    if r = 0 then error "query_word(): table not ready in " & time_out & " [sec]" number 8002
                else -- no button with this id_ exists
                    exit repeat
                end if
            end repeat

            -- (3) collect text and mp3 urls from result window
            return do JavaScript js3 in _tab
        end tell
    end script
    tell o to run
end query_word

--------------------------------
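
In case the status codes returned by js2 are hard to follow, here is a rough sketch of the same check written as a stand-alone JavaScript function that can be tried outside Safari. The names resultStatus and mockDiv are made up for illustration and are not part of the page or of the script above. That a "not found" result shows up as a DIV is an assumption taken from the js2 comments.

```javascript
// Sketch only: resultStatus and mockDiv are hypothetical names used for
// illustration; on the real page the element comes from getElementById().
// Returns 0 = result not ready, 1 = result table ready, 2 = not found.
function resultStatus(div) {
    if (!div.hasChildNodes()) { return 0; }      // nothing loaded yet
    var tag = div.firstChild.tagName;            // undefined for text nodes
    if (tag == 'TABLE') { return 1; }            // result table is present
    if (tag == 'DIV') { return 2; }              // assumed 'not found' message
    return 0;                                    // anything else: keep waiting
}

// Minimal stand-in for a DOM element, just enough for the check above.
function mockDiv(firstChildTag) {
    var child = (firstChildTag == null) ? null : { tagName: firstChildTag };
    return {
        firstChild: child,
        hasChildNodes: function () { return child != null; }
    };
}

console.log(resultStatus(mockDiv(null)));     // empty container
console.log(resultStatus(mockDiv('TABLE')));  // result table
console.log(resultStatus(mockDiv('DIV')));    // 'not found' message
```

The AppleScript side then only has to compare the returned number: 2 returns an empty string, 1 leaves the wait loop, and 0 waits for another delta.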

And thank you for posting your entire script. It is very nice of you to be considerate of the proper use of the site. 😉

I'll peruse it later. Also, I'm not using any flashcard programme for now, but I will check some out.

Oh, and as for creating directories, it's simple. You may use "mkdir -p" like this.

set dir_path to "/Users/chinskycraze/Desktop/!NCIKU_Dictionaries/Output/Audio"
do shell script "mkdir -p " & dir_path's quoted form

The "-p" option lets it create all missing directories in the path, and it raises no error if the directory already exists.

Best wishes from Japan,

Hiroto
