Hi Hiroto,
This is mighty. I'd never have come up with that; I simply don't know enough JavaScript. I found, however, one problem with the script in its current form. It works fine provided the search term is in the dictionary. If it is not, the script gets locked in an infinite loop, waiting for results that do not exist to load. It'd be great if you could find the time to revise the script to check whether any results are returned. I can't do it myself, cos the script is too complex for my understanding.
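Just to show the kind of check I mean, here is a rough sketch of a wait loop that gives up after a time limit instead of waiting forever. The handler name waitUntil, the script object with its isReady handler, and the maxWait and delta parameters are all made up for illustration. None of them are part of your script, and I have not worked out how to fit this into the Safari part.

```applescript
-- Hypothetical helper: poll a condition, but give up after maxWait seconds.
-- checker is any script object with an isReady() handler (an assumption for this sketch).
on waitUntil(checker, maxWait, delta)
	set waited to 0
	repeat until (checker's isReady())
		if waited >= maxWait then return false -- timed out, stop waiting
		delay delta
		set waited to waited + delta
	end repeat
	return true -- the condition became true in time
end waitUntil

-- Stand-in for a search whose results never load.
script neverReady
	on isReady()
		return false
	end isReady
end script

waitUntil(neverReady, 1, 0.2) -- gives up after about one second and returns false
```

If the wait handler returned false, the main loop could log the word as "not found" and move on to the next one instead of hanging.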
However, where I could help is getting the files downloaded. I wrote a script that will save the mp3s in a separate Audio folder under their Chinese names (the subfolder must exist in the output directory; I didn't know how to create it per shell script, so I just did it manually). It will also allow you to specify a suffix for the file name (you may leave it blank if you don't want one, but I use it to distinguish between files pronounced by male and female speakers - the database is all female).
The script will also perform a search on a number of new words from a text file with one word per line. It will save the results to two tab-delimited text files, one with entries with audio and a second one with entries without audio. It will also extract the part of speech information into a separate field and save it as well. Additionally, the text file for entries with audio files will have an extra column with the following entry "[sound:" & Filename.mp3 & "]", cos that's the tag my flashcard program uses to incorporate audio files. There is also an extra check column with the status of the download, e.g. success or failed. If you find "failed" in the text file for entries without audio, then it means the script failed at extracting the proper URL for the mp3. That has not happened to me so far.
Oh yeah, there's a dialog box asking you to specify the delay time between separate mp3 downloads. It's to simulate more human-like behaviour so that the server does not feel under attack. The default time is 5 sec.
Also, more importantly, before downloading an mp3 the script checks the Audio folder for already existing files, so that files that have already been downloaded are skipped. It's to protect the server as well.
There's also an extra file being generated, called !!!Progress_Counter.txt. It's there to indicate the progress of the current task and to help you find out what the last successfully processed item was in case the script crashes or the internet connection drops.
Sorry if the script seems a bit messy. I'm not very experienced with programming. I'm sure you could make a much neater job of it. But I hated the fact that you end up doing all the work while I'm just sitting here waiting for the end result like a spoilt brat. I wanted to help out a bit too.
So, if you can correct the infinite loop problem and maybe add the automatic generation of the "Audio" subfolder in the output folder if it does not already exist there, I think we'd be done with the script. Though, of course, feel free to improve the script even further.
set NewWordList to paragraphs of (read (choose file with prompt "Open List of new files") as «class utf8»)
set OutputFolderPath to (choose folder with prompt "Open Output Folder")
set OutputAudioPath to OutputFolderPath & "Audio:" as string
set OutputLogFilePath to OutputFolderPath & "!Search_Results_with_Audio.txt" as string
set SkippedLogFilePath to OutputFolderPath & "!Search_Results_NO_Audio.txt" as string
set CounterLogFilePath to OutputFolderPath & "!!!Progress_Counter.txt" as string
set PosixOutputAudioPath to POSIX path of OutputAudioPath
--return PosixOutputAudioPath
-- create the Audio subfolder if needed (-p: no error if it already exists)
do shell script "mkdir -p " & quoted form of PosixOutputAudioPath
set StartItem to (text returned of (display dialog "Set start item to:" default answer "1")) as integer
set DelayTime to (text returned of (display dialog "In order to prevent misuse of the Online Dictionary and make this search seem more natural to the administrator of the domain, please choose the length of delay (in sec) between downloading each audio file." default answer "5")) as number
set FileNameSuffix to text returned of (display dialog "Each audio file will be automatically named after the Chinese entry, but you may choose to add a suffix at the end of each file name. What suffix should be added? (Leave blank if you do not want to modify the name at all.)" default answer "_f")
set DownloadedCache to (list folder OutputAudioPath without invisibles)
repeat with i from 1 to count of DownloadedCache
set item i of DownloadedCache to replaceString(FileNameSuffix & ".mp3", "", item i of DownloadedCache) -- strip the chosen suffix, not a hard-coded "_f"
end repeat
set startTime to do shell script "date +%s"
set CounterLimit to count of NewWordList
repeat with i from StartItem to CounterLimit
set NewWord to item i of NewWordList
set query_results to query_word(NewWord)
download_results(query_results)
set CounterEntry to CompileCounterEntry(i)
WriteLogEntry(CounterEntry, CounterLogFilePath)
end repeat
on download_results(query_results)
global FileNameSuffix, OutputFolderPath, OutputAudioPath, DelayTime, OutputLogFilePath, SkippedLogFilePath, DownloadedCache
set SearchList to paragraphs of replaceString(" ", "", query_results)
repeat with i from 1 to count of SearchList
set currentItem to item i of SearchList
if currentItem is not "" then
if currentItem is not " Chinese Pinyin Translation " then
set SimplifiedChinese to ""
set TraditionalChinese to ""
set PinyinEntry to ""
set AudioDownloadURL to ""
set TranslationEntry to ""
set PartofSpeach to ""
set DownloadStatus to ""
set FileName to ""
set tid to AppleScript's text item delimiters -- save them for later
set AppleScript's text item delimiters to tab
set currentItemCont to text items of currentItem
set SimplifiedChinese to item 1 of currentItemCont
set TraditionalChinese to item 2 of currentItemCont
set PinyinEntry to item 3 of currentItemCont
set TranslationEntry to item 4 of currentItemCont
set AudioDownloadURL to last item of currentItemCont
set AppleScript's text item delimiters to tid -- back to original values.
set PartofSpeach to extractBetween("[", "]", TranslationEntry)
set TranslationEntry to replaceString("[" & PartofSpeach & "]", "", TranslationEntry)
if DownloadedCache does not contain SimplifiedChinese then
if AudioDownloadURL is not "" then
set AudioDownloadURL to "http://www.trainchinese.com/v1/" & replaceString("../", "", AudioDownloadURL)
set FileName to SimplifiedChinese & FileNameSuffix as string
set DownloadStatus to DownloadAudio(AudioDownloadURL, FileName)
set LogEntry to CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
WriteLogEntry(LogEntry, OutputLogFilePath)
set DownloadedCache to DownloadedCache & SimplifiedChinese
delay DelayTime
else -- no audio
set LogEntry to CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
WriteLogEntry(LogEntry, SkippedLogFilePath)
-- skipped
end if
end if
end if
end if
end repeat
end download_results
on query_word(query)
script o
property _url : "http://www.trainchinese.com/v1/a_user/index.php"
property delta : 0.2
property js0 : "// document ready?
document.readyState == 'complete';
"
property js1 : "// query word
var query = '" & query & "';
document.getElementById('searchWord').value = query;
document.getElementById('srchEnglishBtn').click();
"
property js2 : "// table ready?
var div = document.getElementById('searchresultWindow');
div.hasChildNodes() && (div.firstChild.tagName == 'TABLE');
"
on _js2a(x)
"// click 'get more results' button with given id x if it exists
var tr = document.getElementById('" & x & "');
if (tr != null) { tr.firstChild.firstChild.click(); }
tr != null;
"
end _js2a
on _js2a1(x)
"// check if 'get more results' button with given id x no longer exists
var tr = document.getElementById('" & x & "');
tr == null;
"
end _js2a1
property js3 : "// collect text and mp3 urls from result table
function main()
{
var div = document.getElementById('searchresultWindow');
var table = div.firstChild; // table element
var trs = table.rows; // tr elements collection
var re = /([^']+\\.mp3)/gi; // regexp pattern to match mp3 url
var rr = []; // tr text data array
for (var i = 0; i < trs.length; ++i)
{
var tr = trs.item(i); // tr element
var f = tr.getAttribute('onclick'); // f = onclick function string
var m; // m = regexp match result array
var url = // mp3 url
(f != null && (m = f.match(re)) != null)
? m[0]
: '';
var tds = tr.childNodes; // td elements collection
var dd = []; // td text data array
for (var j = 0; j < tds.length; ++j)
{
var td = tds.item(j); // td element
dd.push(collectText(td));
}
dd.push(url); // append url to td text data array
rr.push( dd.join('\\t') ); // tds data delimited by tab
}
return rr.join('\\n'); // trs text delimited by linefeed
}
function collectText(td)
// Node td : HTML td element
// return string : text data extracted from td element
{
if (td.nodeType == 3) { return td.nodeValue; } // nodeType == 3 : Text Node
var tt = [];
for (var i = 0; i < td.childNodes.length; ++i)
{
tt.push( collectText( td.childNodes.item(i) ) );
}
return tt.join('');
}
main();
"
tell application "Safari"
-- (0) get reference of search window (tab)
-- get existing window (tab) if present
set _tabs to tab 1 of windows whose URL = _url
set _tab to missing value
repeat with t in _tabs
set t to t's contents
if t is not missing value then
set _tab to t
exit repeat
end if
end repeat
-- create new window (tab) if not present
if _tab = missing value then
make new document with properties {URL:_url}
delay 1 -- for some reason, some delay is required here in some cases
set _tab to window 1's current tab
-- wait till document ready
repeat until (do JavaScript js0 in _tab)
delay delta
end repeat
end if
--return _tab
-- (1) query word
do JavaScript js1 in _tab
-- (2) wait till result table ready
repeat until (do JavaScript js2 in _tab)
delay delta
end repeat
-- (2a) get more results
(*
'get more results' button is in a tr element with id such as -
sr_more0, sr_more20, sr_more40, sr_more60, ...
*)
repeat with i from 0 to 400 by 20 -- 400 should be large enough
set id_ to "sr_more" & i
if (do JavaScript (my _js2a(id_)) in _tab) then -- try to click more button with id_
-- wait till result table ready
repeat until (do JavaScript (my _js2a1(id_)) in _tab)
delay delta
end repeat
else -- more button with this id_ not exist
exit repeat
end if
end repeat
-- (3) collect text and mp3 urls from result window
return do JavaScript js3 in _tab
end tell
end script
tell o to run
end query_word
on replaceString(oldString, newString, TextInput)
local ASTID, oldString, newString, lst
set ASTID to AppleScript's text item delimiters
try
considering case
set AppleScript's text item delimiters to oldString
set lst to every text item of TextInput
set AppleScript's text item delimiters to newString
set TextOutput to lst as string
end considering
set AppleScript's text item delimiters to ASTID
return TextOutput
on error eMsg number eNum
set AppleScript's text item delimiters to ASTID
error "Can't replaceString: " & eMsg number eNum
end try
end replaceString
on extractBetween(startText, endText, TextInput)
if TextInput is "" then return TextInput
set tid to AppleScript's text item delimiters -- save them for later.
set AppleScript's text item delimiters to startText -- find the first one.
set endItems to text of text item -1 of TextInput -- everything after the last startText.
set AppleScript's text item delimiters to endText -- find the end one.
set TextOutput to text of text item 1 of endItems -- get the first part.
set AppleScript's text item delimiters to tid -- back to original values.
return TextOutput
end extractBetween
on DownloadAudio(AudioDownloadURL, FileName)
global OutputAudioPath
-- use a local variable so the global HFS path is not overwritten on every call
set PosixAudioPath to POSIX path of OutputAudioPath
try
-- quoted form protects spaces, "!" and Chinese characters; --fail makes curl report HTTP errors
do shell script "curl --fail " & quoted form of AudioDownloadURL & " -o " & quoted form of (PosixAudioPath & FileName & ".mp3")
return "success"
on error
return "failed"
end try
end DownloadAudio
on WriteLogEntry(LogEntry, LogFilePath)
set LogFile to open for access LogFilePath with write permission
write LogEntry to LogFile starting at eof as «class utf8»
close access LogFile
end WriteLogEntry
on CompileLogEntry(SimplifiedChinese, TraditionalChinese, PinyinEntry, TranslationEntry, PartofSpeach, FileName, DownloadStatus)
if DownloadStatus is "success" then
return SimplifiedChinese & tab & TraditionalChinese & tab & PinyinEntry & tab & TranslationEntry & tab & PartofSpeach & tab & "[sound:" & FileName & ".mp3]" & tab & DownloadStatus & return
else
return SimplifiedChinese & tab & TraditionalChinese & tab & PinyinEntry & tab & TranslationEntry & tab & PartofSpeach & tab & DownloadStatus & return
end if
end CompileLogEntry
on CompileCounterEntry(i)
global startTime, StartItem, CounterLimit, NewWord
set currentTime to do shell script "date +%s"
set timeItTook to (currentTime - startTime)
set timeItTookInSec to timeItTook mod 3600 mod 60 as integer
set timeItTookInMin to timeItTook mod 3600 div 60 as integer
set timeItTookInHours to timeItTook div 3600 as integer
if timeItTookInSec < 10 then
set timeItTookInSec to "0" & timeItTookInSec
end if
if timeItTookInMin < 10 then
set timeItTookInMin to "0" & timeItTookInMin
end if
set timeCounter to "" & timeItTookInHours & ":" & timeItTookInMin & ":" & timeItTookInSec
if timeItTook is 0 then set timeItTook to 1
set ProgressSpeed to ((i - (StartItem - 1)) / timeItTook)
set RemainingTime to (CounterLimit - i) / ProgressSpeed
set RemainingTime_Sec to RemainingTime mod 3600 mod 60 as integer
set RemainingTime_Min to RemainingTime mod 3600 div 60 as integer
set RemainingTime_Hours to RemainingTime div 3600
set RemainingTime_Days to ((RemainingTime div 3600) div 24)
if RemainingTime_Sec < 10 then
set RemainingTime_Sec to "0" & RemainingTime_Sec
end if
if RemainingTime_Min < 10 then
set RemainingTime_Min to "0" & RemainingTime_Min
end if
if RemainingTime_Days > 1 then
set RemainingTime_Days to RemainingTime_Days & " days "
set RemainingTimeCounter to RemainingTime_Days
else
if RemainingTime_Days = 1 then
set RemainingTime_Days to RemainingTime_Days & " day " & (RemainingTime_Hours - 24) & ":" & RemainingTime_Min & ":" & RemainingTime_Sec
set RemainingTimeCounter to RemainingTime_Days
else
set RemainingTimeCounter to "" & RemainingTime_Hours & ":" & RemainingTime_Min & ":" & RemainingTime_Sec
end if
end if
set PercentDisplay to "" & (i / CounterLimit * 100) * 100 div 100 & "," & (i / CounterLimit * 100) * 100 mod 100 * 100 div 100 & "%"
set ProgressSpeedDisplay to ((((i - (StartItem - 1)) / (timeItTook / 60))) as string) & " words/min"
set CounterEntry to ("Processing item " & i & " of " & CounterLimit & " (" & PercentDisplay & ")" & return & return & NewWord & return & "Processed:" & tab & (i - (StartItem - 1)) & return & "Remaining:" & tab & (CounterLimit - i) & return & return & return & "Runtime: " & timeCounter & return & "Remaining: " & RemainingTimeCounter & return & "Speed: " & ProgressSpeedDisplay) as «class utf8»
end CompileCounterEntry
on WriteCounterEntry(CounterEntry, LogFilePath)
set LogFile to open for access LogFilePath with write permission
set eof of LogFile to 0 --emptying file contents if needed
write CounterEntry to LogFile starting at eof as «class utf8»
close access LogFile
end WriteCounterEntry
Hope this works for you
Best regards,
Queerly