Data extraction from HTML using AppleScript.

I attempted some data extraction using AppleScript found here and elsewhere, with whatever modifications I could figure out, then gave up on it for a while. But this is still bugging me.


Here's what the HTML source looks like.


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


A lot of unneeded data to start.


Portfolio Header Data


<h2 class="red"><a href="http://www.stockresearch.com/my/portfolio/portfolio1" class="red">Portfolio #1</a></h2>


Begin Stock Data


<div class="row lightgrey">


<div class="company"><a href="http://www.stockresearch.com/my/portfolio/stock-detail/324/" class="red">Company #1</a></div>


<div class="sym-price"> <a href="javascript:goQuote1('T.ABC');">T.ABC</a><br>$12.87</div>


<div class="sym-price"> <a href="javascript:goQuote1('ABC');">ABC</a><br>$13.05</div>


<div class="rec"> <a style="cursor: pointer;" id="recTip_4" onMouseOver="bnToolTip.showRec(this.id)" onMouseOut="bnToolTip.hideRec();">Best B; B 1st tranche at $13, 2nd at $12</a></div>


<div class="last-updated">01/08/2013</div>


<div class="add-to-my-stocks"><input type="checkbox" name="stock[]" value="324"/></div> <div class="clr"></div>


End Stock Data


A lot of unneeded data to end.


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


The data I hope to extract is bolded and blued above.


Here's how the data is grouped.


One <h2> (Portfolio Header Data),


followed by 1 to 50 groups of Stock Data (each running from class="company" to class="last-updated").


ETC.



The above will repeat five times, though it might be more in the future.


At least some of my prior attempts compiled, but errored out when run.


Extraction to a CSV file would be ideal, though a text file or similar is just fine too.


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


How can I proceed with this? Any help is much appreciated.

MacBook Pro (15-inch Late 2008), OS X Mountain Lion (10.8.2), 2.4 GHz CPU, 8 GB RAM, 250 GB HD

Posted on Jan 16, 2013 12:44 AM


Jan 16, 2013 8:31 AM in response to MattSh

I'd tend to avoid AppleScript here, and look around for existing tools that can scrape web pages. Here and here are some related discussions.


These existing scraping tools also tend to be better at either adapting to page changes or allowing the scraper to be reworked easily, given web pages do tend to change.


It's also common for web sites to offer an API for retrieving information without resorting to scraping, and that'd usually be more effective than scraping the page. This is because the API avoids the overhead of scraping on the target site, and reduces the effects of HTML layout changes.


Rather than AppleScript, Python and some other programming tools also tend to be much better at processing the vagaries of HTML, with their available libraries and tools. AppleScript is centered on scripting GUI applications on OS X, and isn't really intended for nor provisioned as a generic HTML-ready programming tool.
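
As a rough sketch of what that might look like in Python, using only the standard library and assuming the page source is already available as a string: the class names below come from the sample HTML in the question, while StockParser, html_to_csv and the CSV column names are made up for illustration. It treats each class="company" div as the start of a stock and each class="last-updated" div as the end, so it would need adjusting if the real page differs from the sample.

```python
import csv
import io
from html.parser import HTMLParser

# Classes of the <div> elements that hold the wanted values (from the sample page).
FIELD_CLASSES = {"company", "sym-price", "rec", "last-updated"}


class StockParser(HTMLParser):
    """Collects one row per stock: portfolio, company, quotes, rec, last-updated."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.portfolio = ""
        self.field = None   # name of the element whose text is being collected
        self.current = {}   # field name -> list of text fragments for this stock

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "h2" and css_class == "red":
            self.field = "portfolio"
            self.portfolio = ""
        elif tag == "div" and css_class in FIELD_CLASSES:
            if css_class == "company":
                self.current = {}   # a company div starts a new stock
            self.field = css_class

    def handle_data(self, data):
        text = data.strip()
        if not text or self.field is None:
            return
        if self.field == "portfolio":
            self.portfolio += text
        else:
            self.current.setdefault(self.field, []).append(text)

    def handle_endtag(self, tag):
        if tag == "h2" and self.field == "portfolio":
            self.field = None
        elif tag == "div" and self.field not in (None, "portfolio"):
            if self.field == "last-updated":   # last field of a stock: emit the row
                self.rows.append([
                    self.portfolio,
                    " ".join(self.current.get("company", [])),
                    " ".join(self.current.get("sym-price", [])),
                    " ".join(self.current.get("rec", [])),
                    " ".join(self.current.get("last-updated", [])),
                ])
            self.field = None


def html_to_csv(html):
    """Parse the page source and return the extracted rows as CSV text."""
    parser = StockParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["portfolio", "company", "quotes", "recommendation", "last_updated"])
    writer.writerows(parser.rows)
    return out.getvalue()
```

Calling html_to_csv on the saved page source and writing the result to a file would give the CSV. The csv module quotes fields that contain commas, which matters here because the recommendation text does.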


Jan 16, 2013 9:33 AM in response to MrHoffman

Can you recommend one that is well documented, covering not only how to use the program, but also how to set it up and how to make sure any other needed components are in place? I did try something like this over a year ago, but I never got it to run, and there wasn't any documentation of note to explain what went wrong or how to troubleshoot it.


Thanks.


Jan 16, 2013 9:33 AM in response to MattSh

As mentioned above, AppleScript is poorly equipped to deal with this - mostly because the browsers implement really poor AppleScript support for walking through HTML.


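The best bet if you do want an AppleScript solution is to simply have AppleScript call a JavaScript function to extract the data. Something like (untested):


tell application "Safari"

-- getElementsByClassName is a method, so it takes parentheses; join the text of each match into one string

set companyNames to do JavaScript "Array.prototype.map.call(document.getElementsByClassName('company'), function (el) { return el.textContent; }).join('\\n')" in document 1

end tell


which should return the company names one per line, so that "paragraphs of companyNames" gives a list that is easier to work with in AppleScript.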


Jan 16, 2013 10:19 AM in response to MattSh

Ah, all ye of little AppleScript faith...


Without going into details (on the way out the door, sorry), the approach I'd take here is to add in the following procedure:


set {oldTID, my text item delimiters} to {my text item delimiters, {"<", ">"}}

set htmlBits to text items of htmlSource

set my text item delimiters to oldTID

The variable htmlBits will be a list of all the various HTML code elements and data. You can step through it searching for keywords and taking the following item as the value. For example:


repeat with i from 1 to count of htmlBits

-- match only the company links; the portfolio header link also contains the site URL

if item i of htmlBits contains "/stock-detail/" then

set companyName to item (i + 1) of htmlBits

end if


-- …

end repeat

It's not pretty, but it's not difficult.


I wish System Events had an HTML parser along with its XML and property list parsers, but we make do.
