Can AppleScript Automate Website Data Entry?

Question

Level 1

0 points

Greets all,

I'm a grad student in bioinformatics and I am trying to come up with a way to save the members of my lab from long hours (days, months) of mind-numbing, repetitive data entry. I already know some Perl and that helps a lot, but for some of the things we do, it just doesn't cut it. So, I bought AppleScript: The Definitive Guide yesterday in the hopes that it can save us.

Essentially, I'd like to get TextEdit, Perl, and Safari to talk to one another via AppleScript, and maybe Microsoft Word, too. I'm sure this is possible from what little reading I've done so far, but I wanted to ask whether one particular thing is possible with Safari (or any web browser, for that matter).

Basically, there are a few websites that we must visit to get certain bits of information on the human and other genomes. This involves going to the website, pasting text into a field on the page, and clicking a button (essentially a "submit" button). Beyond that, we also need to "click" a link that is provided and copy some text out of the page that link goes to. There may also be a few drop-down menus that it would be groovy to automate as well, but these are less critical, as I think they can be set as defaults and left unchanged for large numbers of runs.

I would appreciate it if anyone who has done a lot of AppleScript could let me know whether this type of thing is even feasible before I spend too many days trying to learn enough of the basics to realize for myself that I'm trying to do the impossible.

Thanks much!

Dual 2.7 GHz PowerPC G5 w/ 4 GB DDR SDRAM Mac OS X (10.4.6) Really, OS X (10.4.7), but that's not an option from the pull down menu.

Posted on Jun 30, 2006 12:56 PM

Answer 1

Cyclosaurus

Level 6

12,915 points

Jun 30, 2006 1:15 PM in response to Baka

Questions:

➊ What is the form submit method? cgi, php, asp, javascript?
➋ What is the format of the result page?
➌ What do you want to do with the result web page?

Answer 2

Baka Author

Level 1

0 points

Jun 30, 2006 2:10 PM in response to Cyclosaurus

First, thanks Cyclosaurus for replying.

➊ What is the form submit method? cgi, php, asp, javascript?

I'm not really familiar with web programming. A friend in the lab tells me it's cgi.

➋ What is the format of the result page?

Friend again says cgi.

➌ What do you want to do with the result web page?

Well, there will be a series of three web pages, all part of a service called BLAT that lets you query genomes for matches to a sequence of DNA you have. The first has a text field I want to copy text into (which I'll get from a text file probably created by TextEdit) and then click the "Submit" button. Then the second page will load with a list of links. I will want to click the top one of these, which will take me to the third page. On this third page, I will want to copy text from it and paste this text into TextEdit (or just a plain text file so I can chew on it in Perl).

If it would be helpful for you to see the pages in question, I'll provide some links and instructions below:

First page: http://genome.ucsc.edu/cgi-bin/hgBlat
I want to go here, paste some text (a DNA sequence) into the text field, and click the "submit" button just below the text field.

Second page: the URL of the second page is the same as the first ... so you'll need to actually put some DNA sequence data into the field to see it. I tried to post a picture I got with Grab, but it doesn't seem to have worked. See below for some text you could post. Post this text into the text field of the first page, then hit "submit". What loads will be a page with lots of links down the left side. Clicking on the top link that says "details" gets you to the third page I'm interested in.

Third page: the URL of the third page, which is different for each submission, is as follows: http://genome.ucsc.edu/cgi-bin/hgc?o=66367814&g=htcUserAli&i=../trash/hgSsgenome_8969_1151700234.pslx..%2Ftrash%2FhgSs_genome_89691151700234.faYourSeq&c=chr17&l=66367814&r=66370436&db=rheMac2&hgsid=74421853

I have no idea if this will work by the time you try it. If it does, or if you use the data I've provided below to actually run a sequence yourself, you'll see a page that has two long DNA sequences and some other stuff like alignments and position information. I would want to copy the sequences out of this page and into a text file, though if I had to copy all the text instead of just the sequences, that's fine too as I could write a Perl script that got rid of everything else.

Okay, here's a little bit of DNA code you should be able to copy-and-paste into the first BLAT page.

GGTAGCTTCTTTTGCTGTGCAGAAGCTCTTTAGTTTAATTAGATCCCATT
TGTCAATGTTGGCTTTAGTTGCCATTGCTTTTGGTGTTTTAGACATGAAG
TCCTTGCCCATGCCTATGTCCTGAATGGTATTGCCTAAGTTTTCTTCTAG
GGTTTTTATGGTTTTAGGTCTAACATTTAAGTCTCTAATCCATCTCAAAT
TAATTTTTGTATAAAGTGTAAGGAAGGGATCCAGTTTCAGCTTTCTACTT
ATGGCTAGCCAGTTTTTCCAGCACCATTTATTAAATAGGGAATCCTTTCC
CCATTTCTTATTTTTGTCAGGTTTGTCAAAGATCAGATCATTGTAGATGT
ATGGTATTATTTCTGAGGGCTCTGTTCTGTTCCATTGGTCTATATCTCTG
TTTCGGTATCAGTACCATGCTGTTTTGGTTACTGTAGCCTTGTAGTATAG
TTTGAAGTCAGGTAGCGTGATGCCTCCAGCTTTGTTCTTTTGGCTTAGAA
TTGTCTTGCCAATGCAGGCTCTTTTTTGGTTCCATATGGACTTTAAAGTA
GTTTTTTCTAATTCTGTGAAGAAAGTCATTGGTAGCTTAATGGGGATGAC
ATTGAACCCTATAAATTACTTTGGGCAGTATGGCCACTTTCACAATATTG

Dual 2.7 GHz PowerPC G5 w/ 4 GB DDR SDRAM Mac OS X (10.4.6)

Answer 3

Cyclosaurus

Level 6

12,915 points

Jun 30, 2006 4:21 PM in response to Baka

It seems to be do-able with AS and shell.

The first page posts its data to a CGI program, which requires a set of parameters: hgsid, org, db, type, sort, output and userSeq.
You can use the shell's curl command to submit the data. But it seems like you would be re-inventing the wheel here, because you would have to write an AppleScript (or AppleScript Studio app) wrapper to choose the parameters.

The second page is the result of the first page. It has a zillion links, but there are tools to extract them painlessly. Do you want data from ALL the links, or just the top-scoring link?
You need links (a Perl script) to extract the links.

The third page is pretty much text with some HTML tags for formatting.
You can use Safari to grab the text data.

You need to get the links Perl script. Go here: file:///System/Library/Automator/Get Link URLs from Webpages.action
Control-click it --> Show Package Contents --> Contents --> Resources
Then copy links to the /usr/local/bin directory.

Now, open "http://genome.ucsc.edu/cgi-bin/hgBlat" with Safari, paste in your data, submit --> second page, leave the second page in front, then run the following script:

-- begin
set AppleScript's text item delimiters to {""}
set base_URL to "http://genome.ucsc.edu"

-- Safari to get source of the score results page
tell application "Safari" to set results_source to source of document 1

-- copy it to /tmp so we can parse it
do shell script "/bin/echo " & quoted form of results_source & " > /tmp/results_source.txt"
set _links to do shell script "/usr/local/bin/links file:///tmp/results_source.txt"
set result_links to {}
-- start parsing for hgc? links
repeat with _paragraph in paragraphs of _links
if _paragraph contains "hgc?" then
set AppleScript's text item delimiters to {"file:///"}
copy {(base_URL & last text item of _paragraph) as string} to end of result_links
end if
end repeat

set AppleScript's text item delimiters to {""}

tell application "Safari"
-- open the first link, result_links can be in a repeat loop for all links
open location ((item 1 of result_links) as string)
delay 2
-- get the sequence data
set text_result to text of document "User Sequence vs Genomic"
end tell
return text_result
-- end

You can use it as a base to expand on.
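If you would rather not depend on the links script, the link-extraction step can also be sketched with standard shell tools. This is only a rough sketch, assuming the results page has already been saved to a file and that the detail links appear in the source as HREF="../cgi-bin/hgc?..." with plain & characters. The function name extract_hgc_links is made up for illustration.

```sh
#!/bin/sh
# Print absolute URLs for the "hgc?" detail links found in a saved results page.
# Assumption: links look like HREF="../cgi-bin/hgc?..." in the page source.
base_url="http://genome.ucsc.edu"

extract_hgc_links() {
    # $1 = path to the saved HTML, $2 = maximum number of links (default 1)
    grep -io 'href="[^"]*hgc?[^"]*"' "$1" |
        sed -e 's/^[^"]*"//' -e 's/"$//' -e 's#^\.\./#/#' |
        head -n "${2:-1}" |
        while IFS= read -r path; do
            # Turn the site-relative path into a full URL.
            printf '%s%s\n' "$base_url" "$path"
        done
}
```

Called as extract_hgc_links /tmp/results_source.txt 5, it would print up to five URLs, one per line, which an AppleScript could read back through do shell script.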

Answer 4

Baka Author

Level 1

0 points

Jun 30, 2006 4:36 PM in response to Cyclosaurus

Wow! Thanks for the wealth of information, Cyclosaurus. I'm not sure I completely understand it all right now, but I'll set about looking up the things you've given me.

One quick question about this comment of yours:

But it seems like you try to re-invent the wheel
here. Because you have to write an AppleScript (or
AppleScript Studio app) wrapper to choose the
parameter.

Is there a way to do this without re-inventing the wheel? And, if that is the way you have detailed, then great! I just want to make sure you understand that I don't want to re-invent the wheel if I don't have to here. If there are already existing Perl scripts and Applescripts that can accomplish it, that's precisely what I'm looking for.

And now, to answer your question:

Second page is the result of first page, it has
zillion links, there tools to extract them
painlessly. but, do you want data from ALL the links?
or just the top score link?

Nope. I really only should need the top link. In the future, it might be nice to be able to get, say, the top three or four links ... but currently, the top link is what we go with 99% of the time.

Thanks again! Let me know if my answer to your "re-inventing the wheel" comment above changes any of your suggestions! I'll start trying to implement this tomorrow and let you know how it goes then.

Dual 2.7 GHz PowerPC G5 w/ 4 GB DDR SDRAM Mac OS X (10.4.6)

Answer 5

Cyclosaurus

Level 6

12,915 points

Jun 30, 2006 6:02 PM in response to Baka

Is there a way to do this without re-inventing the
wheel? And, if that is the way you have detailed,
then great! I just want to make sure you understand
that I don't want to re-invent the wheel if I don't
have to here. If there are already existing Perl
scripts and Applescripts that can accomplish it,
that's precisely what I'm looking for.

Well actually, the question is:

Will the parameters change in your DNA sequence search? Except for the DNA sequence itself.

In the first page: http://genome.ucsc.edu/cgi-bin/hgBlat , there are five other settings besides your DNA sequence: Genome, Assembly, Query type, Sort output and Output type.
If those parameters are fixed (the only change is the DNA sequence), then there is a better way to get what you want, with AppleScript and shell.
If those parameters change from time to time, then (I think) my previous sample is better. AS is not designed for heavy GUI work, so it isn't well suited to building a GUI app.

Nope. I really only should need the top link. In
the future, it might be nice to be able to get, say,
the top three or four links ... but currently, the
top link is what we go with 99% of the time.

My previous sample picks the first result in the search; it can easily be modified to pick the top five results.

Answer 6

dev_sleidy

Level 4

1,570 points

Jul 1, 2006 6:27 AM in response to Baka

'... letting me know if this type of thing is even feasible ...' - yes.

The following code works for me:

set bURL01 to "http://genome.ucsc.edu"
set bURL02 to bURL01 & "/cgi-bin/hgBlat"

if ((my verify_DNA(the clipboard)) = 0) then
tell application "Safari"
activate

make new document -- Make a new document.

set (URL of document 1) to bURL02 -- Go to the web page.

repeat -- Wait until the web page has completely loaded.
if ((name of document 1) is "Human BLAT Search") then exit repeat
end repeat
delay 1 -- Wait a second.

end tell

tell application "System Events" -- Required to emulate human selection of web page items.
tell process "Safari" -- Indicates which process to focus automation upon.

repeat 6 times -- Locate and select the 'textarea' field.
keystroke tab using shift down
end repeat

keystroke (the clipboard) -- Type the contents of the clipboard into the 'textarea' field.
delay 1 -- Wait a second.

-- Automate the clicking of the 'submit' button.
tell application "Safari" to (do JavaScript "document.mainForm.submit()" in document 1)

end tell
quit -- 'System Events' is a CPU hog. Kill it when you can, until Apple fixes it.
end tell

tell application "Safari"
activate

repeat -- Wait until the web page has completely loaded.
if ((name of document 1) is "Human BLAT Search") then exit repeat
end repeat
delay 6 -- Wait six seconds.

set tSource to get (source of document 1) -- Re-read the source of document 1 so it reflects the refreshed contents.

-- Locate first URL.
set search_String01 to ">browser</A> <A HREF=\".."
set offset01 to ((offset of search_String01 in tSource) + (count search_String01))
set offset02 to ((offset of "\">details" in tSource) - 1)

-- Display results web page.
set (URL of document 1) to (bURL01 & (get (characters (offset01) through (offset02) in tSource) as string))
end tell
else
say "DNA error"
end if

-- Return 0 if the text contains only A, C, G, T and line breaks; otherwise return -1.
on verify_DNA(local_String)
set tList to {"A", "C", "G", "T", ASCII character 10, ASCII character 13}
repeat with i in (characters of local_String)
if (tList does not contain (contents of i)) then return -1
end repeat
return 0
end verify_DNA

-----

Note: A valid DNA sequence must be in the 'clipboard' for the above code to work properly.
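For sequences that live in text files rather than on the clipboard, the same kind of check can be sketched in the shell. This is a minimal sketch, not part of the script above: it follows the same rule (only A, C, G, T plus line breaks are accepted), and the name clean_dna is made up for illustration.

```sh
#!/bin/sh
# Strip line breaks from a sequence and check that only A, C, G, T remain.
clean_dna() {
    # $1 = raw sequence text; prints the joined sequence on success.
    cleaned=$(printf '%s' "$1" | tr -d '\r\n')
    case "$cleaned" in
        '' | *[!ACGT]*) return 1 ;;   # empty, or contains some other character
    esac
    printf '%s\n' "$cleaned"
}
```

For example, clean_dna "$(cat seq1.txt)" would print the sequence as one line, or return a non-zero status if the file holds anything else (seq1.txt is a placeholder file name).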

Mac OS X (10.4.4)

Answer 7

Baka Author

Level 1

0 points

Jul 1, 2006 2:18 PM in response to Cyclosaurus

Well actually, the question is:

Will the parameters change in you DNA sequence
search? Except for the DNA sequence itself.

Right now, I think the only thing that would ever need to be varied is the first drop-down menu on the left, called "Genome". That occasionally will need to be changed from Human to Rhesus or Chimp. I'm going to ask the other fellow I'm working with on this project, in case he can foresee a reason why we absolutely "must" be able to change those other parameters on the fly.

However, I'm pretty sure the website "remembers" the previous settings. So, it's possible that we could do a run of all the sequences we needed from the Human genome, then manually change that one menu to, say, Rhesus, and do all the runs we needed with the Rhesus genome.

So, basically, if we can change it on the fly, then that's awesome. But, if that would entail a lot of extra work and re-inventing of wheels, then I don't think that it would be terrible if we just did separate runs of the script with manual setting of the other parameters on the page before we hit "go".

Answer 8

Baka Author

Level 1

0 points

Jul 1, 2006 2:22 PM in response to dev_sleidy

The following code works for me:

Thanks so much, dev. You and Cyclosaurus have given me a ton of stuff to work with. Took me longer to get into the lab today than I thought, so I haven't gotten into trying to get it working on my machine yet, but you've already both been great!

I'll post as soon as I've gotten things ironed out to let you know how it goes.

Answer 9

Cyclosaurus

Level 6

12,915 points

Jul 1, 2006 3:53 PM in response to Baka

Well, if Genome is the only requirement that changes, then it's easy enough; otherwise, you will have to step through a series of dialog boxes to set the other parameters. Try this: it will get the top five results and open them in Safari.
Of course the results can be saved directly to files, but you would need to provide info about those files, such as file naming, etc.

-- begin
set AppleScript's text item delimiters to {""}
set base_URL to "http://genome.ucsc.edu"

-- this is Genome pick list, you can add more into this list
set genome_list to {"Human", "Chimp", "Rhesus"}
set _genome to (choose from list genome_list without multiple selections allowed) as string
-- paste in DNA seq, you won't see the whole string, but it's all there
set pasted_text to text returned of (display dialog "Paste DNA Sequence here:" default answer "")
set DNA_seq to ""
-- get rid of return
repeat with _paragraph in paragraphs of pasted_text
set DNA_seq to (DNA_seq & _paragraph) as string
end repeat
-- return {_genome, DNA_seq}

-- download the result links; any of these parameters can be changed manually
do shell script "/usr/bin/curl 'http://genome.ucsc.edu/cgi-bin/hgBlat' -d hgsid=74425287 -d sort='query,score' -d output=hyperlink -d type=\"BLAT's guess\" -d hg18=\"Mar. 2006\" -d org=" & _genome & " -d userSeq=" & quoted form of DNA_seq & " -o /tmp/results_source.txt"

-- get result links in the result page
set _links to do shell script "/usr/local/bin/links file:///tmp/results_source.txt"
set result_links to {}
set result_count to 0
-- stop parsing links when we hit the limit; this determines how many results you get
set result_limit to 5 -- change if you want more/less results
-- start parsing for hgc? links
repeat with _paragraph in paragraphs of _links
if _paragraph contains "hgc?" then
set result_count to result_count + 1
set AppleScript's text item delimiters to {"file:///"}
copy {(base_URL & last text item of _paragraph) as string} to end of result_links
set AppleScript's text item delimiters to {""}
if result_count is equal to result_limit then exit repeat
end if
end repeat

set AppleScript's text item delimiters to {""}

-- this last part open the result pages in Safari
-- but you can also use curl to direct save the files
tell application "Safari"
repeat with result_link in result_links
make new document
set (URL of document 1) to (result_link as string)
set wait_flag to ""
repeat until wait_flag is "complete"
delay 1
set wait_flag to do JavaScript "document.readyState" in document 1
end repeat
end repeat
-- get the last result data, this is not necessary
set text_result to text of document 1
end tell
return text_result
--end
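Pulled out of AppleScript, the submission step on its own might look like the sketch below. It is only a sketch under a few assumptions: it uses curl's --data-urlencode option, which needs curl 7.18.0 or later and may not exist in the curl shipped with older systems; it leaves out hgsid on the assumption that the server hands out a session id by itself; and blat_submit and the CURL override are names made up for illustration.

```sh
#!/bin/sh
# Submit one sequence to the BLAT form and save the results page.
# CURL may be set to another command for a dry run; the default is the real curl.
blat_submit() {
    # $1 = genome (e.g. Human), $2 = DNA sequence, $3 = output file
    "${CURL:-curl}" -s 'http://genome.ucsc.edu/cgi-bin/hgBlat' \
        --data-urlencode "org=$1" \
        --data-urlencode "type=BLAT's guess" \
        --data-urlencode "sort=query,score" \
        --data-urlencode "output=hyperlink" \
        --data-urlencode "userSeq=$2" \
        -o "$3"
}
```

Letting curl do the URL encoding avoids hand-quoting values such as "BLAT's guess". A call would look like blat_submit Human "$sequence" /tmp/results_source.txt.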

Answer 10

Baka Author

Level 1

0 points

Jul 6, 2006 9:46 AM in response to Cyclosaurus

I'm having an error at this line in Cyclosaurus's second program (above):

set _links to do shell script "/usr/local/bin/links file:///tmp/results_source.txt"

This is the error message:

sh: line 1: /usr/local/bin/links: No such file or directory

Seems like I could possibly fix this by simply making a "links" folder inside the "bin" directory?

Answer 11

Baka Author

Level 1

0 points

Jul 6, 2006 9:53 AM in response to dev_sleidy

Hey dev_sleidy,

I'm getting some strange Safari error when I run the script you gave me. If you have a second, glance at it and tell me what I've done wrong. Thanks for your help.

Basically, when I run the script, it opens the page and goes to the Human BLAT Search, but then it stalls there for a little bit and then I get this error page in the Safari window:

Safari can’t open the page “ http://genome.ucsc.edu"-//W3C//DTD%20HTML%203.2//EN">%0A<HTML>%0A<HEAD>%0A%0A%09%0A%09<META%20HTTP-EQU IV="Content-Type"%20CONTENT="text/html;CHARSET=iso-8859-1">%0A%09<META%20http-eq uiv="Content-Script-Type"%20content="text/javascript">%0A%09<TITLE>%0AHuman%20BL AT%20Search%09</TITLE>%0A%09<LINK%20REL="STYLESHEET"%20HREF="/style/HGStyle.css" >%0A%0A</HEAD>%0A<BODY%20BGCOLOR="#FFF9D2"%20LINK="0000CC"%20VLINK="#330066"%20A LINK="#6600FF">%0A<A%20NAME="TOP">%0A%0A<TABLE%20BORDER=0%20CELLPADDING=0%20CELL SPACING=0%20WIDTH="100%">%0A%0A%0A<TR><TD%20COLSPAN=3%20HEIGHT=40%20>%0A<table%20bgcolor="#000000"%20cell padding="1"%20cellspacing="1"%20width="100%%"%20height="27">%0A<tr%20bgcolor="#2 636D1"><td%20valign="middle">%0A%09<table%20BORDER=0%20CELLSPACING=0%20CELLPADDI NG=0%20bgcolor="#2636D1"%20height="24"><TR>%0A%09%20%09<TD%20VALIGN="middle"><fo nt%20color="#89A1DE"> %0A%0A <A%20HREF="/index.html?org=Human&db=hg18&hgsid=7461 5948"%20class="topbar">%0A%20%20%20%20%20%20%20%20%20%20%20Home%20 %0A%20%20%2 0%20%20%20%20<A%20HREF="/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=74615948"%20c lass="topbar">%0A%20%20%20%20%20%20%20%20%20%20%20Genomes%20 %0A%20%20%20%20%2 0%20%20<A%20HREF="/cgi-bin/hgTables?org=Human&db=hg18&hgsid=74615948&hgta_doMain Page=1"%20class="topbar">%0A%20%20%20%20%20%20%20%20%20%20%20Tables%20 %0A%20% 20%20%20%20%20%20<A%20HREF="/cgi-bin/hgNear?org=Human&db=hg18&hgsid=74615948"%20 class="topbar">%0A%20%20%20%20%20%20%20%20%20%20%20Gene%20Sorter%20 %0A%20%20% 20%20%20%20%20<A%20HREF="/cgi-bin/hgPcr?org=Human&db=hg18&hgsid=74615948"%20clas s="topbar">%0A%20%20%20%20%20%20%20%20%20%20%20PCR%20 %0A%20%20%20%20%20%20%20 <A%20HREF="/FAQ/"%20class="topbar">%0A%20%20%20%20%20%20%20%20%20%20%20FAQ%20 %0A%0A%20%20%20%20%20%20%20<A%20HREF="/goldenPath/help/hgTracksHelp.html#BLATAli gn"%0A%20%20%20%20%20%20%20class="topbar">%0A%20%20%20%20%20%20%20%20%20%20%20He lp%20%0A 
</TD>%0A%20%20%20%20%20%20%20</TR></TABLE>%0A</TD></TR></TABLE>%0A</TD> </TR>%09%0A%0A%0A%0A<TR><TD%20COLSPAN=3>%09%0A%20%20%09%0A%20%20%09<TABLE%20WIDTH="100%"%20BGCOLOR="#888888"%20BORDER="0 "%20CELLSPACING="0"%20CELLPADDING="1"><TR><TD>%09%0A%20%20%20%20<TABLE%20BGCOLOR ="#FFFEE8"%20WIDTH="100%"%20%20BORDER="0"%20CELLSPACING="0"%20CELLPADDING="0"><T R><TD>%09%0A%09<TABLE%20BGCOLOR="#D9E4F8"%20BACKGROUND="/images/hr.gif"%20WIDTH= "100%"><TR><TD>%0A%09%09<FONT%20SIZE="4"> %0AHuman%20BLAT%20Search</TD></TR></TABLE>%0A%09<TABLE%20BGCOLOR="#FFFEE8"%20WIDTH="100%"%20CELLPADDING= 0><TR><TH%20HEIGHT=10></TH></TR>%0A%09<TR><TD%20WIDTH=10> </TD><TD>%0A%09%0A%0A< FORM%20ACTION="../cgi-bin/hgBlat"%20METHOD="POST"%20ENCTYPE="multipart/form-data "%20NAME="mainForm">%0A

BLAT%20Search%20Genome

%0A<INPUT%20TYPE=HIDDEN%20NAME="hgsid"%20VALUE="74615948"><TABLE%20BORDER=0%20WI DTH=80>%0A<TR>%0A%0A<TD%20ALIGN=CENTER>Genome:</TD><TD%20ALIGN=CENTER>Assembly:< /TD><TD%20ALIGN=CENTER>Query%20type:</TD><TD%20ALIGN=CENTER>Sort%20output:</TD>< TD%20ALIGN=CENTER>Output%20type:</TD><TD%20ALIGN=CENTER>&nbsp</TD></TR>%0A<TR>%0 A<TD%20ALIGN=CENTER>%0A<SELECT%20NAME="org"%20onchange="document.orgForm.org.val ue%20=%20%20document.mainForm.org.options[document.mainForm.org.selectedIndex].v alue;%20%20document.orgForm.seqFile.value%20=%20document.mainForm.seqFile.value; %20%20document.orgForm.userSeq.value%20=%20document.mainForm.userSeq.value;%20%2 0document.orgForm.db.value%20=%200;%20%20document.orgForm.submit();">%0A<OPTION% 20SELECTED%20VALUE="Human">Human</OPTION>%0A<OPTION%20VALUE="Chimp">Chimp</OPTIO N>%0A<OPTION%20VALUE="Rhesus">Rhesus</OPTION>%0A<OPTION%20VALUE="Dog">Dog</OPTIO N>%0A<OPTION%20VALUE="Cow">Cow</OPTION>%0A<OPTION%20VALUE="Mouse">Mouse</OPTION> %0A<OPTION%20VALUE="Rat">Rat</OPTION>%0A<OPTION%20VALUE="Opossum">Opossum</OPTIO N>%0A<OPTION%20VALUE="Chicken">Chicken</OPTION>%0A<OPTION%20VALUE="X.%20tropical is">X.%20tropicalis</OPTION>%0A<OPTION%20VALUE="Zebrafish">Zebrafish</OPTION>%0A <OPTION%20VALUE="Tetraodon">Tetraodon</OPTION>%0A<OPTION%20VALUE="Fugu">Fugu</OP TION>%0A<OPTION%20VALUE="C.%20intestinalis">C.%20intestinalis</OPTION>%0A<OPTION %20VALUE="D.%20melanogaster">D.%20melanogaster</OPTION>%0A<OPTION%20VALUE="D.%20 simulans">D.%20simulans</OPTION>%0A<OPTION%20VALUE="D.%20sechellia">D.%20sechell ia</OPTION>%0A<OPTION%20VALUE="D.%20yakuba">D.%20yakuba</OPTION>%0A<OPTION%20VAL UE="D.%20erecta">D.%20erecta</OPTION>%0A<OPTION%20VALUE="D.%20ananassae">D.%20an anassae</OPTION>%0A<OPTION%20VALUE="D.%20persimilis">D.%20persimilis</OPTION>%0A <OPTION%20VALUE="D.%20pseudoobscura">D.%20pseudoobscura</OPTION>%0A<OPTION%20VAL UE="D.%20virilis">D.%20virilis</OPTION>%0A<OPTION%20VALUE="D.%20mojavensis">D.%2 
0mojavensis</OPTION>%0A<OPTION%20VALUE="D.%20grimshawi">D.%20grimshawi</OPTION>% 0A<OPTION%20VALUE="A.%20mellifera">A.%20mellifera</OPTION>%0A<OPTION%20VALUE="A. %20gambiae">A.%20gambiae</OPTION>%0A<OPTION%20VALUE="C.%20elegans">C.%20elegans< /OPTION>%0A<OPTION%20VALUE="C.%20briggsae">C.%20briggsae</OPTION>%0A<OPTION%20VA LUE="S.%20purpuratus">S.%20purpuratus</OPTION>%0A<OPTION%20VALUE="S.%20cerevisia e">S.%20cerevisiae</OPTION>%0A<OPTION%20VALUE="SARS">SARS</OPTION>%0A</SELECT>%0 A</TD>%0A<TD%20ALIGN=CENTER>%0A<SELECT%20NAME="db">%0A<OPTION%20SELECTED%20VALUE ="hg18">Mar.%202006</OPTION>%0A<OPTION%20VALUE="hg17">May%202004</OPTION>%0A<OPT ION%20VALUE="hg16">July%202003</OPTION>%0A<OPTION%20VALUE="hg15">Apr.%202003</OP TION>%0A</SELECT>%0A</TD>%0A<TD%20ALIGN=CENTER>%0A<SELECT%20NAME="type"%20class= normalText>%0A<OPTION%20SELECTED>BLAT's%20guess</OPTION>%0A<OPTION>DNA</OPTION>% 0A<OPTION>protein</OPTION>%0A<OPTION>translated%20RNA</OPTION>%0A<OPTION>transla ted%20DNA</OPTION>%0A</SELECT>%0A</TD>%0A<TD%20ALIGN=CENTER>%0A<SELECT%20NAME="s ort"%20class=normalText>%0A<OPTION%20SELECTED>query,score</OPTION>%0A<OPTION>que ry,start</OPTION>%0A<OPTION>chrom,score</OPTION>%0A<OPTION>chrom,start</OPTION>% 0A<OPTION>score</OPTION>%0A</SELECT>%0A</TD>%0A<TD%20ALIGN=CENTER>%0A<SELECT%20N AME="output"%20class=normalText>%0A<OPTION%20SELECTED>hyperlink</OPTION>%0A<OPTI ON>psl</OPTION>%0A<OPTION>psl%20no%20header</OPTION>%0A</SELECT>%0A</TD>%0A</TR> %0A<TR>%0A<TD%20COLSPAN=5%20ALIGN=CENTER>%0A<TEXTAREA%20NAME=userSeq%20ROWS=14%2 0COLS=80></TEXTAREA>%0A%0A</TD>%0A</TR>%0A<TR>%0A<TD%20COLSPAN=5%20ALIGN=CENTER> %0A<INPUT%20TYPE=SUBMIT%20NAME=Submit%20VALUE=submit>%0A<INPUT%20TYPE=SUBMIT%20N AME=Lucky%20VALUE="I'm%20feeling%20lucky">%0A<INPUT%20TYPE=RESET%20NAME=Reset%20 VALUE=clear>%0A</TD>%0A</TR>%0A<TR>%0A<TD%20COLSPAN=5%20WIDTH="100%">%0APaste%20 in%20a%20query%20sequence%20to%20find%20its%20location%20in%20the%0Athe%20genome 
.%20Multiple%20sequences%20may%20be%20searched%20%0Aif%20separated%20by%20lines% 20starting%20with%20'>'%20followed%20by%20the%20sequence%20name.%0A</TD>%0A</TR> %0A%0A<TR><TD%20COLSPAN=5%20WIDTH="100%">%0A%0A
File%20Upload:%20%0ARather%20than%20pasting%20a%20sequence,%20you%20can%20choose%20to%20upload %20a%20text%20file%20containing%20the%20sequence.
%0AUpload%20sequence:%20<INPUT%20TYPE=FILE%20NAME="seqFile">%0A%20<INPUT%20TYPE= SUBMIT%20Name=Submit%20VALUE="submit%20file">

%0A%0A

Only%20DNA%20sequences%20of%2025,000%20or%20fewer%20bases%20and%20protein%20or%2 0translated%20%0Asequence%20of%2010000%20or%20fewer%20letters%20will%20be%20proc essed.%20%20Up%20to%2025%20sequences%0Acan%20be%20submitted%20at%20the%20same%20 time.%20The%20total%20limit%20for%20multiple%20sequence%0Asubmissions%20is%2050, 000%20bases%20or%2025,000%20letters.%0A

For%20locating%20PCR%20primers,%20use%20<A%20HREF="../cgi-bin/hgPcr?db=hg18">In- Silico%20PCR%20for%20best%20results%20instead%20of%20BLAT.

</TD></TR></TABLE>%0A%0A</FORM>%0A<FORM%20ACTION="../cgi-bin/hgBlat"%20METHOD="P OST"%20NAME="orgForm"><input%20type="hidden"%20name="db"%20value="">%0A<input%20 type="hidden"%20name="org"%20value="">%0A<input%20type="hidden"%20name="userSeq" %20value="">%0A<input%20type="hidden"%20name="showPage"%20value="true">%0A<input %20type="hidden"%20name="seqFile"%20value="">%0A<INPUT%20TYPE=HIDDEN%20NAME="hgs id"%20VALUE="74615948"></FORM>%0A%0A%09</TD><TD%20WIDTH=15></TD></TR></TABLE>%0A %09
</TD></TR></TABLE>%0A%09</TD></TR></TABLE>%0A%09%0A%0A
%0A%0A%20%20%09%0A%20%20%09<T ABLE%20WIDTH="100%"%20BGCOLOR="#888888"%20BORDER="0"%20CELLSPACING="0"%20CELLPAD DING="1"><TR><TD>%09%0A%20%20%20%20<TABLE%20BGCOLOR="#FFFEE8"%20WIDTH="100%"%20% 20BORDER="0"%20CELLSPACING="0"%20CELLPADDING="0"><TR><TD>%09%0A%09<TABLE%20BGCOL OR="#D9E4F8"%20BACKGROUND="/images/hr.gif"%20WIDTH="100%"><TR><TD>%0A%09%09<FONT %20SIZE="4"> %20%0AAbout%20BLAT%09</TD></TR></TABLE>%0A%09<TABLE%20BGCOLOR="#FFFEE8"%20WIDTH="100%"%20CELLPADDING= 0><TR><TH%20HEIGHT=10></TH></TR>%0A%09<TR><TD%20WIDTH=10> </TD><TD>%0A%0A%0A

BLAT%20on%20DNA%20is%20designed%20to%0Aquickly%20find%20sequences%20of%2095%%20a nd%20greater%20similarity%20of%20length%2040%20bases%20or%0Amore.%20%20It%20may% 20miss%20more%20divergent%20or%20shorter%20sequence%20alignments.%20%20It%20will %20find%0Aperfect%20sequence%20matches%20of%2033%20bases,%20and%20sometimes%20fi nd%20them%20down%20to%2020%20bases.%0ABLAT%20on%20proteins%20finds%20sequences%2 0of%2080%%20and%20greater%20similarity%20of%20length%2020%20amino%0Aacids%20or%2 0more.%20%20In%20practice%20DNA%20BLAT%20works%20well%20on%20primates,%20and%20p rotein%0Ablat%20on%20land%20vertebrates.%0A

BLAT%20is%20not%20BLAST.%20%20DNA%20BLAT%20works%20by%20keeping%20an%20index%20o f%20the%20entire%20genome%0Ain%20memory.%20%20The%20index%20consists%20of%20all% 20non-overlapping%2011-mers%20except%20for%0Athose%20heavily%20involved%20in%20r epeats.%20%20The%20index%20takes%20up%20a%20bit%20less%20than%0Aa%20gigabyte%20o f%20RAM.%20%20The%20genome%20itself%20is%20not%20kept%20in%20memory,%20allowing% 0ABLAT%20to%20deliver%20high%20performance%20on%20a%20reasonably%20priced%20Linu x%20box.%0AThe%20index%20is%20used%20to%20find%20areas%20of%20probable%20homolog y,%20which%20are%20then%0Aloaded%20into%20memory%20for%20a%20detailed%20alignmen t.%20Protein%20BLAT%20works%20in%20a%20similar%0Amanner,%20except%20with%204-mer s%20rather%20than%2011-mers.%20%20The%20protein%20index%20takes%20a%20little%0Am ore%20than%202%20gigabytes

%0A

BLAT%20was%20written%20by%20<A%20HREF="mailto:kent@soe.ucsc.edu">Jim%20Kent.%0ALike%20most%20of%20Jim's%20software,%20interactive%20use%20on%20 this%20web%20server%20is%20free%20to%20all.%0ASources%20and%20executables%20to%2 0run%20batch%20jobs%20on%20your%20own%20server%20are%20available%20free%0Afor%20 academic,%20personal,%20and%20non-profit%20purposes.%20%20Non-exclusive%20commer cial%0Alicenses%20are%20also%20available.%20%20Contact%20Jim%20for%20details.

%0A%0A%0A%09</TD><TD%20WIDTH=15></TD></TR></TABLE>%0A%09
</TD></TR></TABLE>%0A%09</TD></TR></TABLE>%0A%09%0A</TD></TR></TABLE>%0A%0A</BOD Y></HTML>” because it can’t find the server “genome.ucsc.edu"-”.

Answer 12

Cyclosaurus

Level 6

12,915 points

Jul 6, 2006 10:22 AM in response to Baka

You need to extract the links command from the Automator action. The instructions are in my second post, fifth paragraph down.
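A quick way to confirm the command is in place before running the script again is a check like the one below. It is a small sketch with a made-up function name, check_links. The point is that the script needs an executable file at that path, so a folder named links would not help.

```sh
#!/bin/sh
# Report whether the links command is installed as an executable file.
check_links() {
    # $1 = path to test (defaults to the path the scripts in this thread call)
    tool=${1:-/usr/local/bin/links}
    if [ -f "$tool" ] && [ -x "$tool" ]; then
        echo "ok: $tool"
    else
        echo "missing or not executable: $tool" >&2
        return 1
    fi
}
```

If the file is there but not executable, chmod +x on it is the usual fix.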

Answer 13

Baka Author

Level 1

0 points

Jul 6, 2006 1:50 PM in response to Cyclosaurus

Excellent! It's working like a charm, Cyclosaurus, thank you!

I completely forgot about the links command. And, I remember reading the
instructions when you first gave it to me, too. Oh well.

Now, the thing I need to do is automate the input and output. In other words, imagine I have a folder with 100 text files, each with a single DNA sequence. Can I automate the following: (1) open the first file in the folder, (2) copy the text, (3) put it into the BLAT code you gave me above, (4) catch the text of the resulting page, and (5) paste this text into a new text file?

I would want to then run some Perl scripts on this new file that I've written ...

So, this is possible, yes?

I appreciate how much time you've put into this so far very much. You're the
best!

Dual 2.7 GHz PowerPC G5 w/ 4 GB DDR SDRAM Mac OS X (10.4.6)

Answer 14

Cyclosaurus

Level 6

12,915 points

Jul 6, 2006 5:47 PM in response to Baka

I modified the script so it does the DNA lookup in a repeat loop. What it does:

1) Prompt for the location of the folder that holds the DNA files. --> DNA
2) Prompt for Genome.
3) Create a new folder for results. --> DNA_results
4) For each DNA file, create a subfolder in the DNA_results folder. --> DNA_results/DNA_file
5) Output the top-ranked sequences to the DNA_file folder. --> DNA_results/DNA_file/1_DNA_file, 2_DNA_file, etc.

Here it is, give it a try:
-- begin
on run
-- locate DNA files folder
set DNA_folder to choose folder with prompt "Please locate DNA files folder"
tell application "Finder"
-- get files in DNA files folder
set DNA_list to items of DNA_folder
-- make results folder base on DNA files folder
set folder_name to name of DNA_folder
set folder_container to container of DNA_folder
if not (exists item ((folder_container as string) & folder_name & "_results")) then
set result_folder to make new folder at folder_container with properties {name:(folder_name & "_results")}
else
set result_folder to item ((folder_container as string) & folder_name & "_results")
end if
end tell
with timeout of (8 * hours) seconds -- change this if need more time
return my DNA_query(DNA_list, result_folder)
end timeout
end run

on DNA_query(DNA_list, result_folder)
set AppleScript's text item delimiters to {""}
set base_URL to "http://genome.ucsc.edu"

-- this is Genome pick list, you can add more into this list
set genome_list to {"Human", "Chimp", "Rhesus"}
set _genome to (choose from list genome_list without multiple selections allowed) as string

repeat with DNA_file in DNA_list
-- make sub folder in results folder for result files
tell application "Finder"
-- get filename for subfolder and result files, base on this file name
set file_name to name of DNA_file
if not (exists (item file_name) of result_folder) then
set sub_folder to make new folder at result_folder with properties {name:file_name}
else
set sub_folder to item file_name of result_folder
end if
end tell

set DNA_text to read file (DNA_file as string)
set DNA_seq to ""
-- get rid of return
repeat with _paragraph in paragraphs of DNA_text
set DNA_seq to (DNA_seq & _paragraph) as string
end repeat
-- return {_genome, DNA_seq}

-- download the result links; any of these parameters can be changed manually
do shell script "/usr/bin/curl 'http://genome.ucsc.edu/cgi-bin/hgBlat' -d hgsid=74425287 -d sort='query,score' -d output=hyperlink -d type=\"BLAT's guess\" -d hg18=\"Mar. 2006\" -d org=" & _genome & " -d userSeq=" & quoted form of DNA_seq & " -o /tmp/results_source.txt"

-- get result links in the result page
set _links to do shell script "/usr/local/bin/links file:///tmp/results_source.txt"
set result_links to {}
set result_count to 0
-- stop parsing links when we hit the limit; this determines how many results you get
set result_limit to 5 -- change if you want more/less results
-- start parsing for hgc? links
repeat with _paragraph in paragraphs of _links
if _paragraph contains "hgc?" then
set result_count to result_count + 1
set AppleScript's text item delimiters to {"file:///"}
copy {(base_URL & last text item of _paragraph) as string} to end of result_links
set AppleScript's text item delimiters to {""}
if result_count is equal to result_limit then exit repeat
end if
end repeat

set AppleScript's text item delimiters to {""}

-- this last part open the result pages in Safari
-- but you can also use curl to direct save the files
set i to 1
tell application "Safari"
repeat with result_link in result_links
make new document
set (URL of document 1) to (result_link as string)
set wait_flag to ""
repeat until wait_flag is "complete"
delay 1
set wait_flag to do JavaScript "document.readyState" in document 1
end repeat
set text_result to text of document 1
delay 2
-- write to result file
do shell script "/bin/echo " & quoted form of text_result & " > " & quoted form of (POSIX path of ((sub_folder as string) & i & "_" & file_name))
set i to i + 1
close window 1
end repeat
end tell
end repeat
end DNA_query
--end
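The folder layout the script produces can be previewed from the shell before a long run. This is just a sketch of the naming scheme described above (a results folder beside the input folder, one subfolder per DNA file, result files prefixed with their rank); result_path and plan_batch are made-up names, and nothing is submitted or written.

```sh
#!/bin/sh
# Print the result file path for one sequence file and one rank:
#   <folder>_results/<file name>/<rank>_<file name>
result_path() {
    # $1 = sequence file, $2 = rank (1 = top hit)
    dir=$(dirname "$1")
    name=$(basename "$1")
    printf '%s_results/%s/%s_%s\n' "$dir" "$name" "$2" "$name"
}

# List every result path a batch run over a folder would create.
plan_batch() {
    # $1 = folder of sequence files, $2 = number of hits to keep per file
    for f in "$1"/*; do
        [ -f "$f" ] || continue      # skip subfolders and anything that is not a file
        rank=1
        while [ "$rank" -le "$2" ]; do
            result_path "$f" "$rank"
            rank=$((rank + 1))
        done
    done
}
```

Running plan_batch ~/DNA 5 would list five paths per sequence file, which makes it easy to see how many result files a folder of 100 sequences will generate.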

Answer 15

Baka Author

Level 1

0 points

Aug 25, 2006 2:23 PM in response to Baka

I just wanted to hop back on and let you, Cyclosaurus and dev_sleidy, know that I really appreciate your help. I have been very busy and haven't quite gotten all the bugs worked out, but your suggestions have been a source of great assistance. Thanks!
