46 Replies Latest reply: Oct 7, 2013 12:13 PM by Mark Jalbert
cdworin Level 1 (0 points)

I'd like to know if it's possible for Applescript to do the following:

 

Take a text file with 600+ URLs, compare it to another text file with 100,000+ URLs, and output any URL from the first, smaller list that is NOT included in the second, larger list?

 

If it is possible to do this, is there anyone out there willing to create such a script for a modest fee?

  • Tony T1 Level 6 (8,685 points)

    set f1 to POSIX path of (choose file with prompt "Select smaller file to compare:" default location alias (the path to desktop folder as text))

    set f2 to POSIX path of (choose file with prompt "Select larger file to compare:" default location alias (the path to desktop folder as text))

    do shell script "comm -2 -3" & space & quoted form of f1 & space & quoted form of f2 & space & "> ~/Desktop/diff.txt"

     

     

    No charge

  • Tony T1 Level 6 (8,685 points)

    Just remembered that the above will only work if both files are sorted.

    This will sort the files into temporary files "sorted1.txt" and "sorted2.txt" in ~/Library/Caches/TemporaryItems/ and then delete the temp files after the comparison.

    The result will be output to your Desktop as "diff.txt"

     

     

    set f1 to POSIX path of (choose file with prompt "Select smaller file to compare:" default location alias (the path to desktop folder as text))

    set f2 to POSIX path of (choose file with prompt "Select larger file to compare:" default location alias (the path to desktop folder as text))

    set t1 to (POSIX path of (path to "temp" from user domain)) & "sorted1.txt"

    set t2 to (POSIX path of (path to "temp" from user domain)) & "sorted2.txt"

    do shell script "sort" & space & quoted form of f1 & space & ">" & space & quoted form of t1

    do shell script "sort" & space & quoted form of f2 & space & ">" & space & quoted form of t2

    do shell script "comm -2 -3" & space & quoted form of t1 & space & quoted form of t2 & space & "> ~/Desktop/diff.txt"

    do shell script "rm" & space & quoted form of t1 & space & quoted form of t2
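
For anyone who would rather see the shell side on its own, here is a minimal sketch of the same sort-then-`comm` steps, run on two tiny made-up lists. The file names and URLs are placeholders, not taken from the original files. `comm -2 -3` suppresses lines found only in the second file and lines common to both, so only lines unique to the first file remain. It assumes both inputs are sorted the same way and that each line holds nothing but a URL.

```shell
#!/bin/sh
# Sketch only: sample data and file names are invented for illustration.
export LC_ALL=C   # use one collation order for both sort and comm
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

printf '%s\n' 'http://example.com/c.jpg' 'http://example.com/a.jpg' > "$work/small.txt"
printf '%s\n' 'http://example.com/b.jpg' 'http://example.com/a.jpg' > "$work/large.txt"

# comm expects sorted input, so sort each list first.
sort "$work/small.txt" > "$work/sorted1.txt"
sort "$work/large.txt" > "$work/sorted2.txt"

# -2 drops lines found only in the second file, -3 drops lines found in both.
comm -2 -3 "$work/sorted1.txt" "$work/sorted2.txt" > "$work/diff.txt"

cat "$work/diff.txt"
```

With this sample data, only the `c.jpg` line is unique to the smaller list.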

  • Frank Caggiano Level 7 (25,695 points)

    What's in the text files besides the URLs? What is the format of the text files? Does every line in each file have a URL on it, or are there lines that are just text?

     

    Please include a few sample lines from each file.

  • cdworin Level 1 (0 points)

    Tony,

     

    Thanks very much for this. I appreciate it. Unfortunately, I don't think your solution will work for me, as the URLs are embedded in a string of other characters, and sorting will not put the URLs in any kind of useful sort order.

     

    To be more specific, my smaller list has hundreds of URLs in a list like this:

     

    The Langham, Boston
    http://images.travelnow.com/hotels/1000000/10000/2600/2558/2558_66_b.jpg

    And the much longer list has 100,000 lines something like this:

     

    4110|Resorts Casino Hotel Atlantic City||http://images.travelnow.com/hotels/1000000/50000/40200/40186/40186_12_b.jpg|Expedia Hotels|350|350||http://images.travelnow.com/hotels/1000000/50000/40200/40186/40186_12_t.jpg|False

     

    I'm trying to determine whether the image URL for The Langham, Boston (and the other 600 images for various hotels) no longer appears embedded in the strings in the longer list.

  • cdworin Level 1 (0 points)

    Frank,

     

    Thanks for offering to help. See the answer I just posted a few minutes ago in reply to Tony for examples. I can get both files into plain text format. Not every line will have a URL on it.

     

    Chris

  • cdworin Level 1 (0 points)

    Perhaps this will clarify what I want, at least conceptually:

     

    Look for first URL in File1

    Search File2 for that URL

    If URL is found in File2, go to next step

    If URL is not found in File2, output that URL to a list and go to next step

     

    Find second URL in File1

    repeat

    repeat

    repeat

    ..... until all URLs in File1 have been searched for in File2
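
The loop described above can be sketched directly in the shell. This is only an illustration on invented sample data: the file names, URLs, and the rule "a line starting with http:// or https:// is a URL" are assumptions, not something confirmed for the real files. `grep -F -q` treats each URL as a fixed string and reports only whether it was found anywhere in File2.

```shell
#!/bin/sh
# Sketch only: sample data and file names are invented for illustration.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# File1: names and URLs on alternating lines, as in the small list.
cat > "$work/file1.txt" <<'EOF'
The Langham, Boston
http://example.com/hotels/2558_66_b.jpg
Some Other Hotel
http://example.com/hotels/9999_01_b.jpg
EOF

# File2: pipe-delimited records with the URL embedded in each line.
cat > "$work/file2.txt" <<'EOF'
4110|Some Hotel||http://example.com/hotels/2558_66_b.jpg|Source|350|350|False
EOF

: > "$work/missing.txt"
# Read File1 line by line; only lines that start with http are treated as URLs.
while IFS= read -r line; do
    case $line in
        http://*|https://*)
            # -F: fixed string, not a regex; -q: no output, just an exit status.
            if ! grep -F -q -e "$line" "$work/file2.txt"; then
                printf '%s\n' "$line" >> "$work/missing.txt"
            fi
            ;;
    esac
done < "$work/file1.txt"

cat "$work/missing.txt"
```

This scans File2 once per URL, which is simple but not the fastest approach for 600 URLs against 100,000 lines. It also matches substrings, so a URL that is a prefix of a longer one in File2 would count as found.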

  • Frank Caggiano Level 7 (25,695 points)

    So you are just interested in the URLs, not in any of the other text. And the output is just the URLs, not the surrounding text or anything else?

     

    What I see as one possible solution is to first go through each file and strip out the URLs into two separate files, and then use those files to generate the list. But this will only work if you are not interested in the context the URLs are in.

  • cdworin Level 1 (0 points)

    Frank,

     

    You're correct that all I need as the output is the bare URL of any URL that appears in File1 but not in File2. Each week I don't expect that the script would generate more than a handful of results. So, if the output of the script was, for example, five URLs from File1 that no longer appear in File2 I could do a subsequent manual search of File2 for those five URLs and find the specific context.

     

    But, if the pre-requisite to your possible solution is to strip out all the URLs from File2 it seems to me that the scripting required to accomplish that would be as complex as the script I'm requesting that simply searched File2 for each URL in File1, consecutively, and flagged in some way any URL search that results in a null set. But then I'm not a programmer...

     

    Thanks,

     

    Chris

  • Frank Caggiano Level 7 (25,695 points)

    Are you familiar with Terminal and the command line? If so, try running this command on one of the files:

     

    # grep -o -E '(https?|ftp|file)://.+' file

     

    It should print just the URLs in file.
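
One thing worth checking when trying this on the longer file: `.+` is greedy, so on a pipe-delimited line the match runs from the first URL to the end of the line. The sketch below shows the difference on one invented record. The bounded pattern is only a suggestion and assumes the URLs themselves never contain `|` or whitespace.

```shell
#!/bin/sh
# Sketch only: the sample record is invented for illustration.
sample='4110|Some Hotel||http://example.com/a_b.jpg|Source|350||http://example.com/a_t.jpg|False'

# Greedy '.+' runs to the end of the line, so the match carries the
# trailing fields along with the first URL.
greedy=$(printf '%s\n' "$sample" | grep -o -E '(https?|ftp|file)://.+')

# Stopping at '|' or whitespace yields each URL separately.
# This assumes URLs in the data never contain '|' or spaces.
bounded=$(printf '%s\n' "$sample" | grep -o -E '(https?|ftp|file)://[^|[:space:]]+')

printf 'greedy:\n%s\n' "$greedy"
printf 'bounded:\n%s\n' "$bounded"
```

On the smaller list, where each URL sits alone on its own line, both patterns should give the same result.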

     

    If you're not used to the terminal I can package that in a script, but I just wanted to get confirmation that it will work on your files before going forward.

  • Frank Caggiano Level 7 (25,695 points)

    OK, really quick and dirty, and it will need work to make it a final script, but this

    (*

     

              The pattern file is the smaller file with the URLs that we will look for in the URL file

     

    *)

     

    set file1 to POSIX path of (choose file with prompt "Select patteren file:" default location alias (the path to desktop folder as text))

    set file2 to POSIX path of (choose file with prompt "Select URL file:" default location alias (the path to desktop folder as text))

     

     

    do shell script "grep -o -E '(https?|ftp|file)://.+' " & file1 & " > ~/patternFile"

     

    do shell script "grep -o -E '(https?|ftp|file)://.+' " & file2 & " > ~/urlFile"

     

    do shell script "grep -v -f ~/patternFile ~/urlFile > ~/missingUrlFile"

    I do believe will do what you want.

     

    The first prompt will ask for the pattern file, that is, the file that has the URLs that should be in the second file.

    The second prompt will ask for the URL file, that is, the file that has all the URLs.

     

    The output will be a file in your home directory called missingUrlFile, and it should have the URLs that are in the pattern file but not in the URL file.

     

    As I said, it needs work: there is no error checking, and the temporary files are left behind so that if it doesn't work I can see what was going on.
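
A note on direction when testing: `grep -v -f A B` prints the lines of B that match no pattern in A. To list pattern-file URLs that are absent from the URL file, the URL file therefore has to supply the patterns and the pattern file has to be the one searched. The sketch below shows that arrangement on invented data. It assumes both files already hold one bare URL per line, and the `-x` and `-F` flags (whole-line, fixed-string matching) are my additions.

```shell
#!/bin/sh
# Sketch only: sample data is invented; file names mirror the temp files in the post.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

printf '%s\n' 'http://example.com/a.jpg' 'http://example.com/gone.jpg' > "$work/patternFile"
printf '%s\n' 'http://example.com/a.jpg' 'http://example.com/b.jpg' > "$work/urlFile"

# Lines of patternFile that equal no line of urlFile.
# grep exits with status 1 when it selects nothing, hence '|| true'.
grep -v -x -F -f "$work/urlFile" "$work/patternFile" > "$work/missingUrlFile" || true

cat "$work/missingUrlFile"
```

The exit status matters inside AppleScript because `do shell script` reports an error whenever the command returns a non-zero status, which is what grep does when nothing is missing.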

     

    Give it a shot and see what happens.

     

    regards

  • cdworin Level 1 (0 points)

    Thanks so much, Frank.  I tried putting the entire code you sent to me into the Applescript editor and ran it. It did appear to run and it successively asked me to select the two files, which I did. But it then found errors. Those files are on my desktop, so I'm not sure why the "no such files or directory" error message is coming up.

     

    Applescript error.png

     

    I also tried it by deleting the first lines, so that the script started with "Set file1...." Same result.

     

    Thanks,

     

    Chris

  • Frank Caggiano Level 7 (25,695 points)

    Hard to read the screenshot. Next time, cut and paste the text into the reply. Also select Replies as the display for the window in AppleScript Editor.

     

    About the "No such file or directory" error: are you sure that the file exists? It looks like the file is called List and it is on your desktop?

     

    Just tried it here and it is working OK

     

    Here is another copy just in case something got messed up in the first one:

     

    (*

     

              The pattern file is the smaller file with the URLs that we will look for in the URL file

     

    *)

     

    set file1 to POSIX path of (choose file with prompt "Select patteren file:" default location alias (the path to desktop folder as text))

    set file2 to POSIX path of (choose file with prompt "Select URL file:" default location alias (the path to desktop folder as text))

     

     

    do shell script "grep -o -E '(https?|ftp|file)://.+' " & file1 & " > ~/patternFile"

     

    do shell script "grep -o -E '(https?|ftp|file)://.+' " & file2 & " > ~/urlFile"

     

    do shell script "grep -v -f ~/patternFile ~/urlFile > ~/missingUrlFile"

     

     

    --- stop copying above this line

    Just so you can see what the output in the Replies window of Applescript will look like:

    tell current application

      path to desktop as text

      --> "Mac OS Lion:Users:frank:Desktop:"

    end tell

    tell application "AppleScript Editor"

      choose file with prompt "Select patteren file:" default location alias "Mac OS Lion:Users:frank:Desktop:"

      --> alias "Mac OS Lion:Users:frank:Desktop:f1"

    end tell

    tell current application

      path to desktop as text

      --> "Mac OS Lion:Users:frank:Desktop:"

    end tell

    tell application "AppleScript Editor"

      choose file with prompt "Select URL file:" default location alias "Mac OS Lion:Users:frank:Desktop:"

      --> alias "Mac OS Lion:Users:frank:Desktop:f2"

    end tell

    tell current application

      do shell script "grep -o -E '(https?|ftp|file)://.+' /Users/frank/Desktop/f1 > ~/patternFile"

      --> ""

      do shell script "grep -o -E '(https?|ftp|file)://.+' /Users/frank/Desktop/f2 > ~/urlFile"

      --> ""

      do shell script "grep -v -f ~/patternFile ~/urlFile > ~/missingUrlFile"

      --> ""

    end tell

    Result:

    ""

     

    Message was edited by: Frank Caggiano - See Tony's post below

  • Tony T1 Level 6 (8,685 points)

    The error is due to spaces in the filename

     

    Just add quoted form of

     

      do shell script "grep -o -E '(https?|ftp|file)://.+' " & quoted form of file1 & " > ~/patternFile"

     

      do shell script "grep -o -E '(https?|ftp|file)://.+' " & quoted form of file2 & " > ~/urlFile"
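
To show why the space matters, here is a small shell sketch using an invented file name. AppleScript's `quoted form of` wraps the path in single quotes so the shell receives it as one argument. The double quotes around `"$path"` below play the same role in a plain shell script.

```shell
#!/bin/sh
# Sketch only: the file name and contents are invented for illustration.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

path="$work/URL List.txt"
printf '%s\n' 'http://example.com/a.jpg' > "$path"

# Unquoted: the shell splits the path at the space and grep is handed
# two file names, neither of which exists.
if grep -c 'http' $path >/dev/null 2>&1; then unquoted=ok; else unquoted=failed; fi

# Quoted: the whole path reaches grep as a single argument.
if grep -c 'http' "$path" >/dev/null 2>&1; then quoted=ok; else quoted=failed; fi

echo "unquoted: $unquoted, quoted: $quoted"
```

The unquoted call fails in the same way as the "No such file or directory" error reported above.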


  • Frank Caggiano Level 7 (25,695 points)

    That's probably it but do you see spaces in the filename he input?

     

    I took the liberty of adding your changes into the script to make it easier for the OP, thanks.

     

    Revised script with quoted form of added in

     

     

    (*

     

              The pattern file is the smaller file with the URLs that we will look for in the URL file

     

    *)

     

    set file1 to POSIX path of (choose file with prompt "Select patteren file:" default location alias (the path to desktop folder as text))

    set file2 to POSIX path of (choose file with prompt "Select URL file:" default location alias (the path to desktop folder as text))

     

     

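    -- Stop each match at "|" or whitespace so only the URL itself is captured
    do shell script "grep -o -E '(https?|ftp|file)://[^|[:space:]]+' " & quoted form of file1 & " > ~/patternFile"

    do shell script "grep -o -E '(https?|ftp|file)://[^|[:space:]]+' " & quoted form of file2 & " > ~/urlFile"

    -- Print the pattern-file URLs that equal no line of the URL file
    -- "|| true" keeps an empty result (grep exit status 1) from raising an AppleScript error
    do shell script "grep -v -x -F -f ~/urlFile ~/patternFile > ~/missingUrlFile || true"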

     

    Message was edited by: Frank Caggiano
