

Can applescript search one file for non-matches in another file?

I'd like to know if it's possible for Applescript to do the following:


Take a text file with 600+ URLs and compare it to another text file with 100,000+ URLs and set out as a result any URL from the first small list that is NOT included in the second larger list?


If it is possible to do this, is there anyone out there willing to create such a script for a modest fee?

Posted on Oct 6, 2013 2:18 PM

46 replies

Oct 6, 2013 9:05 PM in response to Frank Caggiano

Frank and Tony,


Thanks very much for jumping in and for all the help. You guys are da bomb!


I decided to take the easier route (for me) of just changing the File1 file name so it didn't have any spaces. Then I truncated the first file just so that the script would run faster, leaving only a dozen or so URLs in File1 to be checked against the large File2. I then purposely changed one of those URLs in File1 in a manner that would ensure that particular URL wouldn't appear in File2. The results in the missingUrlFile file should have been a single line with that errant URL in it.


But, this time the script ran for several minutes after asking me for the two file names. In the Applescript window itself, the results window showed just double quotes: "" . I found the file named missingUrlFile, but instead of it being just a couple of K in size (since it should have contained only a single URL as a result), it was 686 MB! It seemed to contain most of the contents of File2.


Any suggestions?

Oct 6, 2013 9:16 PM in response to cdworin

Correction: the result I describe above was from "Find" in the Finder. When I opened up my username folders and files, missingUrlFile was there, at zero bytes. A new file called patternFile had been added, at 668.5 MB. My original File1 was 4 KB in size, and the original File2 was 781 MB. Still no results file that I could see with that single URL as the result.

Oct 6, 2013 9:18 PM in response to cdworin

I tried it out with two files I created with some urls and surrounding text and it worked.


The best thing now would be for you to make available two of your actual files, assuming they are not confidential. If you have a Dropbox account that would be easiest. If not, we can make other arrangements.


If you can make the files available, I'll look at it in the morning unless it's solved before then.


Regards

Oct 7, 2013 12:34 AM in response to cdworin

In vanilla AppleScript you would do something like this:


-- specify filepaths
set fileOne to "/path/to/list_file"
set fileTwo to "/path/to/compare_file"
set outputFile to "/path/to/output file.txt"

-- read the list file and the main file
set fileOneText to read fileOne
set fileTwoText to read fileTwo

-- open/create the output file for writing
set fp to open for access outputFile with write permission

repeat with thisurl in (paragraphs of fileOneText)
	-- each paragraph in the list file is a single url, so check to see if it exists in the main file
	if fileTwoText contains thisurl then
		-- url exists, do what you like, if anything
	else
		-- url does not exist, write to output file on separate line
		write thisurl & return to fp
	end if
end repeat

close access fp
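
For comparison, here is a rough shell sketch of the same idea: a literal substring check of each list line against the main file. It is only an illustration under assumptions. The file names and sample data are hypothetical, and the list file is assumed to hold one URL per line.

```shell
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' 'http://example.com/a.jpg' 'http://example.com/x.jpg' > "$tmp/list.txt"
printf '%s\n' 'Hotel A|http://example.com/a.jpg|more text' > "$tmp/main.txt"
: > "$tmp/missing.txt"

while IFS= read -r url; do
    # Skip blank lines in the list file.
    [ -n "$url" ] || continue
    # -F matches the URL as a literal string; -q only sets the exit status.
    if ! grep -qF -- "$url" "$tmp/main.txt"; then
        printf '%s\n' "$url" >> "$tmp/missing.txt"
    fi
done < "$tmp/list.txt"

missing=$(cat "$tmp/missing.txt")
printf '%s\n' "$missing"
rm -r "$tmp"
```

Like the `contains` test, this is a substring match, so a URL that is a prefix of a longer one in the main file counts as present. It also rescans the main file once per URL, which may be slow with a very large main file.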

Oct 7, 2013 12:36 AM in response to Frank Caggiano

Frank,


You emailed me and said:


"

I tried it out with two files I created with some urls and surrounding text and it worked.


The best thing now would be for you to make available two of your actual files, assuming they are not confidential. If you have a Dropbox account that would be easiest. If not, we can make other arrangements.


If you can make the files available, I'll look at it in the morning unless it's solved before then.


Regards"


For some reason that message didn't appear in the discussion thread. I'm probably just looking in the wrong place.


In any event, unfortunately, the files are confidential. But I've created amended, truncated versions of them that should serve the same purpose. I've put four files in a Dropbox folder:


A) File1.txt (the small file),

B) File2.txt (the large file against which File1 is compared)

C) TheResultsAsTheyshouldBe.txt (which is the one URL from File1 that is not in File2--i.e., this is what the results should look like), and

D) missingUrlFile.txt (what the output results actually look like).


Here's the Dropbox link: https://www.dropbox.com/sh/90b8vzyuoaok3lp/zbNVQY9pmv


Thanks for all this!


Chris

Oct 7, 2013 5:30 AM in response to cdworin

For some reason that message didn't appear in the discussion thread. I'm probably just looking in the wrong place.


Yep, the software running this site is having massive problems. A lot of users (myself included) are experiencing the same thing. You can force 'hidden' posts to be visible by adding a post to the thread.


Anyway, I looked at the files you put into Dropbox and see the problem. I was not expecting multiple URLs on a single line. The example you posted looked like one URL per line, but looking back at it I see that was how the line got wrapped when you pasted it in.


Will need to play with this a bit. The simple solution I showed using grep to get the URLs out of the file might not work.


One final question: the file1 you placed in Dropbox has only a list of URLs, one per line, with no other text in the file. But in your example you showed file1 as looking like



The Langham, Boston
http://images.travelnow.com/hotels/1000000/10000/2600/2558/2558_66_b.jpg


with a text title above the URL. Which way will it be?
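
On the point above about several URLs sharing one line, a small sketch of how `grep -o` can still pull each URL out separately. It assumes that URLs never contain whitespace or a `|` character and that they end in `.jpg`. The sample line is hypothetical, not taken from the actual files.

```shell
# Hypothetical sample line with two URLs separated by a pipe.
line='Hotel A|http://example.com/a.jpg|http://example.com/a_t.jpg'

# The negated character class stops each match at a pipe or whitespace,
# so a match cannot run on from one URL into the next.
urls=$(printf '%s\n' "$line" | grep -o -E 'https?://[^|[:space:]]+\.jpg')
printf '%s\n' "$urls"
```

With `-o`, grep prints each match on its own line, which gives the one-URL-per-line form that a later sort and compare step needs.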

Oct 7, 2013 7:18 AM in response to cdworin

Try this (I modified slightly Frank's excellent grep solution):


Note: this assumes all URLs end with ".jpg". If this is not the case, then only the grep pattern needs to be modified. Let us know if your URLs end in anything other than .jpg (even if you're only interested in JPGs).




set f1 to POSIX path of (choose file with prompt "Select smaller file to compare:" default location alias (the path to desktop folder as text))

set f2 to POSIX path of (choose file with prompt "Select larger file to compare:" default location alias (the path to desktop folder as text))

set u1 to (POSIX path of (path to "temp" from user domain)) & "URL1.txt"

set u2 to (POSIX path of (path to "temp" from user domain)) & "URL2.txt"

set t1 to (POSIX path of (path to "temp" from user domain)) & "sorted1.txt"

set t2 to (POSIX path of (path to "temp" from user domain)) & "sorted2.txt"

do shell script "grep -o -E '(https?|ftp|file)://.+?\\.jpg' " & f1 & " > " & u1

do shell script "grep -o -E '(https?|ftp|file)://.+?\\.jpg' " & f2 & " > " & u2

do shell script "sort" & space & u1 & space & ">" & space & t1

do shell script "sort" & space & u2 & space & ">" & space & t2

do shell script "comm -2 -3" & space & t1 & space & t2 & space & "> ~/Desktop/diff.txt"

do shell script "rm" & space & t1 & space & t2 & space & u1 & space & u2

Oct 7, 2013 9:02 AM in response to Tony T1

Oops! Forgot "quoted form of" 😊


set f1 to quoted form of (POSIX path of (choose file with prompt "Select smaller file to compare:" default location alias (the path to desktop folder as text)))

set f2 to quoted form of (POSIX path of (choose file with prompt "Select larger file to compare:" default location alias (the path to desktop folder as text)))

set u1 to quoted form of ((POSIX path of (path to "temp" from user domain)) & "URL1.txt")

set u2 to quoted form of ((POSIX path of (path to "temp" from user domain)) & "URL2.txt")

set t1 to quoted form of ((POSIX path of (path to "temp" from user domain)) & "sorted1.txt")

set t2 to quoted form of ((POSIX path of (path to "temp" from user domain)) & "sorted2.txt")

do shell script "grep -o -E '(https?|ftp|file)://.+?\\.jpg' " & f1 & " > " & u1

do shell script "grep -o -E '(https?|ftp|file)://.+?\\.jpg' " & f2 & " > " & u2

do shell script "sort" & space & u1 & space & ">" & space & t1

do shell script "sort" & space & u2 & space & ">" & space & t2

do shell script "comm -2 -3" & space & t1 & space & t2 & space & "> ~/Desktop/diff.txt"

do shell script "rm" & space & t1 & space & t2 & space & u1 & space & u2



Note: this assumes all URLs end with ".jpg". If this is not the case, then only the grep pattern needs to be modified. Let us know if your URLs end in anything other than .jpg (even if you're only interested in JPGs).
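
To show what the sort and `comm -2 -3` steps do, here is a tiny worked sketch on made-up data. The directory, file names and URLs are hypothetical, and both inputs are assumed to already hold one URL per line.

```shell
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' 'http://example.com/b.jpg' 'http://example.com/a.jpg' 'http://example.com/x.jpg' > "$tmp/small.txt"
printf '%s\n' 'http://example.com/c.jpg' 'http://example.com/a.jpg' 'http://example.com/b.jpg' > "$tmp/large.txt"

# comm expects both inputs sorted the same way; LC_ALL=C keeps the order byte-wise.
LC_ALL=C sort -u "$tmp/small.txt" > "$tmp/small.sorted"
LC_ALL=C sort -u "$tmp/large.txt" > "$tmp/large.sorted"

# -2 drops lines found only in the second file, -3 drops lines common to both,
# leaving only the lines unique to the first file.
missing=$(LC_ALL=C comm -23 "$tmp/small.sorted" "$tmp/large.sorted")
printf '%s\n' "$missing"
rm -r "$tmp"
```

Only `http://example.com/x.jpg` appears in the small list and not in the large one, so that is the single line left over.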

Oct 7, 2013 9:28 AM in response to cdworin

Rewritten with choose file commands rather than path specifications. All I really meant for you to do was to edit in the POSIX paths to the files you're working with, but this works just as well.


-- choose the file with the list of URLs to check
set fileOne to choose file with prompt "Choose URL list file"

-- choose the file with the HTML to search
set fileTwo to choose file with prompt "Choose file to be searched for URLs"

-- unused URLs will be saved in this file on the desktop
set outputFile to (POSIX path of (path to desktop from user domain)) & "missing urls.txt"

-- read the list file and the main file
set fileOneText to read fileOne
set fileTwoText to read fileTwo

-- open/create the output file for writing
set fp to open for access outputFile with write permission

repeat with thisurl in (paragraphs of fileOneText)
	-- each paragraph in the list file is a single url, so check to see if it exists in the main file
	if fileTwoText contains thisurl then
		-- url exists, do what you like, if anything
	else
		-- url does not exist, write to output file on separate line
		write thisurl & return to fp
	end if
end repeat

close access fp

Oct 7, 2013 10:49 AM in response to cdworin

Hello


You may also try the following AppleScript script, which is a simple wrapper around a Perl script. It will ask you to choose the master file and the URL file(s), and it will create an output file named "missing_urls.txt" on the Desktop. The script ignores thumbnail URLs in the master file. The output file is UTF-8 with LF line endings.


It will take a while to process a huge master file as large as 800 MB.


Good luck,

H


--applescript
set f0 to choose file with prompt "Choose master file"
set ff to choose file with prompt "Choose url file(s) to be tested" with multiple selections allowed
set args to ""
repeat with f in {f0} & ff
    set args to args & f's POSIX path's quoted form & space
end repeat

do shell script "/usr/bin/perl -CSDA -w <<'EOF' - " & args & " > ~/Desktop/missing_urls.txt
use strict;
use open IN => ':crlf';

die qq(Usage: $0 <corpus file> <file> [<file> ...]) if @ARGV - 2 < 0;
my %hash = ();
open(CORPUS, '<', $ARGV[0]) or die qq($!);
while (<CORPUS>) {
    $hash{$1} = 1 if m%(https?://[^|]+?)(\\||$)%o;
}
close CORPUS;
shift @ARGV;
local $\\=qq(\\n);
while (<>) {
    chomp;
    print unless length == 0 || defined($hash{$_});
}
EOF"
--end of applescript
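
The hash lookup in the Perl code above can also be sketched with awk. This is a simplified illustration, not a replacement for the script: it assumes the master URLs have already been extracted to one per line, and the file names and sample data are hypothetical.

```shell
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' 'http://example.com/a.jpg' 'http://example.com/b.jpg' > "$tmp/master_urls.txt"
printf '%s\n' 'http://example.com/a.jpg' '' 'http://example.com/x.jpg' > "$tmp/list.txt"

# While reading the first file (NR == FNR), store each line as a key.
# While reading the second file, print non-empty lines that were never stored.
missing=$(awk 'NR == FNR { seen[$0] = 1; next } length($0) > 0 && !($0 in seen)' \
    "$tmp/master_urls.txt" "$tmp/list.txt")
printf '%s\n' "$missing"
rm -r "$tmp"
```

Each file is read only once and every lookup is a whole-line match, so the running time should grow roughly with the combined size of the two files rather than with their product. Note that the `NR == FNR` idiom misbehaves if the first file is empty.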
