gefaria

Q: Check for duplicate files

I've seen "duplicate" scenarios before, and even a bunch of Mac apps that help find and/or get rid of duplicate files on the hard disk to free up some space. That is not my case this time, though, and by the way, this is NOT a weird case. I'm sure many of us find ourselves in a similar situation every once in a while.

So please, keep on reading.

I have (1) a small folder containing hundreds of single files with no subfolders, and (2) a big folder containing thousands of files, folders, subfolders and more files, arranged in a complex directory scheme.

 

Most of the files in folder (1) have a copy in folder (2), and that is OK. In fact, what I need is to make sure that ALL files in folder (1) have an exact copy in folder (2), regardless of where in the hierarchy that copy sits.

I can go through the process of checking each file, one by one, and, once I find its copy in folder (2), delete it from folder (1). That way I will end up with a tiny folder (1) containing only the few files whose copies couldn't be found in folder (2).

Is there any way to automate the process, using Automator perhaps? Do you know an app that can help me achieve that?

Thanks a lot! 

Posted on Jun 21, 2016 11:53 AM


  • by Camelot, Helpful

    Camelot, Jun 21, 2016 11:04 PM in response to gefaria

    Conceptually this is pretty straightforward - a little AppleScript, or maybe shell script and you're done.

     

    As always with these kinds of questions, though, the devil is in the details. In this case, what, specifically, constitutes a match? Are files with the same name considered a match? What if one is newer/older/larger/smaller than the other? Are they still considered a match? Which one should you keep? The newest? The biggest?

     

    Once you clarify that, the rest should be easy.

  • by gefaria

    gefaria, Jun 21, 2016 11:12 PM in response to Camelot

    Thanks for your answer!

    The only action should be to delete, or not delete, the file in folder (1), not to choose between the two by comparing them. And I believe a match should require two things to be equal: the name (including the extension) and the size. Can you write the script? I just don't know how to code.

  • by Camelot, Solved answer

    Camelot, Jun 23, 2016 1:00 AM in response to gefaria

    The following script (minimally tested!) should do what you want. Copy the script into a new Script Editor document and run it. (The delete command is actually a misnomer: it only moves the files to the Trash, so there's still a chance of recovery.)

     

    set folder1 to (choose folder with prompt "Please select the folder to be cleaned")
    set folder2 to (choose folder with prompt "Please select the folder to compare")

    -- fast way to get a list of the files in a directory
    set fileList to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -exec basename {} \\;"
    set fileList to paragraphs of fileList

    tell application "Finder"
      set folder1Files to every file of folder1
      repeat with eachFile in folder1Files
        set fName to name of eachFile
        if fName is in fileList then
          -- we have a filename match
          set f1Size to size of eachFile as integer
          set matchingf2File to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -name " & quoted form of fName & " -size " & f1Size & "c"
          if matchingf2File is not "" then
            -- we have a duplicate, so:
            delete eachFile
          end if
        end if
      end repeat
    end tell


     

    It probably needs some explanation...

     

    It starts off by prompting for two folders - the first should be the one that contains the flat directory that you want to clean up (folder1). The second should be the one with the hierarchal directories you want to search in (folder2).

     

    set folder1 to (choose folder with prompt "Please select the folder to be cleaned")

    set folder2 to (choose folder with prompt "Please select the folder to compare")

     

    Then it uses a shell command find to get a listing of all the files in folder2. I do this because the Finder is notoriously slow in traversing large directory trees, so even though using the Finder would be simpler, it would be much slower.

     

    set fileList to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -exec basename {} \\;"

    set fileList to paragraphs of fileList

     

    Now I iterate through the files in folder1, checking the name of the file against the cached list of files in folder2.

     

      set folder1Files to every file of folder1

      repeat with eachFile in folder1Files

      set fName to name of eachFile

      if fName is in fileList then


    If there are no matches I move on to the next file, otherwise we at least have a file that has the same name, so we need to check its size.

     

      set f1Size to size of eachFile as integer

      set matchingf2File to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -name " & quoted form of fName & " -size " & f1Size & "c"

     

    Here I use another shell trick. I first get the size of the current file. I then perform another find for a file in folder2 that has the same name and the same size as the current file. If I get back an empty result I know the file sizes are different, so I leave the file alone, but if the file sizes match I know it's safe to delete the file.
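The name-plus-size check described here can also be tried on its own in Terminal before trusting it with deletions. The following is only a minimal sketch, assuming a POSIX shell; the function name `has_copy` is hypothetical and not part of the script above. One caveat applies to the AppleScript as well: `find -name` treats its argument as a pattern, so a file name containing `*`, `?` or `[` may match more than the literal name.

```shell
# has_copy FILE TREE  (hypothetical helper, for illustration only)
# Succeeds if some regular file under TREE has the same name and the
# same size in bytes as FILE.
has_copy() {
    file=$1
    tree=$2
    # wc -c reports the size in bytes; tr strips the padding some systems add
    size=$(wc -c < "$file" | tr -d ' ')
    # "-size Nc" matches files of exactly N bytes
    match=$(find "$tree" -type f -name "${file##*/}" -size "${size}c" | head -n 1)
    [ -n "$match" ]
}
```

For example, `has_copy "/path/to/folder1/photo.jpg" "/path/to/folder2" && echo duplicate` prints `duplicate` only when a same-named file of identical size exists somewhere under the second folder.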

     

    I know this may sound convoluted, but for a large directory tree, with a large number of files in folder1, it would be cumbersome/unwieldy to perform a full depth traversal of folder2 for every file, so I first cache the list of file names and just do a secondary search for those that have matching filenames.
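For very large folders, the same idea (cache first, then look up) can be pushed a step further by caching the size along with the name, so that folder2 is traversed only once. This is a rough sketch under stated assumptions, not a replacement for the script above: `list_dupes` is a hypothetical name, the code assumes a POSIX shell and file names without newlines, it skips hidden files in the flat folder, and it only prints candidates rather than deleting anything.

```shell
# list_dupes FLAT_DIR TREE_DIR  (hypothetical helper, for illustration only)
# Prints every regular file directly inside FLAT_DIR whose name and byte
# size both occur somewhere under TREE_DIR.
list_dupes() {
    flat=$1
    tree=$2
    index=$(mktemp) || return 1

    # Single pass over the big tree: one "size/name" line per file.
    # A file name cannot contain "/", so the key is unambiguous.
    find "$tree" -type f -exec sh -c '
        for f do
            printf "%s/%s\n" "$(wc -c < "$f" | tr -d " ")" "${f##*/}"
        done' sh {} + > "$index"

    # Look up each file of the flat folder in the cached index.
    for f in "$flat"/*; do
        [ -f "$f" ] || continue
        key="$(wc -c < "$f" | tr -d ' ')/${f##*/}"
        # -x: whole-line match, -F: literal string (no pattern characters)
        if grep -qxF -- "$key" "$index"; then
            printf '%s\n' "$f"
        fi
    done

    rm -f "$index"
}
```

Running `list_dupes "/path/to/folder1" "/path/to/folder2" > dupes.txt` leaves a list that can be reviewed before anything is moved to the Trash. Because the lookup is a literal whole-line match, names containing `*`, `?` or `[` are compared exactly.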

  • by gefaria

    gefaria, Jun 23, 2016 1:10 AM in response to Camelot

    Camelot,

     

    Thank you very VERY much for your kindness and the time you invested in such altruistic help. I hope this little script helps not only me, but other people as well. I'm running it as I write; according to my calculations it will take about a week to finish (since I hear the little trash sound every time it deletes a file), but it is obviously much better than doing it manually.

     

    Please contact me through my web site if there is anything you need, I owe you one, a BIG one,

     

    GF

    manoDerecha

    www.manoderecha.es

  • by gefaria

    gefaria, Jun 24, 2016 7:21 AM in response to Camelot

    Camelot,

     

    I'm sad to tell you that the script is not working any more (at least on my Mac, with the folders I need it to work on).

     

    It ran fine (very slowly) the first time, then crashed after an hour or two, and I restarted it a couple of times after that. Last night I left it running and found it this morning with a "time out" error. Maybe it is because both folders are enormous (folder 1 is almost 3 GB with nearly 30,000 files, and folder 2 is more than ten times that), or because of some issue like a memory leak; I don't know.

     

    The only solution I can think of is to re-run it after restarting my Mac, and that isn't helping. It works great with smaller folders, though; I tested it a little.

     

    Any suggestion?

  • by VikingOSX

    VikingOSX, Jun 24, 2016 8:50 AM in response to gefaria

    Wrap your Finder tell block with:

     

    with timeout of 43200 seconds -- 12 hours
      tell application "Finder"
        -- the existing contents of the Finder tell block go here
      end tell
    end timeout

  • by Camelot

    Camelot, Jun 24, 2016 10:58 AM in response to gefaria

    As I said before... the devil is in the details

     

    30,000 is a lot of files to have in one directory. From what you said I expected it to be large, which is why I opted for a find in the shell rather than the Finder, but even then there are limitations.

     

    That said, since you're still getting errors, the simplest solution would be to break the script down into a loop that processes a subset of the files each time, something like replacing the line:

     

      set folder1Files to every file of folder1

     

    with:

     

      repeat with firstChar in {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"}

      set folder1Files to (every file of folder1 whose name begins with (contents of firstChar))

       ...

     

    (with a corresponding 'end repeat' at the end). This will break the list of 30,000 files into (hopefully) more manageable chunks based on the first character. You can change that to break on any criterion you like.
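One thing worth checking before relying on letter-based chunks: any file whose name starts with a digit, an underscore or another non-letter character falls outside a list of letters and would never be processed. The sketch below is a hedged way to see which names those are; `uncovered_names` is a hypothetical helper, it assumes a POSIX shell, and it treats upper- and lower-case letters alike on the assumption that AppleScript's `begins with` ignores case by default. Accented first letters may behave differently depending on locale.

```shell
# uncovered_names DIR  (hypothetical helper, for illustration only)
# Prints the names of regular files directly inside DIR that do not
# start with a letter A-Z or a-z, i.e. names a letter-only loop would skip.
uncovered_names() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        case $name in
            [A-Za-z]*) ;;                    # covered by one of the letter chunks
            *) printf '%s\n' "$name" ;;      # would be skipped
        esac
    done
}
```

If `uncovered_names "/path/to/folder1"` prints anything, the list in the repeat loop can be extended with the missing first characters (for example "0" through "9").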