A new revision which now handles -known- bundles/packages properly (it will, for instance, deal with duplicate .pages files and their duplicated names.) The list of what counts as a known bundle is within the script content (which can be viewed in TextEdit.) It will continue to ignore unknown bundle types.
http://www.mediafire.com/?zzriygz0oyq
The method I used to accommodate bundles was to handle them separately from normal files (which means additional time, of course, due to a secondary loop.) Bundles are found with 'find', and then the entire content of both forks of every file within each bundle is piped to md5 for a total checksum of the bundle's content. The bundle's path (the path to the top folder of the bundle only) and the checksum are then added to the list as though it were a normal file. The bundle's checksum and name are then compared in the same manner as any other file, and the bundle is renamed or trashed if needed, with one difference in how trashing is handled. Because a bundle is a directory, the move command 'mv' won't overwrite an existing, non-empty directory in the trash, so the script moves it into the trash and, if a name conflict occurs, moves it to the trash with a random number as a name (only when the checksum is an identical match, of course.)
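For illustration only, here is a minimal sketch of that bundle pass - the top-level path and the bundle-extension list are placeholders, not the script's actual values:

#!/bin/bash
# Sketch: checksum the entire content of each known bundle (data and resource forks)
# and record "checksum  /path/to/bundle" exactly as a normal file would be recorded.
TOP="/the/top/level/folder"
find "$TOP" -type d \( -name "*.pages" -o -name "*.key" -o -name "*.numbers" \) -prune -print0 |
while IFS= read -r -d '' bundle; do
    sum=$(find "$bundle" -type f -print0 |
          while IFS= read -r -d '' f; do
              cat "$f"                                # data fork
              cat "$f/..namedfork/rsrc" 2>/dev/null   # resource fork, if present
          done | md5)
    echo "$sum  $bundle"
done
# Note: find's traversal order has to match between two copies of a bundle
# for their total checksums to agree.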
The script could be streamlined so that the second loop for bundles is not necessary - files or bundles could be handled within the primary loop that finds files (essentially by testing whether the item to checksum is a file or a folder - and if it's a folder, then all its files are checksummed together.)
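A rough sketch of that file-or-folder test (ignoring resource forks for brevity; 'md5 -q' just suppresses the filename in the output):

checksum_item() {
    local item="$1"
    if [ -d "$item" ]; then
        # A bundle: checksum all of its files together as one blob.
        find "$item" -type f -exec cat {} + | md5
    else
        # A normal file: checksum it directly.
        md5 -q "$item"
    fi
}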
The difference in trash behavior could also be overcome, perhaps by using an osascript call to have System Events actually delete the file the way the Finder / GUI would (in which case the file/folder goes into the trash and the system handles any name conflict.)
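For example, a one-liner like this (it tells the Finder rather than System Events to do the delete, and the path is a placeholder) would let the system handle the name conflict:

# Sketch: a GUI-style delete - the item lands in the Trash and macOS resolves any name clash.
osascript -e 'tell application "Finder" to delete POSIX file "/path/to/duplicate"'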
The script could also be streamlined by eliminating the status output per file (but this takes us back to the point at which the user would not be able to tell what's happening during long delays.) If confidence in the script's selection of duplicates is high, then files could be removed (as in 'rm', poof gone) instead of moved to the trash.
There is a way to move files to the top-level folder... find /the/top/level/folder -mindepth 2 -type f -exec mv -n '{}' /the/top/level/folder \; comes to mind. That would be disastrous with bundles/packages though since all the files within the bundle would be moved into the top level folder and thus the bundle would be useless. In other words, don't do this if your data contains any bundles.
Another feature which might be nice would be to eliminate the hidden Desktop Services Store files (.DS_Store), either prior to checksumming (because their content will affect the checksum of bundles - two identical bundles won't match if one is left in Icon view and the other is switched to List view, for instance) or after completion of all the other script steps (in preparation for removing empty folders.) Then, after the .DS_Store files are gone, the command 'find -d /the/top/level/folder -type d -empty -delete' would remove (as in poof gone) all folders which are completely empty.
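Something along these lines would do it (placeholder path again; note that both commands remove things permanently rather than moving them to the trash):

# Sketch: strip the hidden .DS_Store files, then delete any folders left completely empty.
find /the/top/level/folder -name .DS_Store -type f -delete
find -d /the/top/level/folder -type d -empty -delete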
In regards to checksums, they are used to determine whether two files' content is identical without actually having to check each and every byte of data in both files. So the question should really be "Is it necessary to confirm that two files have identical content before declaring one a duplicate and trashing it?" In my opinion, YES! The file's directory metadata (date, size, etc.) could be altered in many ways whether or not the file content is, and vice versa.
Imagine a file named test.txt with one line of text in it:
Hello
Opened, edited and saved with the altered text:
Howdy
Both are 5 bytes, yet they are not duplicates. Normally the file system would show a different date/time for the second file, but it is possible to use the command 'touch' to alter the date and time of either one to match the other. It's also possible to experience directory corruption which could result in the dates being lost (and possibly the names as well.) Only a comparison of the actual content of the two files can be relied upon to indicate that they have the same content. The problem is that if the files are very large (a full-length movie, for instance) then a byte-by-byte comparison would take a very long time. To speed up the process, a checksum can be calculated for each file and then compared. There are different methods of generating checksums. MD5 is slower than CRC and faster than SHA1. Research experiments have shown that it is possible for md5 to experience a collision (where two files with -different- content have the same checksum); however, I've never seen an md5 collision occur in the real world (and as far as I know, neither has anyone else.) So I used md5 checksums for the file-content comparison because it is more reliable than crc and faster than sha1 (but those and other options are certainly available.)
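To make that concrete with the two 5-byte files above (the names are just examples):

# Sketch: give both files the same date/time, then let the checksums settle it.
printf 'Hello' > test.txt
printf 'Howdy' > test2.txt
touch -r test.txt test2.txt    # copy test.txt's date/time onto test2.txt
ls -l test.txt test2.txt       # same size, same date - they look like duplicates
md5 -q test.txt test2.txt      # two different digests, so the content differs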
The time needed to generate the checksum is dependent on the size of the data being checksummed.
As Mike Bombich pointed out in the thread to which you linked, resource forks are another issue, and since a resource fork can be modified without the file date being changed, it becomes more important to check the content than the date -IF- you are concerned about the content of the resource forks (custom icons, attributes such as whether a folder is a package/bundle, the retained 'Always open with...' program selection for the file, Finder comments, etc.)
There is another level of metadata - ownership, permissions, and ACLs (Access Control Lists) - which we really haven't bothered with at all (most likely because you are the sole user of your computer, and thus all your documents are 'owned' by your user account.) Some other user might prefer that two otherwise identical documents, one with customized permissions and one without, be handled differently.
It would also be possible to replace duplicate files with a Finder alias or a symlink (similar to an alias), so that all the files would appear to still be exactly where they had been in the archive, but identical duplicates would be removed and merely linked to one copy of the file, thus saving space. This could be desirable or undesirable depending on one's point of view (editing the document from within two different folders is editing the same document, not two copies - which could be a problem for some users.)
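If someone wanted that behavior, the replacement step for a plain file might look roughly like this (paths are placeholders; a true Finder alias would need an osascript call instead of 'ln -s'):

# Sketch: remove the duplicate and leave a symlink pointing at the kept copy,
# so the duplicate's old location still appears to contain the document.
rm "/archive/folder B/report.txt" &&
ln -s "/archive/folder A/report.txt" "/archive/folder B/report.txt"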
No worries on the timing of your responses, I'm fairly patient and typically have a long attention span. 🙂