A new revision which now handles -known- bundles/packages properly (it will, for instance, deal with duplicate .pages files and their duplicated names.) The list of what counts as a known bundle is within the script content (which can be viewed in TextEdit.) It will continue to ignore unknown bundle types.
http://www.mediafire.com/?zzriygz0oyq
The method I used to accommodate bundles was to handle them separately from normal files (which means additional time, of course, due to a secondary loop.) Bundles are found with 'find', and then the entire content of both forks of every file within each bundle is piped to md5 for a total checksum of the bundle's content. The bundle's path (the path to the top folder of the bundle only) and the checksum are then added to the list as though it were a normal file. The bundle's checksum and name are then compared in the same manner as any other file, and the bundle is renamed or trashed if needed, with one difference in how trashing is handled. Because a bundle is a directory, the move command 'mv' won't overwrite an existing, non-empty directory in the trash, so the script moves it into the trash and, if a name conflict occurs, moves it to the trash with a random number as a name (only when the checksum is an identical match, of course.)
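For illustration only, here is a minimal sketch of that bundle pass - the top-level path and the bundle-extension list are placeholders, not the script's actual values:

#!/bin/bash
# Sketch: checksum the entire content of each known bundle (data and resource forks)
# and record "checksum  /path/to/bundle" exactly as a normal file would be recorded.
TOP="/the/top/level/folder"
find "$TOP" -type d \( -name "*.pages" -o -name "*.key" -o -name "*.numbers" \) -prune -print0 |
while IFS= read -r -d '' bundle; do
    sum=$(find "$bundle" -type f -print0 |
          while IFS= read -r -d '' f; do
              cat "$f"                                # data fork
              cat "$f/..namedfork/rsrc" 2>/dev/null   # resource fork, if present
          done | md5)
    echo "$sum  $bundle"
done
# Note: find's traversal order has to match between two copies of a bundle
# for their total checksums to agree.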
The script could be streamlined so that the second loop for bundles is not necessary - files or bundles could be handled within the primary loop that finds files (essentially by testing whether the item to checksum is a file or a folder - and if it's a folder, then all its files are checksummed together.)
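A rough sketch of that file-or-folder test (ignoring resource forks for brevity; 'md5 -q' just suppresses the filename in the output):

checksum_item() {
    local item="$1"
    if [ -d "$item" ]; then
        # A bundle: checksum all of its files together as one blob.
        find "$item" -type f -exec cat {} + | md5
    else
        # A normal file: checksum it directly.
        md5 -q "$item"
    fi
}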
The difference in trash behavior could also be overcome, perhaps by using an osascript call to have System Events actually delete the file the way the Finder / GUI would (in which case the file/folder goes into the trash and the system handles any name conflict.)
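For example, a one-liner like this (it tells the Finder rather than System Events to do the delete, and the path is a placeholder) would let the system handle the name conflict:

# Sketch: a GUI-style delete - the item lands in the Trash and macOS resolves any name clash.
osascript -e 'tell application "Finder" to delete POSIX file "/path/to/duplicate"'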
The script could also be streamlined by eliminating the status output per file (but this takes us back to the point at which the user would not be able to tell what's happening during long delays.) If confidence in the script's selection of duplicates is high, then files could be removed (as in 'rm', poof gone) instead of moved to the trash.
There is a way to move files to the top-level folder... find /the/top/level/folder -mindepth 2 -type f -exec mv -n '{}' /the/top/level/folder \; comes to mind. That would be disastrous with bundles/packages though since all the files within the bundle would be moved into the top level folder and thus the bundle would be useless. In other words, don't do this if your data contains any bundles.
Another feature which might be nice would be to eliminate the hidden Desktop Services Store files (.DS_Store), either prior to checksumming (because their content will affect the checksum of bundles - two identical bundles won't match if one is left in Icon view and the other is switched to List view, for instance) or after completion of all the other script steps (in preparation for removing empty folders.) Then, after the .DS_Store files are gone, the command 'find -d /the/top/level/folder -type d -empty -delete' would remove (as in poof gone) all folders which are completely empty.
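Something along these lines would do it (placeholder path again; note that both commands remove things permanently rather than moving them to the trash):

# Sketch: strip the hidden .DS_Store files, then delete any folders left completely empty.
find /the/top/level/folder -name .DS_Store -type f -delete
find -d /the/top/level/folder -type d -empty -delete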
In regards to checksums, they are used to determine whether two files' content is identical without actually having to check each and every byte of data in both files. So the question should really be "Is it necessary to confirm that two files have identical content before declaring one a duplicate and trashing it?" In my opinion, YES! The file's directory metadata (date, size, etc.) could be altered in many ways whether or not the file content is, and vice versa.
Imagine a file named test.txt with one line of text in it:
Hello
Opened, edited and saved with the altered text:
Howdy
Both are 5 bytes, yet they are not duplicates. Normally the file system would show a different date/time for the second file, but it is possible to use the command 'touch' to alter the date and time of either one to match the other. It's also possible to experience directory corruption which could result in the dates being lost (and possibly the names as well.) Only a comparison of the actual content of the two files can be relied upon to indicate that they have the same content. The problem is that if the files are very large (a full-length movie, for instance) then a byte-by-byte comparison would take a very long time. To speed up the process, a checksum can be calculated for each file and then compared. There are different methods of generating checksums. MD5 is slower than CRC and faster than SHA1. Research experiments have shown that it is possible for md5 to experience a collision (where two files with -different- content have the same checksum); however, I've never seen an md5 collision occur in the real world (and as far as I know, neither has anyone else.) So I used md5 checksums for the file-content comparison because it is more reliable than crc and faster than sha1 (but those and other options are certainly available.)
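To make that concrete with the two 5-byte files above (the names are just examples):

# Sketch: give both files the same date/time, then let the checksums settle it.
printf 'Hello' > test.txt
printf 'Howdy' > test2.txt
touch -r test.txt test2.txt    # copy test.txt's date/time onto test2.txt
ls -l test.txt test2.txt       # same size, same date - they look like duplicates
md5 -q test.txt test2.txt      # two different digests, so the content differs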
The time needed to generate the checksum is dependent on the size of the data being checksummed.
As Mike Bombich pointed out in the thread to which you linked, resource forks are another issue, and since a resource fork can be modified without the file date being changed, it becomes more important to check the content than the date -IF- you are concerned about the content of the resource forks (custom icons, attributes such as whether a folder is a package/bundle, the retained 'Always open with...' program selection for the file, Finder comments, etc.)
There is another level of metadata - ownership, permissions, and ACLs (Access Control Lists) - which we really haven't bothered with at all (most likely because you are the sole user of your computer, and thus all your documents are 'owned' by your user account.) Some other user might prefer that two otherwise identical documents, one with customized permissions and one without, be handled differently.
It would also be possible to replace duplicate files with a Finder alias or a symlink (similar to an alias), so that all the files would appear to still be exactly where they had been in the archive, but identical duplicates would be removed and merely linked to one copy of the file, thus saving space. This could be desirable or undesirable depending on one's point of view (editing the document from within two different folders is editing the same document, not two copies - which could be a problem for some users.)
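If someone wanted that behavior, the replacement step for a plain file might look roughly like this (paths are placeholders; a true Finder alias would need an osascript call instead of 'ln -s'):

# Sketch: remove the duplicate and leave a symlink pointing at the kept copy,
# so the duplicate's old location still appears to contain the document.
rm "/archive/folder B/report.txt" &&
ln -s "/archive/folder A/report.txt" "/archive/folder B/report.txt"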
No worries on the timing of your responses, I'm fairly patient and typically have a long attention span. 🙂