
Can I copy files – but with certain restrictions?

I have just finished a large project, the archives of which involve about 5000 "base" files, stored on about 80 CDs and 50 DVDs, involving about 50,000 files in total. Each of the "base" files may have had up to 30 incremental versions; i.e. a certain text file may have undergone revision 23 times, and each revision was saved and archived to (probably) a different disk, with a different suffix – a, b, c and so on. But sometimes the suffix didn't change even though the file was edited. I might have done a bit more dust removal on an image and just overwritten the old file (already archived), and so the new one was archived on a different disk.

I now have 130 disks from which I would like to extract all the files and collapse them into one large archive that will probably span about 20 disks by the time I delete some files I don't need. That way I can easily search for all versions of, say, GB097 by going to the particular DVD that has the "G" files on it. Up would come:

GB097
GB097a
GB097b
GB097b-1
GB097b-2
GB097c
... and so on.

This is what I would like to do:

1. Grab the first archive disk, open every folder, and copy all the files to the one folder on a hard drive.

2. Open the second disk and repeat step (1), but with these two provisos.

(a) If a file is identical to a previously copied file (maybe I archived it twice), the file isn't copied. However...

(b) If a file has the same name as a previously copied file, but the data within that file is different (i.e. I removed some dust from an image file, but left the name unchanged), I'd like that file to be copied with a numbered suffix, the same way that Trash treats identically named files.

Any suggestions how I could do this?

G5 iSight, Mac OS X (10.4.11)

Posted on May 18, 2010 3:27 AM


May 26, 2010 3:24 AM in response to Guy Burns

I ran the script over a backup copy of my hard-disk Archive folder (10,958 files, 59 GB), and I ended up with 9598 files after 5414 seconds. So I must have 1360 duplicate files in my Archive folder. But I'm not game enough to run the script on original files yet, so as to get rid of those duplicates. First I'd like to see what was removed so that I can randomly check a few to make sure they are really duplicates.

Q: Is it possible from the Terminal output, or the Index_Archive.txt file, to find out what files were removed? In a previous post you said:

+The index file... is there for you in the event that you discover a file is missing because it was removed as a duplicate; you can open the Index_Archive.txt and command-f to find the file by name, copy its checksum and command-f command-v to find what other file had the same content.+

So I'd have to know which files were removed before I could find out which files were removed -- if I understand the explanation correctly. I don't understand the usefulness of Index_Archive.txt, but maybe I'm missing something.

It might be a good idea if the script generated a list of removed files. That way a user could double check the script's operation -- at least initially, until confidence in the script was gained.

One more thing: I think Terminal ran out of window space, because when I scrolled back to the top, the readout starts at "Checking name for...".

May 26, 2010 3:35 AM in response to Guy Burns

I just ran SilverKeeper as an inverse to consolidate14 for my Archive folder: to put back all the removed files, and to remove the files that consolidate14 renamed. Things are quick when a program doesn't bother with checksums. SK took a little more than a minute to get everything back the way it was, ready for me to run more tests.

... at least I hope everything is back the way it was.

Message was edited by: Guy Burns

May 26, 2010 10:26 AM in response to Guy Burns

Terminal defaults to a 10,000-line buffer, after which lines begin "scrolling off" the top and are lost. To increase the buffer, pull down the Terminal menu on the menu bar at the top of the screen, select "Window Settings...", choose "Buffer" from the pop-up, and then choose either unlimited or a higher limit. Click the "Use settings as default" button to retain that preference for all new Terminal windows.

The index file's purpose...
Suppose you had a project file, for instance "Project-A.txt" and you duplicated the file and renamed the duplicate to "Project-Q.txt". For whatever reason, the file is never changed. When the script runs, it determines Q to be a duplicate of A and moves Q to the trash, as it should. Later, you look for "Project-Q.txt" and realize that it is missing. By opening the Index_Archive.txt file, you can search for the name "Project-Q.txt" and see the corresponding checksum value. Search in the file again for that checksum value and you will see which other file(s) had identical content... in this example you would find "Project-A.txt" to have the same checksum. If you wish to recreate the missing Q file, you can do so by duplicating the A file again.
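
If you prefer to do that lookup from Terminal rather than by hand, grep works too. A rough example (the index path and checksum shown here are only illustrative; your index will show whatever checksums your files actually have):

grep "Project-Q.txt" /path/to/Index_Archive.txt
3a7bd3e2360a3d29eea436fcfb7e44c3 /path/to/Project-Q.txt
grep "3a7bd3e2360a3d29eea436fcfb7e44c3" /path/to/Index_Archive.txt
3a7bd3e2360a3d29eea436fcfb7e44c3 /path/to/Project-A.txt
3a7bd3e2360a3d29eea436fcfb7e44c3 /path/to/Project-Q.txt

The first grep shows the checksum recorded for the missing file; the second shows every file that had identical content.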

Consolidate 15
Now outputs a list of trashed files.
http://www.mediafire.com/?3y2zxjjxbgg

You could use du (disk usage) to check the total size of the folder before and after the script, which might show why SilverKeeper is able to restore the deleted duplicate files so quickly... Use this command in Terminal before and after the script (replacing the path /drag/folder/to/window with the path to your folder):

du -chs /drag/folder/to/window
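
For example (the path and sizes shown are only illustrative):

du -chs /Volumes/Data/Archive
 59G    /Volumes/Data/Archive
 59G    total

If the two totals differ only by the size of the trashed duplicates, SilverKeeper has relatively little data to copy back, which would explain the speed.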

Message was edited by: JulieJulieJulie

May 26, 2010 10:46 AM in response to Guy Burns

Guy Burns wrote:
I ran the script over a backup copy of my hard-disk Archive folder (10,958 files, 59 GB), and I ended up with 9598 files after 5414 seconds. So I must have 1360 duplicate files in my Archive folder. ...


Unless the total includes folders. It may also include invisible .DS_Store files, which the script removes as well, so the actual number of duplicates may be lower.

May 26, 2010 12:45 PM in response to JulieJulieJulie

Consolidate16.command.zip
http://www.mediafire.com/?znztznhrw1z

Now defaults to regrouping map files that are present in the top-level folder only (not in subfolders) into new folders based on the map filename. First, .frq files are found and their filenames used as the basis for new folders, into which files sharing the same base portion of the .frq filename are moved. Second, the same process is repeated for .shp files. This allows grouped sets with .frq files to be moved into one folder, and then individual sets without .frq files to be moved into folders as well.

New map folders are created with ".map" at the end of the folder name (so that during further runs of the script on those folders, they will be treated as bundles.)

This option can be disabled on the command line via -m. If -m is not used on the command line, the script will query the user as to whether to disable this option.
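
Roughly, the grouping step works along these lines (a simplified sketch only, not the script's exact code; the folder path is hypothetical):

cd /path/to/top-level/folder
for frq in *.frq; do
    [ -e "$frq" ] || continue              # no .frq files at all: nothing to group
    base="${frq%.frq}"                     # filename without the .frq extension
    mkdir -p "$base.map"                   # .map suffix so later runs treat it as a bundle
    for f in "$base".*; do                 # wild-card match on any extension
        [ "$f" = "$base.map" ] && continue # don't move the new folder into itself
        mv "$f" "$base.map/"
    done
done

The same pass is then repeated for *.shp files, which picks up the sets that have no .frq file.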

May 26, 2010 2:35 PM in response to Guy Burns

One more thing: I think Terminal ran out of window space, because when I scrolled back to the top, the readout starts at "Checking name for...".


You could try iTerm instead of Terminal.
http://iterm.sourceforge.net/

The script command in Terminal will copy most output from the terminal session to a file.

for example:
mac $ script ~/Desktop/myOut
Script started, output file is /Users/mac/Desktop/myOut
in /etc/bashrc
in /bin/bash actually .bashrc
in /Users/mac/config/profile.bash claims to be /bin/bash
mac $ pwd
/Users/mac/Desktop
mac $ exit
exit
Script done, output file is /Users/mac/Desktop/myOut
mac $ cat ~/Desktop/myOut
Script started on Wed May 26 17:31:09 2010
in /etc/bashrc
in /bin/bash actually .bashrc
in /Users/mac/config/profile.bash claims to be /bin/bash
mac $ pwd
/Users/mac/Desktop
mac $ exit
exit

Script done on Wed May 26 17:31:22 2010
mac $

May 26, 2010 9:25 PM in response to Guy Burns

*Files Removed*
I used EasyFind to find the number of files before and after. Folders and invisibles were not included in the listing, so I have ~1300 duplicated files in that particular archive. (EasyFind was recommended by BDAqua in my other post about finding all files within a folder, for manual moving to the top folder.)

Testing done with folders kept intact.

*Times to completion (consolidate15)*
1 x 18 files ... 3 sec ( +Test x1+, 6 removed)
8 x 18 files ... 19 secs ( +Test x8+, 132 removed)
32 x 18 files ... 75 secs ( +Test x32+, 564 removed)
1 x 400 files ... 27 secs ( Transcripts, 0 removed)

*Times to completion (consolidate16)*
1 x 18 files ... 3 sec ( +Test x1+, 6 removed)
8 x 18 files ... 23 secs ( +Test x8+, 132 removed)
32 x 18 files ... 83 secs ( +Test x32+, 564 removed)
1 x 400 files ... 28 secs ( Transcripts, 0 removed)

Consolidate16 is slightly slower than consolidate15.

*Trashed List*
Problem: When the script is run, this new file appears in the Test folders but not in the Transcripts folder -- I suppose because no files were removed. Still, I think the trashed list should be there with a comment about no items being removed (see suggestion 1).

Suggestions:
1. At the start (or end) of the trashed list it would be good to have a comment: "Total number of items trashed = ..."

2. The trashed list and index can be hard to find amongst thousands of other files. I suggest they have a "ZZ" at the start of their names (or some other method) to force them to the end of the list of files (or the start).

*Mapping Data*
It's probably not worth spending much time trying to accommodate mapping data because it is more complicated than I indicated. And I don't intend using the data again, so I'll archive my mapping folder as is. QGIS was a pain to use -- full of bugs at that stage -- and I couldn't even open GRASS, not even when I tried using OPENGRASS. The only piece of software I've encountered that I couldn't open.

With the version of QGIS I used, if you moved, renamed, spoke to it in the wrong tone of voice, or did basically anything to the folders or their contents, it asked you to find the moved files. And then it would often crash anyway.

Plus, I only told half the story. There is a whole range of other files -- raster data -- which I didn't use (I didn't know how to use them), but the raster data is there amongst the mapping data I have. This is a list of some of the file types I saw when I looked inside the mapping folder: hdr, dbf, sbn, shx, sbx, met, bil, aux, shp...

There are about 1000 files in the 1:1,000,000 map of Australia; then I have the 2.5M, 5M and 10M maps, and quite a number of the 250K maps, plus a small amount of data I generated. Hundreds of subfolders full of thousands of files. Overall, I don't think it's worth trying to accommodate mapping data, and besides, I wouldn't know how to test that the relocated files are still functional.

May 27, 2010 2:17 AM in response to Guy Burns

The time will vary with the number of files, number of duplicates, number of duplicate names, size of files, disk throughput, processor load and options selected for the script.

In Consolidate17, I changed the trashed-list and archive-index filenames to be preceded by two spaces, which should sort them first alphabetically.

The mapping data regrouping is done based on the first portion of the .frq and .shp filenames (before the period and filename extension). So, if all the grouped files have the same beginning name and different extensions, the script moves them into their own folder. Wild-card matching is used for the extensions, so it should still find all the types you listed.

Consolidate17
http://www.mediafire.com/?yn525azwnoo

May 27, 2010 3:14 AM in response to Guy Burns

This script testing is a complicated business. I've had to go back to consolidate14 because consolidate15 failed to get through the largest of my test folders -- the 59 GB folder. Twice it stopped running during checksumming, after more than an hour. When the Bash %CPU in +Activity Monitor+ dropped to zero for more than 10 minutes, I assumed the script had stopped.

I'm running consolidate14 at the moment to see if it gets through the 59 GB.

May 27, 2010 4:50 AM in response to Guy Burns

Consolidate14 stopped at the same file as consolidate15. The Terminal line was:

+Checking checksum for: /Volumes/G5-Data/Backup To G5-Data/C-Data/AA Archives//Original Photos & Sound/Sound/AA Not yet uploaded/LordB/LordB2002.mp3+

Given that consolidate14 didn't stop last time while testing the Archive folder, and given that the only change I have made is to tell Terminal to use "unlimited scrollback", I immediately suspected that the script stopped because of some connection with Terminal's default 10,000-line limit.

I don't know how to count lines in a Terminal or TextEdit document, so I estimated the number of lines in one window at about 70, jumped up window by window for 20 windows, and that looked to be approximately 1/6 of the total length. Rough guess at the number of lines = 70 x 20 x 6, close enough to 10,000 for me.

Is it possible that by setting "unlimited scrollback", the script stops at 10,000 lines?

The script definitely stopped. No movement in Terminal, and Activity Monitor hasn't shown any movement in either of the Bashes for 20 minutes. While running, up to four Bashes came into play, and a cut, md5 and cat here and there.

I'll run again, after resetting to 10,000 lines.

May 27, 2010 11:33 AM in response to Guy Burns

In Terminal, pull down the File menu on the menu bar at the top of the screen and select "Save Text As..." to save the output into a file. To count the lines, type 'wc ' (with a space after it; wc stands for word count), then drag the saved output file into the Terminal window, click back in the window, and press return. The wc output is the count of lines, words, and characters.
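
For example, if the saved file were on the Desktop (the filename and counts here are only illustrative):

wc /Users/mac/Desktop/saved-output.txt
    9874   65310  812345 /Users/mac/Desktop/saved-output.txt

That is 9874 lines, 65310 words and 812345 characters.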

I just added outputting the total count of filenames/checksums to Consolidate18 (so that we will know how many total files the script is dealing with.)

At the point at which the script appeared to have stopped, it would have been doing the following (a rough sketch follows the list):
1] Iterating through an array of filepaths and checksums. (essentially checking the value of each item in a list of many items.)
2] Checking whether the current file still exists at the filepath.
3] Comparing the checksum of the current file with the checksum of the next file and those after until there is no match.
4] If a matching checksum is found, moving the duplicate file to the trash; removing the filepath and checksum for that file from the array; incrementing the count of total trashed files.
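
In outline, that pass amounts to something like this (a simplified sketch only -- it works from a sorted copy of the index file rather than the script's actual array handling, and assumes each index line is a checksum followed by the file's path):

sort /path/to/Index_Archive.txt | while read -r sum path; do
    if [ "$sum" = "$prev" ] && [ -f "$path" ]; then
        echo "Trashing duplicate: $path"
        mv "$path" ~/.Trash/        # same checksum as a file already kept
    else
        prev="$sum"                 # first file seen with this checksum: keep it
    fi
done

The real script also removes each trashed file's entry from its array and keeps a running count of trashed files as it goes.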

I'm not sure what the problem is in this case.

Things to check:

Does this file still exist at the filepath: /Volumes/G5-Data/Backup To G5-Data/C-Data/AA Archives//Original Photos & Sound/Sound/AA Not yet uploaded/LordB/LordB2002.mp3 ?

Is there anything odd about the file "LordB2002.mp3"? For example: is its size zero bytes or strangely large? Are you able to select the file in Finder and then duplicate it (indicating that there is no read error when copying the file)? Is it actually a file, not a folder?

In the Index file, search for "LordB2002.mp3" and look at the checksum value for the file. Does anything look odd (for instance is the checksum value field blank or zero or something unusual)?

What about the checksum values of the lines after that file in the index - is the same checksum value showing up on many lines following that file?

Is there a file named "LordB2002.mp3" in the trash?

If you run the script on a folder containing only this file, does the script behave normally?



In your prior post, you stated "Twice it stopped running during checksumming, after more than an hour."

Does that mean while it is generating checksums, or while it is comparing checksums?

My guess would be that it hit a folder which has a period in its name and which contains either lots of small files or a few very large files (either case would slow the md5 utility down notably).

Can you locate the item it stopped on and determine whether it is a file or a folder, etc.?
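
Two quick checks from Terminal may also help (the path is the one from your post; you can drag the item into the window instead of typing it):

ls -ld "/Volumes/G5-Data/Backup To G5-Data/C-Data/AA Archives//Original Photos & Sound/Sound/AA Not yet uploaded/LordB/LordB2002.mp3"
time md5 "/Volumes/G5-Data/Backup To G5-Data/C-Data/AA Archives//Original Photos & Sound/Sound/AA Not yet uploaded/LordB/LordB2002.mp3"

ls -ld shows whether the item is a regular file (the listing starts with "-") or a folder (it starts with "d"), plus its size in bytes; time md5 shows how long a single checksum of that item actually takes.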

May 27, 2010 7:34 PM in response to Guy Burns

A new problem: I can't stop consolidate14 running. I opened it this morning, ready for a test run, but then changed my mind and thought I'd check the number of lines in a text file I saved of one of the aborted runs. So I tried using that "wc " command you told me about. "wc " failed several times until I realized that consolidate14 was still running.

1. I opened a new Terminal shell and this appeared:

+Consolidate14 script revised 2010-05-25 10:00 P.M.+

+Folder not specified. Type the path to the top-level folder or drag+
+it into this Terminal window, then click back in this window and press return:+

2. I typed Command-dot and this appeared:
^C
logout
+[Process completed]+

3. I opened a new shell and I was back to 1.

4. I tried New Command; was given a little window into which I typed cal 2010. That worked, so I thought I'd try "wc " in the New Command window, followed by a folder drag. This is what came back:

+wc: /Volumes/C-Data/Consolidate: open: No such file or directory+
+wc: Test: open: No such file or directory+
+wc: Files/Failed: open: No such file or directory+
+wc: consolidate14run.txt: open: No such file or directory+
+0 0 0 total+
+[Process exited - exit code 1]+

5. Tried 3 again, and I was back to 1.

6. Quit Terminal and reopened, and I was back to 1.

7. From Terminal>File I sent "break", "reset" and "hard reset". No effect.

I also tried typing in: kill, terminate, stop, end, please finish, go away...

I'm impressed with the tenacity of consolidate14. It's like that monster in Alien that couldn't be gotten rid of. No matter what you do it's always there.

I've almost convinced myself that if I reboot OSX, consolidate14 won't be there, but I have a suspicion that if I do so it'll be hiding in some dark recess of my computer to come and get me when I'm least expecting it. I never thought I'd be starring in my own horror movie: me against consolidate14.

Sigourney Weaver got rid of her little demon by tossing it into outer space. To save me restoring my OSX from a cloned disk image (my version of an outer-space toss), how do I stop consolidate14 from within OSX?

May 27, 2010 8:04 PM in response to Guy Burns

Check the Terminal menu on the menu bar at the top of the screen > Preferences... Is either "Execute this command..." or "Open a saved .term file" selected?

It appears that wc got an unescaped path; it tried to access:
/Volumes/C-Data/Consolidate
then it tried:
Test
then it tried:
Files/Failed
then it tried:
consolidate14run.txt

So it appears that the path should have been:
/Volumes/C-Data/Consolidate Test Files/Failed consolidate14run.txt

Normally that would be escaped like so when the file is dragged into the window:
/Volumes/C-Data/Consolidate\ Test\ Files/Failed\ consolidate14run.txt

The backslash character prior to each space character tells the system that the space is part of the filename, not a field separator indicating another filename.

The output you posted shows that the spaces weren't escaped and the whole path wasn't quoted either. If you copy and paste the path, quote it like this:
"/Volumes/C-Data/Consolidate Test Files/Failed consolidate14run.txt"

Otherwise just drag the file into the window and it will be escaped automatically like this:
/Volumes/C-Data/Consolidate\ Test\ Files/Failed\ consolidate14run.txt

(You can also paste the escaped path into Terminal, in which case do NOT quote it.)

May 27, 2010 9:51 PM in response to Guy Burns

Under Terminal > Preferences, "Execute the default login shell.." is selected.

Maybe I should rephrase the problem: Terminal won't do anything anymore (not even my cherished cal and +sudo rm -R+ ) because it always shows:

+Last login: Fri May 28 14:24:15 on ttyp1+
+/Volumes/C-Data/Consolidate\ Test\ Files/consolidate14.command; exit+
+Welcome to Darwin!+
+jenny-pearces-imac-g5:~ Jenny$ /Volumes/C-Data/Consolidate\ Test\ Files/consolidate14.command; exit+

+Consolidate14 script revised 2010-05-25 10:00 P.M.+

+Folder not specified. Type the path to the top-level folder or drag+
+it into this Terminal window, then click back in this window and press return:+

I can drag a folder into Terminal, the script runs and finishes, but Terminal won't accept commands like it used to.

I assume a reboot will fix the problem, but isn't there a simple way to tell Terminal to terminate whatever it is doing, and accept ordinary commands again? Then I can get back to testing consolidate14, after I run "wc ". But I can't run "wc " because Terminal is stuck in consolidate14.

Rebooting is inconvenient because I'm uploading files to Mediafire most of every day, and I'd rather not interrupt that process.

As for the file that the script stopped at: it's an hour-long mp3, a single file of about 35 MB. It can be opened and played, duplicated, renamed and moved, and there is a copy in the Trash, but I can't be sure how it got there.

Message was edited by: Guy Burns
