Can I copy files – but with certain restrictions?

Question

Level 2

367 points

I have just finished a large project whose archive consists of about 5000 "base" files, stored on about 80 CDs and 50 DVDs, and about 50,000 files in total. Each "base" file may have up to 30 incremental versions. For example, a certain text file may have undergone revision 23 times, with each revision saved and archived to (probably) a different disk under a different suffix: a, b, c and so on. But sometimes the suffix didn't change even though the file was edited: I might have done a bit more dust removal on an image and just overwritten the old file (already archived), so the new one was archived on a different disk.

I now have 130 disks from which I would like to extract all the files and collapse them to one large archive that will probably span about 20 disks by the time I delete some files not needed. That way I can easily search for all versions of, say, GB097, by going to the particular DVD that has the "G" files on it. Up would come:

GB097
GB097a
GB097b
GB097b-1
GB097b-2
GB097c
... and so on.

This is what I would like to do:

1. Grab the first archive disk, open every folder, and copy all the files to the one folder on a hard drive.

2. Open the second disk and repeat step (1), but with these two provisos.

(a) If a file is identical to a previously copied file (maybe I archived it twice), the file isn't copied. However...

(b) If a file has the same name as a previously copied file, but the data within that file is different (e.g. I removed some dust from an image file, but left the name unchanged), I'd like that file to be copied with a numbered suffix, the same way that Trash treats identically named files.

Any suggestions how I could do this?

G5 iSight, Mac OS X (10.4.11)

Posted on May 18, 2010 3:27 AM

Answer 1

BDAqua

Level 10

239,369 points

May 18, 2010 5:40 AM in response to Guy Burns

For me, 2a would be easy with Tri-Backup, in three different ways...

http://www.tri-edre.com/english/tribackup.html

2b, it seems, might require Automator and AppleScript.

Answer 2

May 18, 2010 5:12 PM in response to Guy Burns

I wrote a bash script to handle duplicate files and duplicate filenames...

http://pastebin.com/xGGGhzZ9

To use it, copy each CD or DVD into a single master folder. Then run the script (from Terminal) and supply the path to the master folder.

It will generate a list of the files with their md5 checksums. Next it will compare each checksum against all others in the list and if duplicates are found, they are moved to the trash can (with no concern for overwrites.) Next it compares the name of each file with all others in the list and if duplicate names are found, #XX is appended to the filename prior to the extension. If the new filename already exists at that path then the number is incremented (photo #1.jpg, photo #2.jpg and so on.)
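The checksum step described above can be sketched roughly as follows. This is not the pastebin script itself, only a minimal illustration under stated assumptions: the function names are made up, it picks md5 or md5sum depending on which is installed, and it assumes file paths contain no tab characters.

```shell
#!/bin/bash
# Hypothetical helper: print only the MD5 hex digest of one file.
# Mac OS X ships `md5`; most Linux systems ship `md5sum`.
checksum() {
    if command -v md5 >/dev/null 2>&1; then
        md5 < "$1"
    else
        md5sum < "$1" | cut -d' ' -f1
    fi
}

# Print one "checksum<TAB>path" line for every non-hidden file under a folder.
list_checksums() {
    find "$1" -type f ! -name ".*" | while IFS= read -r path; do
        printf '%s\t%s\n' "$(checksum "$path")" "$path"
    done
}

# Print the paths whose checksum already appeared earlier in the sorted list,
# i.e. the files a de-duplication pass would consider redundant.
list_duplicates() {
    list_checksums "$1" | sort | awk -F'\t' 'seen[$1]++ { print $2 }'
}
```

Running list_duplicates ~/Desktop/pixtest would only print candidate duplicates; nothing is moved or deleted, which makes it a safe way to preview what a real run might trash.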

Message was edited by: JulieJulieJulie

Answer 3

Guy Burns Author

Level 2

367 points

May 18, 2010 6:33 PM in response to JulieJulieJulie

Thanks for the bash script. I'll be able to give it a thorough testing if I can get it working. I've never heard of bash, and I can only do two things with Terminal – get a calendar and remove stubborn files via "sudo rm -R ". Anything else is generally beyond me. Really. But I'll give it a go.

When I clicked on "Download" on that web page you gave, I ended up with a text file called xGGGhzZ9.txt. I pasted the contents in Terminal and it went silly. Gave about 20 "pings" and then stopped.

Two questions:

1. How do I run this script?

2. What is the easiest way to find the path of a folder? I think I can do so by dragging a folder into the "Compare" utility window of Toast which comes up with something like "/Volumes/C-Data/AA Archives" for an existing folder I have. Is that what is needed?

I have done a bit of research into running scripts, and one explanation I found (I still don't follow any of it) is listed below. If this is what it takes to run a bash script, I'll have to give it a miss!

+To run a non-executable bash script, use: bash myscript+

+To start an executable (which is any file with executable permission); you just specify it by its path:+

/foo/bar
/bin/bar
./bar

+To make a script executable, give it the necessary permission:+

+chmod +x bar+
./bar

+When a file is executable, the kernel is responsible for figuring out how to execute it. For non-binaries, this is done by looking at the first line of the file. It should contain a hashbang:+

+#! /usr/bin/env bash+

+The hashbang tells the kernel what program to run (in this case the command /usr/bin/env is run with the argument bash). Then, the script is passed to the program (as second argument) along with all the arguments you gave the script as subsequent arguments.+

+That means every script that is executable should have a hashbang. If it doesn't, you're not telling the kernel what it is, and therefore the kernel doesn't know what program to use to interpret it. It could be bash, perl, python, sh, or something else. (In reality, the kernel will often use the user's default shell to interpret the file, which is very dangerous because it might not be the right interpreter at all or it might be able to parse some of it but with subtle behavioural differences such as is the case between sh and bash).+
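The quoted steps can be tried on a throwaway file before touching a real script. A minimal sketch, assuming bash lives at /bin/bash; the file name hello.sh and the temporary folder are made up for illustration.

```shell
#!/bin/bash
# Create a throwaway script in a temporary folder (hypothetical name: hello.sh).
workdir=$(mktemp -d)
script="$workdir/hello.sh"

# The first line written to the new file is the hashbang naming the interpreter.
printf '%s\n' '#!/bin/bash' 'echo "Hello from $0 with argument: $1"' > "$script"

# The file is not executable yet, so name the interpreter explicitly.
bash "$script" first

# Give it execute permission, then run it by its path alone.
chmod +x "$script"
"$script" second
```

Both runs print the same kind of greeting; the only difference is whether bash is named on the command line or found through the hashbang.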

Answer 4

rccharles

Level 6

12,957 points

May 18, 2010 8:53 PM in response to Guy Burns

What is the easiest way to find the path of a folder? I think I can do so by dragging a folder into the "Compare" utility window of Toast which comes up with something like "/Volumes/C-Data/AA Archives" for an existing folder I have. Is that what is needed?

Drag folder icon to terminal window & drop. Path appears.

Robert

Answer 5

rccharles

Level 6

12,957 points

May 18, 2010 9:19 PM in response to rccharles

Be sure to get the second post of the script ( I guess. )

I did a copy of the second post. Download did not work for me.

Macintosh-HD -> Applications -> Utilities -> Terminal
pico comp.bash

paste into pico command + v

control + x
y
press return

bash comp.bash ~/Desktop/see

I am not sure that I have the input correct.

Robert

------------

# Will show you what Unix thinks the file is.
# the leading - means a file
# the leading d means directory
mac $ ls -lF /Volumes/COPYIT/
total 38080
...
-rwxrwxrwx 1 mac staff 26698 Oct 12 14:59 ae.jpeg*
-rwxrwxrwx 1 mac staff 73694 Oct 12 14:30 alltiff*
drwxrwxrwx 1 mac staff 4096 Apr 11 2007 answers/
...
# l is long
# F is type of file where / is directory
# a is all as in ls -laF
mac $ ls -lF /Volumes/COPYIT/answers/
total 80
-rwxrwxrwx 1 mac staff 920 Apr 22 2007 FINDER.DAT*
-rwxrwxrwx 1 mac staff 2797 Mar 20 2007 Fix Vista MBR.html*
...

#cd is change directory
mac $ cd /Volumes/COPYIT/answers/
# pwd is print working directory
mac $ pwd
/Volumes/COPYIT/answers
You can rename a file by doing
mv oldname newname

cp is copy
see
man cp
to copy a directory example:
cd
cd ~/Desktop
pwd
cp -R /Volumes/COPYIT/answers newanswers

Answer 6

Guy Burns Author

Level 2

367 points

May 18, 2010 9:54 PM in response to rccharles

Thanks for the suggestions, Robert. I got as far as this line:

bash comp.bash ~/Desktop/see

and came to a halt. Now when I try and go through the steps again, the listing of the script is already in Terminal and I can't get rid of it to allow me to try again.

Maybe JJJ can tell us if the second script works and can give some instructions on how to run it.

Answer 7

May 19, 2010 11:01 AM in response to Guy Burns

Copy and paste the script beginning with the hash-bang line (#!/bin/bash) and ending with the exit 0 line, into a TextEdit window. Pull down the Format menu on the menu bar at the top of the screen and select "Make Plain Text" if the option is available (if not, then it is already plain text.) Save the file as dupeless.txt for instance (the name doesn't really matter) on the Desktop.

---Or---
If you prefer to download the script rather than copy and paste via TextEdit then the line endings will be wrong (returns rather than newlines). To fix the issue run Terminal and enter:
tr "\r" "\n" < drag-the-file-to-the-window > ~/Desktop/dupeless.txt
-------
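To see what that tr command does, here is a small self-contained demonstration. The file names are made up, and it assumes the download uses bare carriage returns (old Mac style) as described above; a file with Windows-style line endings would need the carriage returns deleted instead (tr -d "\r").

```shell
#!/bin/bash
# Build a sample file that uses carriage returns instead of newlines.
workdir=$(mktemp -d)
printf '#!/bin/bash\recho one\recho two\r' > "$workdir/downloaded.txt"

# Translate every carriage return into a newline, as in the command above.
tr "\r" "\n" < "$workdir/downloaded.txt" > "$workdir/fixed.txt"

# wc -l counts newline characters, so the two files report different totals.
wc -l < "$workdir/downloaded.txt"
wc -l < "$workdir/fixed.txt"
```

Before the conversion bash sees the whole file as one long line, which is why a script with the wrong line endings fails in confusing ways.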

To set the execute bit on the file (which tells the system that the file is allowed to run as a program) in Terminal type...
chmod +x ~/Desktop/dupeless.txt

To run the script, drag it into a terminal window and the path to the script will be entered. Click back in the Terminal window and type a space and then drag in the folder to be scanned for duplicates. Click back in the terminal window and press return.

As an example, the command line would look like this:
/Users/julie/Desktop/dupeless.txt ~/Desktop/pixtest

Answer 8

rccharles

Level 6

12,957 points

May 19, 2010 1:14 PM in response to Guy Burns

I'd follow Julie's excellent post.

Before running the script, empty the trash can. Run the script. Look in the trash can to see what has been removed.

bash comp.bash ~/Desktop/see

and came to a halt.

You may not see any messages.

No news is good news. Many terminal programs do not give any printed output. When you see the next prompt the program has completed successfully.

Now when I try and go through the steps again, the listing of the script is already in Terminal and I can't get rid of it to allow me to try again.

The script has already been installed. There is no need to copy it again.

Robert

Answer 9

Guy Burns Author

Level 2

367 points

May 19, 2010 6:56 PM in response to Guy Burns

Thanks for the help, but on my system it is no-go. This episode has turned from a timesaver to a challenge, but at least I'll know a little about bash when I've finished.

I have uploaded all files connected with what I did to: http://www.mediafire.com/?nn2i2iwjijz

Anyone interested in helping out may have to download a zip folder available at the link.

This is what I did:

1. I went to http://pastebin.com/xGGGhzZ9, clicked in the second listing, selected all, and copied.

2. I opened a new iText Express document, pasted text, converted to plain text, and saved as BashTest on Desktop.

3. Duplicated BashTest to keep the original intact. Renamed the original +BashTest (original)+. Renamed +BashTest copy+ (the duplicate) to BashTest – the one I am going to test.

4. I made a test folder called Duplicates with a text file inside called "A"; and I created two more folders alongside "A": one called +A (changed)+ that has an altered version of "A" inside; and one called +A (identical)+ that has "A" inside.

5. I duplicated the entire folder and got a new folder called +Duplicates copy+, the one I was about to do the tests on.

6. The Terminal window was already full of stuff, so I selected "New Shell".

7. In Terminal I typed: "chmod +x " (including the space) and then dragged BashTest from the Desktop into the Terminal window (to make sure I had the correct file path), and then pressed Return.

8. I again dragged BashTest from the Desktop to the Terminal window, typed one space, then dragged the folder +Duplicates copy+ to the Terminal window, then pressed Return.

9. I checked the Trash. One copy of "A" appeared in the trash.

10. I checked +Duplicates copy+. Both identical copies of "A" were gone which is not what should happen.

QUES: Where did I go wrong?

*Terminal Listing*
Last login: Thu May 20 10:52:44 on ttyp1
Welcome to Darwin!
jenny-pearces-imac-g5:~ Jenny$ chmod +x /Users/Jenny/Desktop/BashTest.txt

jenny-pearces-imac-g5:~ Jenny$ /Users/Jenny/Desktop/BashTest.txt /Users/Jenny/Desktop/Duplicates\ copy/

/Users/Jenny/Desktop/BashTest.txt: line 1: md5sum: command not found
cat: /Users/Jenny/Desktop/Duplicates copy//A (changed)/A.rtf: Bad file descriptor
/Users/Jenny/Desktop/BashTest.txt: line 1: md5sum: command not found
/Users/Jenny/Desktop/BashTest.txt: line 1: md5sum: command not found
jenny-pearces-imac-g5:~ Jenny$

Answer 10

May 19, 2010 7:56 PM in response to Guy Burns

Oops, sorry md5sum is not included with OS X. Change one line to use 'md5' instead of 'md5sum'...

Change this line:
fileSum[${count}]=$( cat "${filePaths[${count}]}" "${filePaths[${count}]}/rsrc" | md5sum | cut -d' ' -f 1)

To:
fileSum[${count}]=$( cat "${filePaths[${count}]}" "${filePaths[${count}]}/rsrc" | md5 | cut -d'=' -f 2)

Essentially replacing md5sum utility with md5, then piping output to cut and changing the field delimiter to '=' rather than a space and selecting field 2 rather than 1.

Very pleased to hear that you are becoming familiar with bash, by the way. 🙂

Answer 11

Guy Burns Author

Level 2

367 points

May 19, 2010 8:47 PM in response to JulieJulieJulie

Back again. This is the Terminal output when run on my +Duplicates Copy+ folder:

Last login: Thu May 20 13:28:30 on ttyp1
Welcome to Darwin!
jenny-pearces-imac-g5:~ Jenny$ chmod +x /Users/Jenny/Desktop/Bash\ Test\ Documents/BashTest.txt
jenny-pearces-imac-g5:~ Jenny$ /Users/Jenny/Desktop/Bash\ Test\ Documents/BashTest.txt /Users/Jenny/Desktop/Duplicates\ copy/
/Users/Jenny/Desktop/Bash Test Documents/BashTest.txt: line 1: ${filePaths${count}}: bad substitution
/Users/Jenny/Desktop/Bash Test Documents/BashTest.txt: line 17: fileSum0=d41d8cd98f00b204e9800998ecf8427e: command not found
/Users/Jenny/Desktop/Bash Test Documents/BashTest.txt: line 1: ${filePaths${count}}: bad substitution
/Users/Jenny/Desktop/Bash Test Documents/BashTest.txt: line 17: fileSum1=d41d8cd98f00b204e9800998ecf8427e: command not found
/Users/Jenny/Desktop/Bash Test Documents/BashTest.txt: line 1: ${filePaths${count}}: bad substitution
/Users/Jenny/Desktop/Bash Test Documents/BashTest.txt: line 17: fileSum2=d41d8cd98f00b204e9800998ecf8427e: command not found
jenny-pearces-imac-g5:~ Jenny$

I'm not sure it's worth spending much more time on this. However, if you're prepared to keep posting, JJJ, I'll keep testing.

*SCRIPT AFTER THE EDIT*
#!/bin/bash

declare -i count=0
declare -a fileSum filePaths

# Main folder can be declared on the command line or if not then use /testpix
MainFolder="${1:-/testpix}"

# Set the internal field separator to newline to preserve spaces in file paths
IFS=$'\n'

# Use 'find' to create a list of files within the folders.
filePaths=( $( find "${MainFolder}" -type f \! -name ".*" ) )

# Get an MD5 checksum for each file's combined content of both data and resource forks
for file in ${filePaths[*]} ; do
    fileSum${count}=$( cat "${filePaths${count}}" "${filePaths${count}}/rsrc" | md5 | cut -d'=' -f 2)
    let count+=1
done

# For each file, check for a duplicate checksum and if found, move the matching file to the user's trash folder
# Rename files with duplicate names by appending #XX
for ((i=0;i<${count};i++)) ; do
    [ -z "${filePaths[${i}]}" ] && continue
    dupecount=1
    for ((j=0;j<${count};j++)) ; do
        [ -z "${filePaths[${j}]}" ] && continue
        if [ "${fileSum[${i}]}" = "${fileSum[${j}]}" -a ${i} -ne ${j} ] ; then
            mv "${filePaths[${j}]}" ~/.Trash && filePaths[${j}]='' && fileSum[${j}]=''
        elif [ $(basename "${filePaths[${i}]}") = $(basename "${filePaths[${j}]}") -a ${i} -ne ${j} ] ; then
            let dupecount+=1
            dirname=$(dirname "${filePaths[${j}]}")
            fullfilename=$(basename "${filePaths[${i}]}")
            extension="${fullfilename##*.}"
            filename="${fullfilename%.*}"
            until [ ! -e "${dirname}/${filename} #${dupecount}.${extension}" ] ; do
                let dupecount+=1
            done
            mv "${filePaths[${j}]}" "${dirname}/${filename} #${dupecount}.${extension}"
            filePaths[${j}]="${dirname}/${filename} #${dupecount}.${extension}"
        fi
    done
done

exit 0

Answer 12

May 19, 2010 9:10 PM in response to Guy Burns

The number signs in your copy of the script have become outline-style numbers (1. 2. 3. instead of # # #). This may indicate that your text editor is changing the content, or that pasting the script here is altering the content, and I'm not sure which. I'll paste my copy of the script below.

EDIT: It's changing when pasted into this forum post.

It appears that the content of your script is also scrambled - judging from the output you posted.

OUTPUT
/Users/Jenny/Desktop/Bash Test Documents/BashTest.txt: line 1: ${filePaths${count}}: bad

Line 1 of the script should be the hashbang not the 19th line which is what your output is showing...
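For comparison, here is a minimal sketch of how bash array subscripts should look; the square brackets are what is missing from the scrambled line. The array contents are made up for illustration. The missing brackets are also the likely reason every checksum in the posted output is d41d8cd98f00b204e9800998ecf8427e, which is the MD5 of empty input: the cat command never produced any data.

```shell
#!/bin/bash
declare -a filePaths fileSum
declare -i count=0

# Made-up sample paths, one with a space in the name.
filePaths=( "first file.txt" "second file.txt" )

# Correct: the subscript goes in square brackets, inside the braces when reading.
fileSum[${count}]="checksum-for-${filePaths[${count}]}"
echo "${fileSum[0]}"

# Without the brackets, "${filePaths${count}}" is a "bad substitution" error,
# and "fileSum${count}=value" is taken as a command name, not an assignment.
```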

Here is my copy, zipped:
http://www.mediafire.com/?mzmmj0myuzt

It already has the execute bit set and proper unix line-endings etc. It's ready to run and can be opened in TextEdit to review the content.

Message was edited by: JulieJulieJulie

Answer 13

Guy Burns Author

Level 2

367 points

May 19, 2010 11:57 PM in response to JulieJulieJulie

The zipped version appears to work on the +Duplicates Copy+ folder. I'll generate a more complex test folder and let you know how it goes before I let it loose in the real world.

Answer 14

Guy Burns Author

Level 2

367 points

May 20, 2010 1:00 AM in response to Guy Burns

The script worked with a more complicated test folder. I generated a text document "A", put that in a folder and subfolders, then made a small change to the text and dumped that in subfolders. Then I duplicated that setup but changed the file names to "B". The final folder contained:

• 3 identical "A"s
• 2 changed "A"s
• 3 identical "B"s (identical to "A")
• 2 changed "B"s (identical to changed "A")

*RESULT FOR A*
The result of running the script was a folder with one "A" (one of the changed "A"s), and an "A#2" (one of the identical "A"s). Sent to the Trash was one of the changed "A"s and two of the identical "A"s.

QUES: What happened to A#1? Does the script start at #2?

*RESULT FOR B*
No "B"s were left in the folder, but there was one "B" in the Trash. The script appears to treat files with different names, but which are otherwise identical, as the same file and therefore removes the additional copies –– in this case, all the "B"s.

QUES: Why aren't there more "B"s in the Trash? What happened to them?

I wasn't expecting that identical files that had different names would be treated as the same file. That's a nice little touch actually, which removes even more unnecessary files than I was anticipating. But this feature might have a downside – some of my images changed names during the project, so if I went looking for one of them in the archive, I might not find it because of the name change.

It's now time for a test on a real folder, within which I'll bury a few subfolders containing identical files to see what happens.

Answer 15

May 20, 2010 2:03 AM in response to Guy Burns

QUES: What happened to A#1? Does the script start at #2?

Answer: Yes.

In the section which checks for duplicate checksums...
dupecount=1

Further into that section it checks for duplicate filenames...
let dupecount+=1

1+1=2

If you want it to start with #1 then change the "dupecount=1" to "dupecount=0". However, you may end up with:

A.txt
A #1.txt
A #2.txt

Whereas I prefer to think of A.txt as #1 and thus the next one should be #2.
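The numbering logic can be tried in isolation. This is a minimal sketch, not an excerpt from the script: the function name next_free_name is made up, and like the script it assumes the file name has an extension.

```shell
#!/bin/bash
# Hypothetical helper: print the first unused "name #N.ext" path in a folder.
# $1 = folder, $2 = file name with extension, $3 = starting value of the counter.
next_free_name() {
    local dir=$1 full=$2 n=$3
    local ext="${full##*.}" base="${full%.*}"
    n=$((n + 1))                      # mirrors "let dupecount+=1"
    while [ -e "$dir/$base #$n.$ext" ]; do
        n=$((n + 1))                  # name taken, try the next number
    done
    printf '%s\n' "$dir/$base #$n.$ext"
}

workdir=$(mktemp -d)
touch "$workdir/A.txt"
next_free_name "$workdir" "A.txt" 1   # like dupecount=1: first rename is "A #2.txt"
next_free_name "$workdir" "A.txt" 0   # like dupecount=0: first rename is "A #1.txt"
```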

QUES: Why aren't there more "B"s in the Trash? What happened to them?

Answer: When the script confirms that the content of a file is identical to the content of a prior file, the duplicate is moved to the trash. If the next duplicate moved to the trash has the same name, it will overwrite the existing file in the trash. In other words if B.txt in folder 2 and B.txt in folder 3 are both duplicates of B.txt in folder 1 then the file from folder 2 would be moved to the trash first and the file from folder 3 would overwrite the file from folder 2, resulting in only one file in the trash. This is because using the bash command mv to move a file to the trash circumvents the normal OS X GUI (Graphical User Interface) behavior of adding # 1, # 2 and so on to files with matching names.

I chose not to emulate that behavior in the script because the script is using checksums to verify that the file content is identical before moving the files to the trash and therefore overwriting the files there seemed acceptable to me.
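For anyone who would rather not rely on overwriting, a move that checks first could look something like this. It is only a sketch under stated assumptions: safe_move is a made-up name, it compares content with cmp instead of MD5, and it only compares against the file that already has the same name in the destination.

```shell
#!/bin/bash
# Hypothetical helper: move file $1 into folder $2 without overwriting anything.
# If an identical file of the same name is already there, the source is removed.
# If a different file of the same name is there, the new one gets a " #N" suffix.
safe_move() {
    local src=$1 destdir=$2
    local full base ext target n=1
    full=$(basename "$src")
    target="$destdir/$full"

    if [ -e "$target" ] && cmp -s "$src" "$target"; then
        rm "$src"                     # same content is already in place
        return 0
    fi

    # Split "name.ext" so the number lands before the extension, if there is one.
    if [[ $full == *.* ]]; then
        ext=".${full##*.}"
        base="${full%.*}"
    else
        ext=""
        base=$full
    fi

    while [ -e "$target" ]; do
        n=$((n + 1))
        target="$destdir/$base #$n$ext"
    done
    mv "$src" "$target"
}
```

Used with ~/.Trash as the destination, this would keep differing files that share a name side by side, at the cost of one extra comparison per move.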

For the last issue, the possibility that a file is an exact duplicate with a different name and you want to retain the existence of the file for the given project... my first thought is that the script could output the list of files and checksums into a text file. It would be easy to find the name of a missing duplicate image in that file, which would give you its MD5 checksum. Searching the file for that checksum would then show which other file had the same content, even though its name was different.
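A lookup against such a list might be sketched as below. Everything here is an assumption made for illustration: the one-line-per-file "checksum, tab, path" format, the function name same_content_as, and the made-up sample entries. It also assumes the list was written before any duplicates were moved to the trash.

```shell
#!/bin/bash
# Hypothetical manifest format: one "checksum<TAB>path" line per file.
# Given a file name, print every path in the manifest with the same checksum.
same_content_as() {
    local manifest=$1 name=$2
    awk -F'\t' -v name="$name" '
        { sum[NR] = $1; path[NR] = $2 }
        # Remember the checksum of any path whose last component is the name.
        { n = split($2, parts, "/"); if (parts[n] == name) wanted[$1] = 1 }
        END { for (i = 1; i <= NR; i++) if (sum[i] in wanted) print path[i] }
    ' "$manifest"
}

# Made-up sample data standing in for the script's real output.
manifest=$(mktemp)
printf '%s\t%s\n' \
    "aaa111" "/archive/GB097a.tif" \
    "bbb222" "/archive/GB097b.tif" \
    "aaa111" "/archive/renamed/cover.tif" > "$manifest"

same_content_as "$manifest" "cover.tif"
```

Here the search for cover.tif lists both cover.tif and GB097a.tif, because the sample data gives them the same checksum.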
