
Working With Many Files in a Folder

Hello everyone,


I'm working with many small data files, most of which are about 2 KB. They are organized in folders. The folders contain anywhere from 100,000 to 300,000 files. As my project progresses, there may be a folder containing 7 million.


I know that the OS X file system has a high upper limit on the number of files that a folder can contain, but I would like to know whether it would be better if I broke up the folders into subfolders containing, say, not more than 25,000 files each. The reason I ask is that Finder seems to get bogged down with more than 50,000 files in a folder. When I open such a folder, there are no files shown in it, and no folder size or number of files is shown in the Get Info window for the folder. It takes a few minutes of the spinning progress wheel before the file icons appear and I can see what is in there.


Does anyone out there work with many files and have tips they would share?


Greg

MacBook Pro (Retina, 13-inch, Early 2015), macOS Sierra (10.12.4)

Posted on Apr 21, 2017 9:17 AM

15 replies

Apr 21, 2017 8:06 PM in response to Lypny

If you are going to have huge directories, then avoiding the use of the Finder to look at them will make things faster, as the Finder is going to always look at various file attributes, which is going to add extra disk seeks for every file in the directory. If you have an SSD, this might not be too much of a factor, but if you are using a rotating disk, it is going to hurt. Millions of files in a folder is going to hurt a lot.


You can cut down on some of the extra Finder work (not all, but some) by telling the Finder not to show Icon Preview.


If you can come up with a method of splitting the files into manageable sizes based on names, creation dates, etc., that will work for your workflow, that would be helpful. For example:


Year2017 -> Month04 -> Day07 -> Hour05


This assumes you are creating files on a periodic basis and at such a rate that it is best to go down to the hour level. If per day is sufficient, then just go to the day level.
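If it helps, here is a minimal sketch of that idea, assuming files should be bucketed by their modification date. The ~/Desktop/Files source and ~/Desktop/Sorted destination are only placeholder paths, and it uses the BSD stat that ships with macOS:

# Sketch only: move each file into a Year/Month/Day subfolder based on its
# modification date. Adjust SRC and DEST for your own layout.
SRC="$HOME/Desktop/Files"
DEST="$HOME/Desktop/Sorted"

find "$SRC" -maxdepth 1 -type f -print0 |
while IFS= read -r -d '' f; do
    # BSD stat: format the modification time as e.g. Year2017/Month04/Day07
    d=$(stat -f '%Sm' -t 'Year%Y/Month%m/Day%d' "$f")
    mkdir -p "$DEST/$d"
    mv "$f" "$DEST/$d/"
done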


If date-based separation is not going to work, then use name-based separation:


a -> aa -> aaa

a -> ab -> aba

a -> ac -> aca

...

b -> ba -> baa

b -> bb -> bba

b -> bc -> bca

...

etc...

Scripts can be written to move files into some directory structure that works. It just depends on what structure you feel will keep the folders at manageable sizes.
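As a rough illustration of the name-based variant (again only a sketch; the paths are placeholders and it assumes the first two characters of each filename make sensible bucket names):

# Sketch only: bucket files into subfolders named after the first one and two
# characters of each filename (e.g. abcd.txt -> a/ab/abcd.txt).
SRC="$HOME/Desktop/Files"
DEST="$HOME/Desktop/ByName"

find "$SRC" -maxdepth 1 -type f -print0 |
while IFS= read -r -d '' f; do
    name=$(basename "$f")
    one=${name:0:1}          # first character
    two=${name:0:2}          # first two characters
    mkdir -p "$DEST/$one/$two"
    mv "$f" "$DEST/$one/$two/"
done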


If you are going to switch over to using the Terminal, the VikingOSX suggestions and my additions will help, but millions of files in a single folder are still going to make you wait. You can get away with larger folders when accessing them from a Terminal session, though.

Apr 22, 2017 6:28 AM in response to BobHarris

Bob,


Thanks for pointing out the efficiency and sort suppression of the -f flag. The side-effect of this flag is that it implicitly forces the -a flag, and now file counts are off by the inclusion of at least the current directory (.), the parent directory (..) and the .DS_Store output lines that are presented to the wc utility.



If we were using GNU ls, then the -I ".*" syntax could be used to suppress the dot files. Without it, we have to resort to losing the beauty pageant with the following syntax, which correctly reports the file count while using the -f flag. The following also expands any sub-directories, and the use of the -d flag suppresses including their dot files too. We remain in the current directory while peering into another directory's contents. The resulting count of 20 is the accurate file count, and it includes a single directory located in ~/Desktop/Files.



$ shopt -s extglob

# The sed '$'d trims the last line of output which is the sub-shell reporting the Session saved.

$ (cd ~/Desktop/Files;ls -1fd !(.*) | sed '$'d) | wc -l | sed 's/^[ \t]*//'

$ shopt -u extglob


Apr 22, 2017 7:12 PM in response to Lypny

The appeal of Finder is that it is so natural to open a file, even one chosen at random, by double-clicking it, to see its contents and figure out where my code has gone wonky

Many times

open filename.extension

will automatically open the correct application, as if you had double clicked on it.


If you need to force a specific application, and the default for the .extension is not that app, then you would include the -a Application option, as in:

open -a LibreOffice filename.extension


You said

I'm doing text analysis and extracting specific bits of information from files that all contain the same info but are not all formatted the same way

If the files are plain text or at least the strings you are looking for are plain text, then you can use lots of different tools in the terminal to search the files. 'grep' is a great tool for searching a large number of files to look for unique strings.


And because you have a very large number of files that may overwhelm the command line length limits, you could do something like:


Type 'cd' and a space

Drag and drop the folder icon from the Finder to the Terminal session

Press return.

The Terminal session's current working directory is now the folder you are interested in.


Now:

ls -f | grep -i 'filename_pattern_you_are_interested_in' | xargs grep -i -l 'pattern_in_files_you_are_looking_for'


  • This command will list the names of all the files that contain the pattern you are looking for.
  • The 1st 'grep' will select the subset of files that have a specific pattern in the file name.
  • The 2nd 'grep' will search the contents of each file looking for the pattern you are interested in.
  • The 'xargs' command will invoke the 2nd 'grep' multiple times, each time with a subset of the file names it reads from its side of the pipe, but never so many that the command line gets too long. It is a very efficient way of processing lots of files without exceeding the command-line length limit.
  • The -l (dash lowercase L) in the 2nd 'grep' tells 'grep' to show just the file name if the file contains the pattern. If you want to actually see which lines in the file match your pattern, then remove the -l.
  • The -i in both 'grep's says to ignore upper/lower case. That often makes it easier to find things.


Grep accepts regular expressions. If you do not understand regular expressions, know that just using simple a-z, 0-9 text is good enough for many things.


NOTE: Most people with far fewer files in a directory would actually just use:

grep -i -l 'search_pattern' company*

My example line above is only because you have so many files in a single directory that wildcard file name expansion would most likely put so many names on the command line that it would overflow the maximum length allowed.
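Another way to sidestep that limit, if you prefer it (a sketch; the folder path and both patterns are placeholders), is to let find generate the file list instead of ls:

# find never puts millions of names on one command line; xargs -0 batches them safely
find ~/Desktop/Files -maxdepth 1 -type f -iname '*filename_pattern_you_are_interested_in*' -print0 | xargs -0 grep -i -l 'pattern_in_files_you_are_looking_for'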

Now if you know things like perl, awk, python, or another scripting language that is good at searching files, you can have a fantastic time using Terminal sessions to process your files.

Apr 21, 2017 2:30 PM in response to Lypny

In the Terminal, you can type the following commands (lines starting with # are comments):

# Terminal prompts for illustration. Yours may be different. A '~', or $HOME is shorthand

# for your home directory (e.g. ~/Desktop/afolder). No white-space in file/folder names, or

# you must escape these with ~/Desktop/"a folder", or ~/Desktop/a\ folder.

#

# get the count of files in the directory, ignoring dot files. Strip leading spaces on result.

$ ls -1 ~/Documents | wc -l | sed -e 's/^[ \t]*//'

19

$ filecnt=$(ls -1 ~/Documents | wc -l | sed -e 's/^[ \t]*//')

$ echo $filecnt

19

# get the folder hierarchy size in kilobytes. Get just the first field value. See man du.

$ du -hks ~/Documents | cut -f1

41716

# show the first 25 filenames in the folder. A trailing '/' indicates a subfolder.

$ ls -1p ~/Documents | head -n 25 | more

# you can see the manual page for each of these commands

$ man ls

$ man du

Apr 21, 2017 7:52 PM in response to VikingOSX

When dealing with huge directories, if you do not need the file names in sorted order, it is best to include the -f option

-f Output is not sorted. This option turns on the -a option.

so ls -1 would become ls -1f


The ls -1p ~/Documents will sort alphabetically, and if you want the first 25 files alphabetically, that would be fine.


If you want to see the first 25 files as stored in the directory, include -f (no sorting).

If you want to see the most recently modified files first, include -t.

If you want to see the oldest first, include -tr.

If you want to see the largest files first, include -S; adding -s (-Ss) will also show the allocated size in blocks, and -Sl will show a long listing with sizes in bytes.

If you want the smallest files first, then -Sr, -Ssr, or -Slr.
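Putting those together, they look like this (the ~/Documents folder is just an example):

ls -1f ~/Documents     # directory order, no sorting (fastest; also shows dot files)
ls -1t ~/Documents     # most recently modified first
ls -1tr ~/Documents    # oldest first
ls -1S ~/Documents     # largest files first
ls -1Sr ~/Documents    # smallest files first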


But with millions of files, if you can avoid sorting, it will speed things up. That is to say just get the names, do not sort the names, especially DO NOT ask for attributes, such as dates, sizes, permissions, ownership, etc.... And if you are going to pipe the output to a command that is going to summarize, such as 'wc', then it is just a waste of CPU to sort the files or get extra information about the file.

Apr 22, 2017 7:49 AM in response to VikingOSX

I understand the off-by-3 because -a is enabled, but really if Lypny is going to be using Folders with 100,000, 300,000 and up to 7 million files, I do not think being off-by-3 for . and .. and .DS_Store is really going to matter all that much 🙂


But they could more efficiently be filtered out with a grep -v '^\.' in the pipeline if a truly accurate count is required.
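For example, sticking with the ~/Desktop/Files placeholder from above (sketch only):

ls -f ~/Desktop/Files | grep -v '^\.' | wc -l | sed 's/^[ \t]*//'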

The following also expands any sub-directories

As soon as you start looking at the file type, so you can see which are directories, you are increasing the time it takes to do the count, as now the 'ls' command has to stat() each file, which is going to cause the disk to seek to a different location for each and every file.


Also, the !(.*) with 7 million files is going to blow out the command line length limit. Depending on the length of all the filenames, 100,000 or 300,000 files might also blow out the command line length limit. (The things I learn working on Unix file systems for a living 🙂 ).

Apr 22, 2017 2:06 PM in response to VikingOSX

Hi Bob Harris and VikingOSX,


Thanks to both of you for your helpful comments. Based on your deliberation, I think I can use both. With Terminal, the inclusion or exclusion of a few hidden files is not a big deal, because I am working with so many files and because I have to run the file names through an algorithm to find out which ones I have missed in any given big batch of downloads. It is quite normal to miss upwards of 2,000 files when downloading a batch of 500,000. I just go back and try to grab the remaining 2,000. Terminal will be useful in allowing me to double-check the total number of files in each downloaded batch against what my algorithm tells me.


The appeal of Finder is that it is so natural to open a file, even one chosen at random, by double-clicking it, to see its contents and figure out where my code has gone wonky (I'm doing text analysis and extracting specific bits of information from files that all contain the same info but are not all formatted the same way). As for creating smaller subfolders of the files that Finder can handle without slowing down, I guess I am lucky that I adopted a strict naming convention for them. Each file has a unique identifier for a person or company and a date. I think that subfolders based on those two dimensions would contain about 25,000 files each, which shouldn't tax Finder too much. The only bit of extra work is in creating the subfolders and revising my text processing algorithms to crawl through them.
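Something like the following is what I have in mind for that double-check, just a sketch with placeholder paths:

# Count the regular files under each first-level subfolder.
for d in ~/Desktop/Files/*/; do
    printf '%s\t%s\n' "$(find "$d" -type f | wc -l | tr -d ' ')" "$d"
done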


Thanks once again for your thoughtful insights,


Greg

