How to reduce the large number of files in ~/.julia?

I am working on a shared cluster. The home directory of every user is limited not only in total size but also in the number of files.
I build julia from source on a different drive.

To be specific: I am running julia 1.3.1, 1.4.0, and 1.4.1, and I count about 130 000 (mostly small) files.
Running the package garbage collector (Pkg.gc()) does not remove anything.

  • How could I reduce the number of files in the ~/.julia directory?
  • Is there a smart way to clean up leftover files from former times?
  • Is there a way to move this directory out of the user's /home?
3 Likes

I would start with the last question: see ?DEPOT_PATH. You can put it anywhere.
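
A minimal sketch of what that looks like (the target path here is made up for illustration):

# See where Julia currently keeps its depot(s):
@show DEPOT_PATH

# The usual way to relocate the whole depot is to set JULIA_DEPOT_PATH in your
# shell before launching Julia, e.g.
#   export JULIA_DEPOT_PATH=/path/on/other/drive/julia_depot
# Julia will then create and use that directory instead of ~/.julia.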

The rest depends on details; you could see if some environments keep around multiple versions of the same package so that GC can’t free them. The biggest culprits for the number of files should be packages and artifacts.

That said, these days (with modern filesystems) limiting the number of files for a user is a bit unusual. Generally, requiring users to invest time into stuff like this is spending hours of highly skilled labor to reclaim a few cents' worth of storage.

4 Likes

You can try

using Pkg, Dates
# collect_delay=Second(0) deletes unused package versions and artifacts immediately
# instead of waiting out the default grace period.
Pkg.gc(; collect_delay=Second(0))

From what I have heard about the cluster at our university, maintaining this limit on the number of files provides a performance boost in HPC applications, and it’s not a question of storage space. I don’t know how this works exactly, but we do have to stick to strict quotas even when storage space isn’t running out.

So goes the argument; space is not the issue.
Our cluster has dedicated file systems for many small files and for a few large files. The users' home directories sit somewhere in between.

Thanks, this cleaned up some files.
However, it was only a few hundred, not the tens of thousands I was hoping for.

Running out of inodes is not uncommon on clusters when gazillions of files are produced by some jobs.

2 Likes

@met-j you could do something really hacky like zipping up the entire ~/.julia directory and making a julia wrapper command which unzips, runs, and rezips. That way, the files just look like temp files from your scripts.
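
A very rough sketch of such a wrapper (the archive name and paths are made up, and it assumes nothing else touches ~/.julia while the job runs):

archive = joinpath(homedir(), "dot-julia.tar.gz")
depot   = joinpath(homedir(), ".julia")
isfile(archive) && run(`tar -xzf $archive -C $(homedir())`)   # unpack the depot
try
    run(`julia $ARGS`)                                        # forward the original arguments
finally
    run(`tar -czf $archive -C $(homedir()) .julia`)           # repack the depot
    rm(depot; recursive=true, force=true)                     # leave only the single archive behind
end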

The biggest culprits are five: artifacts, clones, conda, packages, and registries. Each holds between 20 000 and 30 000 files.
These numbers remind me a bit of the suckless philosophy and leave me wondering whether it needs to be this way.
My current solution is to set JULIA_DEPOT_PATH to a NAS without limitations on the number of files before starting Julia.
At least on our cluster, not everyone has access to such storage; most users must live with a smaller number of files.

Haha. It says “The more code lines you have removed, the more progress you have made. As the number of lines of code in your software shrinks, the more skilled you have become and the less your software sucks.”

The answer is then rm -rf *.

1 Like

It’s also common for users’ home directories on HPC clusters to be mounted via NFS, which positively CHOKES on small files. The new Pkg system helped immensely, but it was not uncommon to see file I/O timeouts pre-1.0. Now the limitation just manifests as slower-than-expected file access.

You can completely remove clones at least. How are the files distributed among the leftover directories?

The registry itself is only needed for Pkg operations, not to run anything.
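
For example (a sketch; run it only while no Julia process is using the depot, and note that the registry is simply re-downloaded the next time Pkg needs it):

depot = first(DEPOT_PATH)
# `clones` is just a cache of bare git repositories and can be deleted outright:
rm(joinpath(depot, "clones"); recursive=true, force=true)
# `registries` can go too if you only need to run code; Pkg re-clones it on demand:
rm(joinpath(depot, "registries"); recursive=true, force=true)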

1 Like

One option is to use containers: Singularity is pretty common on many HPC clusters, and the whole image appears to the file system as a single file.

4 Likes

I have a similar number of files in my ~/.julia directory, so I’ll assume it’s a pretty similar situation. I deleted clones, which got rid of some 6,000 files. What remains is broken down like this:

directory      files
artifacts      74759
compiled        1120
environments       8
logs               4
packages       26575
prefs              3
registries     25990
servers            7

So artifacts are the major culprit, followed by packages and registries. There’s not much we can do about artifacts: these contain libraries which need to be individual files arranged in a certain way in order to work. Packages and registries we could potentially do something about.

For packages, we’d need to teach Julia how to load code from a tarball, but that might break things since some packages are probably designed to load data and such from the source tree as well. It would probably be easier and more reliable to just make packages appear to be in a read-only file system that’s actually backed by tarballs.

Registries are the easiest since they’re pure data: we could teach Pkg to load registry info directly from a tarball and that would “just work”.

Since the biggest source of the file count is not fixable, however, figuring out a way not to store the ~/.julia directory on a file system that has trouble with lots of files would be the most effective way to address your particular issue.
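
For reference, a breakdown like the table above can be produced with something along these lines (a quick sketch using walkdir on the first depot):

depot = first(DEPOT_PATH)
for dir in sort(readdir(depot))
    path = joinpath(depot, dir)
    isdir(path) || continue
    # count every file below this top-level depot directory
    nfiles = sum(length(files) for (root, dirs, files) in walkdir(path))
    println(rpad(dir, 14), nfiles)
end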

3 Likes

Note that all artifacts and packages that don’t have a deps/build.jl file should be relocatable, so if you move those to a different file system (which can be read-only), that would eliminate a lot of the files on this file system. You’ll still be able to install packages into and modify ~/.julia, and it will grow over time as you install new packages, but you can do the same thing again at some later time to reduce the file count again.
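
A rough sketch of that idea, assuming the other file system has no file-count limit and no Julia process is using the depot while you do the move (paths are illustrative):

depot  = first(DEPOT_PATH)
target = "/other/filesystem/julia_depot"   # illustrative location without a file-count quota
mkpath(target)
for sub in ("artifacts", "packages")
    src = joinpath(depot, sub)
    dst = joinpath(target, sub)
    mv(src, dst)           # move the bulk of the files off the limited file system
    symlink(dst, src)      # leave a single symlink behind in ~/.julia
end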

Perhaps this would help with the time it takes on Windows to process the registry? On Windows I believe it takes an inordinate amount of time because of the antivirus getting involved with each of the thousands of files.

1 Like

Probably.

1 Like

Artifacts are, in my view, a big problem. Very often artifacts come with tons of files that are not needed at all for their Julia usage (.lib, .a, .exe, include/*.h, etc.).

And while I’m at it, why on Windows do we have to have all those packages/Xorg_xxx?

You don’t have the artifacts for them.

Being a user of various HPC clusters, I am familiar with the argument, but I don’t think there is a compelling technological reason for this. It’s just that some cluster architectures were set up without anticipating the need for many small files, and upgrades can be disruptive.

I agree with @simonbyrne that containers are the best short-run workaround (with some other, even more important, advantages), but in the long run it is worth asking your cluster admins to fix this. Most clusters get overhauled pretty frequently (every few years).

In any case, I am not sure this is something that Julia should strive to fix by replicating some sort of a filesystem (a tarball, or SquashFS).

1 Like