How to reduce the large number of files in ~/.julia?

I am working on a shared cluster. The home directory of every user is limited not only in total size but also in the number of files.
I build julia from source on a different drive.

To be specific: I am running julia 1.3.1, 1.4.0, and 1.4.1, and I count about 130 000 (mostly small) files.
Pkg's garbage collector (Pkg.gc()) does not remove anything.

  • How could I reduce the number of files in the ~/.julia directory?
  • Is there a smart way to clean up leftover files from former times?
  • Is there a way to move this directory out of the user's /home?
2 Likes

I would start with the last question: see ?DEPOT_PATH. You can put it anywhere.
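For instance, before starting Julia you can point the depot somewhere else entirely. A minimal sketch (the SCRATCH variable and the path are assumptions; substitute whatever filesystem on your cluster has no tight file-count quota):

```shell
# Put the Julia depot on a less restricted filesystem.
# SCRATCH is an assumed cluster variable; /tmp is only a fallback here.
depot="${SCRATCH:-/tmp}/${USER:-$(id -un)}/julia-depot"
mkdir -p "$depot"
export JULIA_DEPOT_PATH="$depot"
```

Everything Julia would normally write under ~/.julia (packages, artifacts, registries, compiled caches) then lands under that path instead.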

The rest depends on details; you could see if some environments keep around multiple versions of the same package so that GC can’t free them. The biggest culprits for the number of files should be packages and artifacts.

That said, these days (with modern filesystems) limiting the number of files for a user is a bit unusual. Generally, requiring users to invest time into stuff like this is using hours of highly skilled labor to reclaim a few cents worth of storage.

4 Likes

You can try

using Pkg, Dates
# Skip the grace period Pkg.gc normally waits before deleting
# unused package versions and artifacts.
Pkg.gc(; collect_delay=Second(0))

From what I have heard about the cluster at our university, maintaining this limit on the number of files provides a performance boost in HPC applications, and it’s not a question of storage space. I don’t know how this works exactly, but we do have to stick to strict quotas even when storage space isn’t running out.

So goes the argument. Space is not the issue.
Our cluster has dedicated file systems for many small files and for a few large files; the user's home directory is somewhere in between.

Thanks, this cleaned up some files.
However, only a few hundred, not the tens of thousands I was looking for.

Running out of inodes is not uncommon on clusters when some jobs produce gazillions of files.

1 Like

@met-j you could do something really hacky like zip up the entire julia directory and make a julia command which unzips, runs, rezips. That way, the files just look like temp files from your scripts.
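A rough sketch of such a wrapper, assuming a POSIX shell and tar (the script name, archive path, and depot location are all made up):

```shell
# Write a hypothetical "juliaz" wrapper: unpack the depot, run julia,
# then repack it so only one file counts against the home quota.
wrapper="${TMPDIR:-/tmp}/juliaz"
cat > "$wrapper" <<'EOF'
#!/bin/sh
set -e
archive="$HOME/julia-depot.tar.gz"
depot="${TMPDIR:-/tmp}/julia-depot"
mkdir -p "$depot"
[ -f "$archive" ] && tar -xzf "$archive" -C "$depot"
JULIA_DEPOT_PATH="$depot" julia "$@"
tar -czf "$archive" -C "$depot" .
rm -rf "$depot"
EOF
chmod +x "$wrapper"
```

Unpacking and repacking on every invocation is slow, of course, so this only makes sense when the quota matters more than startup time.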

The biggest culprits are five: artifacts, clones, conda, packages and registries. Each contains between 20 000 and 30 000 files.
These numbers remind me a bit of the suckless philosophy and leave me wondering whether it needs to be this way.
My current solution is to set JULIA_DEPOT_PATH to a NAS without limitations on the number of files before starting julia.
At least on our cluster, not everyone has access to such storage. Most users must live with smaller numbers of files.

Haha. It says “The more code lines you have removed, the more progress you have made. As the number of lines of code in your software shrinks, the more skilled you have become and the less your software sucks.”

The answer is then rm -rf *.

1 Like

It’s also common for users’ home directories on HPC clusters to be mounted via NFS, which positively CHOKES on small files. The new Pkg system helped immensely, but it was not uncommon to see File I/O timeouts pre 1.0. Now the limitation just manifests as slower-than-expected file access.

You can completely remove clones at least. How is the distribution of files among the leftover directories?
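One way to get that breakdown, assuming a standard depot layout (the JULIA_DEPOT_PATH fallback is just the default location):

```shell
# Count regular files under each top-level directory of the depot.
depot="${JULIA_DEPOT_PATH:-$HOME/.julia}"
for d in "$depot"/*/; do
    [ -d "$d" ] || continue
    printf '%6d  %s\n' "$(find "$d" -type f | wc -l)" "${d#$depot/}"
done | sort -rn
```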

The registry itself is only needed for Pkg operations, not to run anything.

1 Like

One option is to use containers: Singularity is pretty common on many HPC clusters, and would appear to the file system as a single file.
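As a sketch, a minimal Singularity definition file for such an image might look like this (the Docker image tag is an assumption; pick whatever Julia version you need):

```text
Bootstrap: docker
From: julia:1.4.1

%runscript
    exec julia "$@"
```

Building it with singularity build julia.sif julia.def produces a single .sif file, and singularity exec julia.sif julia script.jl runs Julia from inside the image.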

3 Likes

I have a similar number of files in my ~/.julia directory, so I'll assume it's a pretty similar situation. I deleted clones, which got rid of some 6,000 files. What remains is broken down like this:

directory     files
artifacts     74759
compiled       1120
environments      8
logs              4
packages      26575
prefs             3
registries    25990
servers           7

So artifacts are the major culprit, followed by packages and registries. There’s not much we can do about artifacts: these contain libraries which need to be individual files arranged in a certain way in order to work. Packages and registries we could potentially do something about.

For packages, we’d need to teach Julia how to load code from a tarball, but that might break things since some packages are probably designed to load data and such from the source tree as well. It would probably be easier and more reliable to just make packages appear to be in a read-only file system that’s actually backed by tarballs.

Registries are the easiest since they're pure data: we could teach Pkg to load registry info directly from a tarball and that would “just work”.

Since the biggest source of file count is not fixable, however, it does suggest that figuring out a way to not store the ~/.julia directory on a file system that has trouble with lots of files would be the most effective way to address your particular issue.

3 Likes

Note that all artifacts and packages that don’t have a deps/build.jl file should be relocatable, so if you move those to a different file system (which can be read-only), that would eliminate a lot of the files on this file system. You’ll still be able to install into and modify ~/.julia, and it will grow over time if you install new packages, but you can do the same thing again at some later time to reduce the file count again.
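A sketch of that move, relying on Julia's depot stacking: as I understand it, entries later in JULIA_DEPOT_PATH are consulted read-only for packages and artifacts, while new installs go to the first entry. The RO_DEPOT target path here is only an example:

```shell
# Move the file-heavy directories into a secondary depot on another
# filesystem, then stack it behind the primary one.
primary="${JULIA_DEPOT_PATH%%:*}"
primary="${primary:-$HOME/.julia}"
ro_depot="${RO_DEPOT:-/tmp/julia-ro-depot}"   # example target path
mkdir -p "$ro_depot"
for d in artifacts packages; do
    [ -d "$primary/$d" ] && mv "$primary/$d" "$ro_depot/" || true
done
export JULIA_DEPOT_PATH="$primary:$ro_depot"
```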

Perhaps this would help with the time it takes on Windows to process the registry? On Windows I believe it takes an inordinate amount of time because of the antivirus getting involved with each of the thousands of files.

1 Like

Probably.

1 Like

artifacts are, in my view, a big problem. Very often artifacts come with tons of files that are not needed at all for use from Julia (.lib, .a, .exe, include/*.h, etc.).

And while I’m at this, why on Windows do we have to have all those packages/Xorg_xxx?

You don’t have the artifacts for them.

Being a user of various HPC clusters, I am familiar with the argument, but I don’t think there is a compelling technological reason for this. It’s just that some cluster architectures were set up without anticipating the need for many small files, and upgrades can be disruptive.

I agree with @simonbyrne that containers are the best short-run workaround (with some other, even more important, advantages), but in the long run it is worth raising this with your cluster admins. Most clusters get overhauled pretty frequently (every few years).

In any case, I am not sure this is something that Julia should strive to fix by replicating some sort of a filesystem (a tarball, or SquashFS).