[ANN] OhMyArtifacts.jl

I’m excited to announce OhMyArtifacts.jl, a dynamic artifact system that lives in a scratch space. It has an API similar to Artifacts.jl, but it uses SHA-256 hashes and caches every file. The usage of each cached file is tracked, so unused caches can also be removed automatically.

Here is the iris example with OhMyArtifacts:

julia> using OhMyArtifacts
[ Info: Precompiling OhMyArtifacts [cf8be1f4-309d-442e-839d-29d2a0af6cb7]

# Register and get the Artifacts.toml
julia> myartifacts_toml = @my_artifacts_toml!();

# Query the Artifacts.toml for the hash bound to "iris"
julia> iris_hash = my_artifact_hash("iris", myartifacts_toml)

# If not bound
julia> if isnothing(iris_hash)
           iris_hash = create_my_artifact() do working_dir
               iris_url_base = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris"
               download("$iris_url_base/iris.data", joinpath(working_dir, "iris.csv"))
               download("$iris_url_base/bezdekIris.data", joinpath(working_dir, "bezdekIris.csv"))
               download("$iris_url_base/iris.names", joinpath(working_dir, "iris.names"))
               # explicitly return the path
               return working_dir
           end
           bind_my_artifact!(myartifacts_toml, "iris", iris_hash)
       end

julia> iris_hash
SHA256("83c1aca5f0e9d222dee51861b3def4e789e57b17b035099570c54b51182853d4")

julia> my_artifact_exists(iris_hash)
true

# Get the artifact path
julia> iris_dataset_path = my_artifact_path(iris_hash);

julia> readdir(iris_dataset_path)
3-element Vector{String}:
 "bezdekIris.csv"
 "iris.csv"
 "iris.names"

julia> readline(joinpath(iris_dataset_path, "iris.names"))
"1. Title: Iris Plants Database"

# Every subfile is a symlink
julia> all(islink, readdir(iris_dataset_path, join=true))
true

julia> iris_name_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names";

# Helper function that combines create and bind
julia> iris_name_hash = download_my_artifact!(Base.download, iris_name_url, "iris.names", myartifacts_toml)
SHA256("38043f885d7c8cfb6d2cec61020b9bc6946c5856aadad493772ee212ef5ac891")

# Same value
julia> readline(my_artifact_path(iris_name_hash))
"1. Title: Iris Plants Database"

# Same file
julia> readlink(joinpath(iris_dataset_path, "iris.names")) == my_artifact_path(iris_name_hash)
true

# Unbind iris dataset
julia> unbind_my_artifact!(myartifacts_toml, "iris")

julia> using Dates

# Recycle: "iris/iris.names" is also used by "iris.names", so this only
#  removes 2 files ("iris/iris.csv", "iris/bezdekIris.csv") and 1 folder ("iris")
julia> OhMyArtifacts.find_orphanages(; collect_delay=Hour(0))
[ Info: 3 MyArtifacts deleted (24.889 KiB)

# "iris.names" still exists
julia> my_artifact_exists(iris_name_hash)
true

julia> readline(my_artifact_path(iris_name_hash))
"1. Title: Iris Plants Database"

# Iris dataset is removed
julia> my_artifact_exists(iris_hash)
false

julia> isdir(iris_dataset_path)
false

# Unbind and recycle
julia> unbind_my_artifact!(myartifacts_toml, "iris.names")

# When `using OhMyArtifacts`, this function is called automatically if it hasn't run
#  for 7 days, so generally we don't need to call it manually.
julia> OhMyArtifacts.find_orphanages(; collect_delay=Hour(0))
[ Info: 1 MyArtifact deleted (10.928 KiB)
12 Likes

What’s this for? Why would you use this instead of official artifacts?

1 Like

The Artifacts.jl stdlib is static; you can’t generate new artifacts on the fly.

1 Like

For those of us who have yet to meet artifacts directly,
would you elaborate on their what, why, and when, and
how best to apply your work?

5 Likes

The Artifacts.jl stdlib is designed for managing artifacts automatically. For example, JLL packages have an Artifacts.toml that specifies the download URL, SHA-1 tree hash, and target platform of each prebuilt shared library. Artifacts.jl downloads those files and manages them, so you don’t need to worry about ending up with two copies of a shared library on disk when two JLL packages use the same one, or about libraries lingering on disk after the JLL packages are removed. The small problem is that the Artifacts.toml has to be created and frozen before you can ship the package. That works perfectly fine for JLL packages, but it becomes a problem if you need to modify the Artifacts.toml at runtime.
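
For concreteness, this is roughly how a package consumes such a static artifact; "SomeData" is a hypothetical name that must already be bound in the package's Artifacts.toml before the package is published:

using Artifacts

data_dir = artifact"SomeData"  # resolved to a local path at runtime,
                               # but the binding itself is fixed at packaging time
readdir(data_dir)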

Then there is Scratch.jl, which provides runtime read/writable disk space for each package. The space itself is managed, but everything inside it has to be managed manually.
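
A minimal Scratch.jl sketch (the key "downloaded_files" is just an example name):

using Scratch

dir = @get_scratch!("downloaded_files")     # created on first use, owned by the calling package
write(joinpath(dir, "cache.txt"), "hello")  # but whatever goes inside is yours to manage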

So OhMyArtifacts simply implements the Artifacts.jl idea inside the space that Scratch.jl provides: identify files by hash and track their usage, so there are no worries about duplicated files or garbage left on disk.

10 Likes

If I create a package that gives the user the option to download a prebuilt sysimage to reduce package latency, would OhMyArtifacts be a good choice for this use case?

I might be wrong, but I don’t think we can switch sysimages at runtime, so neither Artifacts nor OhMyArtifacts would fit that use.

I am really confused about how your package compares to using the scratch space directly, or to build.jl. Is this an easier way to interact with scratch space?

Also, how does your package’s hash algorithm compare to official artifacts’? Official artifacts are identified by a unique hash (a tree hash in their case, though it could actually be anything), which is hardcoded in the corresponding JLL package.

I’m also using a unique hash. Artifacts.jl uses the SHA-1 tree hash of the directory; OhMyArtifacts.jl uses SHA-256. Depending on the type of artifact, it either hashes the content of the file or performs a tree hash on the directory.

The difference is that we don’t/can’t hardcode the unique hash in the TOML: every hash is computed at runtime and stored in the TOML. Also, the unit of Artifacts.jl is a directory, which means two artifacts could contain duplicate files. OTOH, OhMyArtifacts.jl hashes every single file in a directory, so that can’t happen (but we spend more time tracking the usage of every single file, so that’s a trade-off). So it’s more like a custom cache system than an artifact system; I use the name simply because I borrowed the idea and API from Artifacts.jl.
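
As an illustration of the content-hash part (this is not OhMyArtifacts’ internal code, and the file names are placeholders): a content hash identifies a file by its bytes, independent of its name, which is what makes the deduplication possible.

using SHA

content_key(path) = bytes2hex(open(sha256, path))

content_key("iris.csv") == content_key("copy_of_iris.csv")  # true whenever the bytes match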

The purpose of build.jl is different, so I won’t talk about it. As for using the scratch space directly, the difference is that you have to maintain the paths and check existence manually, and also make sure every file is unique (if you care about that). Imagine you have a package with 2 versions, both using the same scratch space. How do you know whether an artifact is being used by the other version or not? Can you delete it safely? What would happen if the two versions generate the same file at the same time? Sure, you can definitely make them use two different scratch spaces, but then you have duplicate files. These are the problems that OhMyArtifacts.jl tries to solve.

EDIT: I just realized the two-version scenario won’t work directly because both versions share the same UUID; I would need to make a patch :stuck_out_tongue:. But you get what I mean.

1 Like

Thanks for your reply!

However, I am also interested in how OhMyArtifacts identifies whether two artifacts are the same. Official artifacts are identified by the hash key, which is shipped along with the package code. How does your package do that? Do I need to ship the key along with the source code, or find a place to persistently store it so it can be retrieved later?

Also, you said that all files are cached; does that mean the directory structure may not be preserved?

That’s the Neat Part, You Don’t.
Joking aside, that’s why I said it is more like a custom cache system and why we can’t hardcode the hash. It’s made for caching artifacts generated at runtime, computing the hash directly on the generated files. You could hardcode a hash in your package, but it would probably only be used as a check on the correctness of the generated files, or as an existence check before running the generation function if needed.

In Artifacts.jl, the URLs are shipped along with the package, so it checks whether the artifact already exists using the hardcoded hash and downloads it if needed. But in OhMyArtifacts.jl, the creation of the artifacts is done by the downstream packages, so I think it is reasonable for the hash to be shipped not with OhMyArtifacts.jl but with the downstream packages themselves.
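
To make that concrete, here is a sketch of how a downstream package might cache a runtime-generated artifact with this API (the name "preprocessed" and the generation step are placeholders; a hash the downstream package ships could then be compared against the result as the correctness check mentioned above):

using OhMyArtifacts

toml = @my_artifacts_toml!()
art_hash = my_artifact_hash("preprocessed", toml)
if isnothing(art_hash) || !my_artifact_exists(art_hash)
    art_hash = create_my_artifact() do working_dir
        # generate files at runtime, then hand the path back to be hashed and cached
        write(joinpath(working_dir, "table.csv"), "a,b\n1,2\n")
        return working_dir
    end
    bind_my_artifact!(toml, "preprocessed", art_hash)
end
path = my_artifact_path(art_hash)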

In the newest release (v0.3), the directory structure is preserved by creating a shadow folder. The shadow folder copies the directory structure of the original directory and replaces every file with a symbolic link that points to the real file in the cache. As in the iris example, iris_dataset_path is a shadow folder and every file inside it is a symlink.

1 Like

Thanks! I am starting to understand.

The Julia Artifacts system aims to avoid building artifacts locally: users submit a recipe to Yggdrasil and get the artifact in the form of a JLL package. The exact content delivered, indexed by git tree hash, depends not only on the package name (UUID) and package version, but also on the target platform.

OhMyArtifacts.jl aims to provide an easily managed cache of locally built artifacts. The build step is done on the user’s machine, and the result is cached and potentially shared by other packages. Caching avoids rebuilding the same artifact every time the package is loaded, while sharing saves disk space.

Regarding directory structure, symlinks may have some surprising behaviors on Windows. Does your package have, or plan to have, an option to turn them off for specific packages?

3 Likes

What kind of surprising behaviors would happen on Windows? The CI seems to pass on Windows, though I only tested constructing the symlinks, not reading through them. Currently I don’t have a plan to support such an option.

If you create a symlink to julia.exe, it will not run, because its dependencies are searched for in the context of the symlink instead of the original file. For example, if julia.exe (or any executable that ships DLLs in the same directory as the executable) is not on your PATH, you can create a symlink to julia.exe in cmd:

mklink julia.exe full/path/to/julia.exe
./julia.exe 

The last line won’t work because its dependencies, i.e. the DLLs, are not available.

What if you also make symlinks to those DLLs in the current path?

Cool! It’s great to see people working in this space, there are lots of interesting problems to be solved.

It’s made for caching artifacts generated at runtime…

This is similar in some ways to an idea I’ve used a few times that I usually called “Mutable Artifacts”. The only existing implementation of this that I know of is in Gtk.jl. We create a new MutableArtifacts.toml file that stores the cached result of an expensive operation (in this case, building the gdk-pixbuf-loader cache), and use the presence of that key within the MutableArtifacts.toml file as a signal as to whether or not the loader cache has been built previously. Note that this code segment predates scratch spaces, which is why it stores the MutableArtifacts.toml in the package directory. It really should be updated to instead store the MutableArtifacts.toml file itself in a scratchspace.
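
For reference, that pattern could be sketched roughly as follows with the Pkg.Artifacts API, keeping the MutableArtifacts.toml in a scratch space as suggested (the names here are illustrative, not the actual Gtk.jl code):

using Pkg.Artifacts, Scratch

toml = joinpath(@get_scratch!("mutable_artifacts"), "MutableArtifacts.toml")
art_hash = artifact_hash("expensive_result", toml)
if isnothing(art_hash) || !artifact_exists(art_hash)
    art_hash = create_artifact() do dir
        # run the expensive step once and write its output into the artifact directory
        write(joinpath(dir, "result.cache"), "built")
    end
    bind_artifact!(toml, "expensive_result", art_hash)
end
result_dir = artifact_path(art_hash)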

…and compute hash directly on the generated files.

Do you mean that you hash each individual file within a directory, or that you just support hashing a single file directly, ignoring its filename? When designing Artifacts.jl, I initially struggled with the idea of including single files, because if you identify them purely by a content hash, the filename could change without the content hash knowing. This could cause problems when distributing artifacts that depend on each other (such as in the case of JLL packages), so I decided to limit artifacts to only deal on a directory basis; e.g. an artifact is a container that holds files, not a file itself. I believe there is room for improvement here!

One thing that I think is worth bearing in mind when comparing Artifacts.jl to this system is that Artifacts.jl makes a lot of choices with the assumption that you eventually want to ship an artifact to another machine. Operating on directories instead of files, identifying everything by hash instead of by name, using git treehash semantics, etc… these are all decisions rooted in the idea that we are eventually going to need to efficiently cache these files (identifying by hash makes this easy), unpack them on disk in a consistent manner (git treehash semantics ensure that whatever makes it onto disk is exactly the same as what was packed up), etc…

While it is theoretically possible to implement most of the functionality you’ve said so far by using the plain Artifacts API within an Artifacts.toml that is stored in a scratchspace, I actually don’t think it makes sense to use our API if you’re not planning on distributing the artifacts. Some other API probably makes more sense. In this case, I think what you’re trying to do is to combine the idea of scratchspaces with a de-duplicating block storage, which uses SHA256 hashes to detect files that are identical, and doesn’t store them twice. This is kind of interesting, and I think separating yourself from the Artifacts API might actually make it easier to get something that is ergonomic and fun to use, since things like the underlying hash and whatnot aren’t really useful for you here. They’re all just an implementation detail of the deduplicator, I think.
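
As a toy illustration of that de-duplicating block storage idea (not OhMyArtifacts’ actual internals), files could be stored once under their content hash and looked up by that key:

using SHA

function dedup_store!(store::AbstractString, file::AbstractString)
    key = bytes2hex(open(sha256, file))    # identity is the content hash, not the name
    dest = joinpath(store, key[1:2], key)  # shard by the first two hex characters
    mkpath(dirname(dest))
    isfile(dest) || cp(file, dest)         # identical content is stored only once
    return dest
end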

4 Likes

Yes, this is exactly the idea of OhMyArtifacts. The origin of this package is something similar to the “Mutable Artifacts” approach that I implemented in Transformers.jl for managing downloaded models, which also predates scratch spaces and suffered from the package folder becoming read-only. That’s why the API mimics Artifacts.jl’s API. This could be changed, but I haven’t thought of or encountered anything that needs more than that. I would be interested if you have any ideas for improving it.

This might best be illustrated with an example. After running part of the iris example above, the storage looks like this:

julia> cd(OhMyArtifacts.get_artifacts_dir())

julia> run(`tree .`)
.
├── 32
│   └── 32fefb84e05232696cda7de74c34fe56bcfab1e07415ebde5a148c55063c402a
├── 38
│   └── 38043f885d7c8cfb6d2cec61020b9bc6946c5856aadad493772ee212ef5ac891
├── 83
│   └── 83c1aca5f0e9d222dee51861b3def4e789e57b17b035099570c54b51182853d4
│       ├── bezdekIris.csv -> /home/peter/.julia/scratchspaces/cf8be1f4-309d-442e-839d-29d2a0af6cb7/0.3/artifacts/f7/f7e1c1bddf54ded0d06b264d7f260d96e5157592ab17dc0d5fdb890d0b243d8b
│       ├── iris.csv -> /home/peter/.julia/scratchspaces/cf8be1f4-309d-442e-839d-29d2a0af6cb7/0.3/artifacts/32/32fefb84e05232696cda7de74c34fe56bcfab1e07415ebde5a148c55063c402a
│       └── iris.names -> /home/peter/.julia/scratchspaces/cf8be1f4-309d-442e-839d-29d2a0af6cb7/0.3/artifacts/38/38043f885d7c8cfb6d2cec61020b9bc6946c5856aadad493772ee212ef5ac891
└── f7
    └── f7e1c1bddf54ded0d06b264d7f260d96e5157592ab17dc0d5fdb890d0b243d8b

I added a check in create_my_artifact for whether the returned path points to a file or a directory. If it is a file, I hash the content and ignore the filename. OTOH, if it is a directory, I first hash the whole directory; so in the example, you can see “83c1aca5f0e9d222dee51861b3…”, which is actually the tree hash of the (original) directory. Then I replace every file inside with a symlink pointing to the real file by absolute path. So if the directory structure is important, it can also be preserved.
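
For example, a sketch of the single-file case (the file name and content are placeholders):

file_hash = create_my_artifact() do working_dir
    file = joinpath(working_dir, "blob.bin")
    write(file, "content generated at runtime")
    return file  # a file => content hash, filename ignored
end              # a directory => tree hash + shadow folder of symlinks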

1 Like