My experiences using Pkg.Artifacts for test data

I really like the Pkg.Artifacts mechanism, but I found it a little difficult to figure out how it works from the current documentation. To help others use this really slick tool, I’ll mention a few of the stumbling points I hit along the way.

First, Artifacts more or less assumes that your data files are packaged in a Unix-friendly bundle. The documentation alludes to this, but it only becomes clear when you read the code. My data compressed really well (64 MB => 2.5 MB), so I really didn’t want to download it uncompressed. “tar.gz” works great; as a Windows person, I used 7-Zip to create it. (First tar the files, then gzip the tarball. 7-Zip can’t do both in one step.) In one case I had 100 separate data files, and in the other, two. I bundled each set into its own “tar.gz” file and uploaded them to Google Drive. After Artifacts downloads and verifies a “tar.gz” file, it automatically unpacks it into the individual files.

Second, the sharing link that Google Drive gives you assumes you are working in a browser. It will download, but the hash changes every time. However, you can edit the URL so that the file itself downloads directly. For me, changing
https://drive.google.com/open?id=1XpI4sAJe8pGn--sRolonCWdkadd67FRw
to
https://drive.google.com/uc?export=download&id=1XpI4sAJe8pGn--sRolonCWdkadd67FRw
makes all the difference.
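
If you have several files to convert, the same edit can be scripted. This is only a sketch: `direct_download_url` is a name I made up, and it assumes the share link carries the file id in an `id=` query parameter, as in the example above.

```julia
# Hypothetical helper: pull the file id out of a Google Drive share link and
# build the direct-download form of the URL shown above.
function direct_download_url(share_url::AbstractString)
    m = match(r"[?&]id=([A-Za-z0-9_-]+)", share_url)
    m === nothing && error("no id= parameter found in $share_url")
    return "https://drive.google.com/uc?export=download&id=" * m.captures[1]
end

url = direct_download_url("https://drive.google.com/open?id=1XpI4sAJe8pGn--sRolonCWdkadd67FRw")
println(url)
```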

Third, as best I can tell, the actual value of “git-tree-sha1” in the Artifacts.toml file doesn’t matter. Any unique 40-character hex string will work. It just determines the directory in which the artifact is stored.
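
If you would rather record a real tree hash than an arbitrary string, `create_artifact` from Pkg.Artifacts can compute one from a directory of files. The sketch below writes a single placeholder file (`example.txt` is hypothetical); in practice you would copy your unpacked data files into `dir` instead. Note that this also stores a copy of the files in your local artifact store.

```julia
using Pkg.Artifacts

# create_artifact hands us a fresh directory, lets us fill it, and returns
# the content hash (a SHA1) of the resulting directory tree.
tree_hash = create_artifact() do dir
    # Placeholder content; copy your real, unpacked data files here instead.
    write(joinpath(dir, "example.txt"), "placeholder data")
end

# 40-character hex string suitable for the git-tree-sha1 field.
tree_hash_hex = bytes2hex(tree_hash.bytes)
println(tree_hash_hex)
```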

Fourth, all you need is an Artifacts.toml file (next to your Project.toml file). While there is a programmatic interface, it isn’t really necessary for simple use cases like mine. My Artifacts.toml looks like this…

[shooter0]
git-tree-sha1 = "2ef1ba92161ff8021c162a0d1fe9468192f03909"
lazy = true
  [[shooter0.download]]
  sha256 = "D5F2F40778A3598A22394BEE65169E5D9BDABF2D57B64C15EF229C4273E71769"
  url = "https://drive.google.com/uc?export=download&id=1XpI4sAJe8pGn--sRolonCWdkadd67FRw"

The SHA256 hash is the hash of the tar.gz file.
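
One way to get that value is the SHA standard library. In this sketch `file_sha256` is a name I made up, and the demo hashes a throwaway temp file so it runs anywhere; you would pass the path to your own tar.gz instead.

```julia
using SHA  # standard library

# Hypothetical helper: SHA-256 of a file as a lowercase hex string.
file_sha256(path::AbstractString) = bytes2hex(open(sha256, path))

# Demo on a temporary file; replace `demo` with the path to your tar.gz.
demo = tempname()
write(demo, "hello")
digest = file_sha256(demo)
println(digest)
```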
Then accessing the data is as simple as

using Pkg.Artifacts   # provides the artifact"..." string macro
data = artifact"shooter0"

The first time the artifact macro is executed, the data is downloaded from Google Drive and the SHA256 hash is checked against the downloaded file. If they match, the contents of the tar.gz file are extracted into the artifact directory, whose path is returned and assigned to data.

Finally, subsequent calls just return the data path, which from that point on is assumed to contain the correct files. If a file is deleted, the Artifacts mechanism won’t notice or fix the problem.
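
Since a damaged artifact directory isn’t repaired automatically, one workaround is to remove the cached copy so that it gets installed again on the next use. A minimal sketch using functions from Pkg.Artifacts: `reset_artifact` is a hypothetical helper name, and the demo binds a throwaway artifact in a temporary Artifacts.toml rather than touching real data.

```julia
using Pkg.Artifacts

# Hypothetical helper: forget the cached copy of a named artifact so that it
# is installed afresh the next time it is requested.
function reset_artifact(name::String, artifacts_toml::String)
    hash = artifact_hash(name, artifacts_toml)
    hash === nothing && error("no artifact named $name in $artifacts_toml")
    artifact_exists(hash) && remove_artifact(hash)
    return hash
end

# Demo with a throwaway artifact bound in a temporary Artifacts.toml.
toml = joinpath(mktempdir(), "Artifacts.toml")
demo_hash = create_artifact(dir -> write(joinpath(dir, "a.txt"), "x"))
bind_artifact!(toml, "demo", demo_hash)
was_installed = artifact_exists(demo_hash)
reset_artifact("demo", toml)
still_installed = artifact_exists(demo_hash)
println((was_installed, still_installed))
```

For a real package you would pass the path to its own Artifacts.toml and the artifact name, for example "shooter0".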

Hope this helps a few people…


Cross-referencing this thread which also has some info for creating artifacts based on an existing tarball.
