A couple of questions about Artifacts

Coming from an R background where it is common to have data sets included with packages I have been thinking about similar capabilities for Julia packages. One approach would be to use Artifacts as described in Pkg + BinaryBuilder -- The Next Generation. However, despite the fact that the Iris data is used as an example in that blog, I’m still a little foggy about some of the concepts and how they would apply to “data packages”.

One source of confusion for me is the name git-tree-sha1. I now believe that what is labelled git-tree-sha1 in an Artifacts.toml file doesn’t really have anything to do with git. It is just a label and a directory name in what is usually ~/.julia/artifacts/. Is that correct? As Tina Turner said, I couldn’t figure out what git has to do with it.

I expect what I would do is create a set of feather files and keep them all within one directory which would be the artifact. I don’t think I would create one artifact per data table (feather file) but instead put a number of data files in one artifact.

I’m still trying to comprehend the overall structure for Artifacts. Is the code related to the iris data in that blog posting to be part of a package, say Iris_data, that will then be a dependency for the package that uses the data?


The git tree hash is a hash of the unpacked content of the tarball, computed the way git hashes a directory tree. But this is not something that you have to do yourself; bind_artifact! should take care of everything.

If you want to see a package using Artifacts for data, have a look at ObjectDetector.jl. The file Artifacts.toml has been created with the dev/artifacts/generate_artifacts.jl script.
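For a data-only artifact like the one asked about above, a minimal sketch with the Pkg.Artifacts API might look like the following. The artifact name, the file names, and the temporary location of Artifacts.toml are all made up for illustration; a real package would keep Artifacts.toml in its root directory.

```julia
using Pkg.Artifacts

# Hypothetical location; a real package keeps Artifacts.toml in its root directory.
artifacts_toml = joinpath(mktempdir(), "Artifacts.toml")

# create_artifact hands us an empty directory and hashes whatever we put in it.
# The file names and contents are invented for this example.
tree_hash = create_artifact() do dir
    write(joinpath(dir, "table_a.csv"), "x,y\n1,2\n")
    write(joinpath(dir, "table_b.csv"), "u,v\n3,4\n")
end

# Record the name => hash mapping in Artifacts.toml.
bind_artifact!(artifacts_toml, "example_tables", tree_hash; force=true)

# The artifact lives in a directory named after its hash.
data_dir = artifact_path(tree_hash)
println(readdir(data_dir))
```

Several data files sit in one artifact here, which matches the "one directory of tables" idea from the first post. Without download information this entry only resolves on the machine where the artifact was created.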


First I have to say that I am very happy with the recent progress in creating portable applications with artifacts + package compiler!

I am wondering if it is possible to put an artifact in the package repo, e.g. under the data directory. Then, in the Artifacts.toml, you would refer to data/myfile. This seems to me considerably easier than what is done in dev/artifacts/generate_artifacts.jl. The benefit of doing this is that the resulting code would be transportable, which would not be the case if we used @__DIR__ to refer to these files.

The reason the artifacts code works when relocated is because Julia has special code to find Pkg directories at runtime, and all paths are based off of that. The artifacts code doesn’t magically solve the issue of @__DIR__ not working properly at runtime when a precompiled package has been transported. Artifacts do not have any idea where your package code is, and won’t be able to find something like data/myfile any better than any other piece of Julia code.

If you want to do something like what Pkg does to find its depots, you could have something like the following in your package’s __init__():

using Pkg

# This will be re-initialized at __init__() time.
pkg_dir = @__DIR__

function __init__()
    # Get the current manifest, look up our own package by UUID
    env = Pkg.Types.Context().env
    pkg_uuid = Pkg.Types.UUID("12aac903-9f7c-5d81-afc2-d9565ea332ae")
    entry = env.manifest[pkg_uuid]

    # Convert the PkgEntry into a PackageSpec
    spec = Pkg.Types.PackageSpec(
        name=entry.name,
        uuid=pkg_uuid,
        version=entry.version,
        path=entry.path,
        repo=entry.repo,
        tree_hash=entry.tree_hash,
    )

    # Ask `Pkg` to find our source path
    global pkg_dir = Pkg.Operations.source_path(spec)
end

So as long as the package is still in its “correct” location (relative to the julia depot) at runtime, this should reconstruct the location to the source. I haven’t tested this thoroughly, but this is the best idea I have for now for how to solve your issue. :slight_smile:


Thanks for your quick response!

My package currently has a setup similar to that of XLSX.jl. It has a data directory that contains empty Excel files as templates. These are my binary files; they are currently loaded by using the absolute file path derived using @__DIR__:

const EMPTY_EXCEL_TEMPLATE = joinpath(@__DIR__, "..", "data", "blank.xlsx")

The issue with this is that this setup no longer works when I try to create a relocatable app that uses XLSX.jl. Two reasons are (i) @__DIR__ points to my local directory, not the one on the new system, and (ii) PackageCompiler does not know it should include this file in the application.

This issue can be solved by converting this template file into an artifact. My question is: can I tell the artifact system to get the artifact from the data directory, e.g. by using a path relative to the Artifacts.toml file, instead of from a URL? This will make the app relocatable, because there is no longer a reference to @__DIR__, and PackageCompiler will ship the file. Using your code, I would find the correct @__DIR__ on the new system, but it would not help me get the file blank.xlsx.

The benefit of this option as opposed to the URL option is that it is no longer required for a package to go through the process of uploading files to a specific location, hosting a server to serve the files, and taking care of access issues to the server in case you are working behind a company firewall.


Aside:
This is a thing you can do with DataDeps.jl using a ManualDataDep.
Folders in MyPackage/deps/data/ are on the datadeps load path if accessed from within that package.

I wouldn’t particularly advise it, since storing large binary files in git leads to suffering.

The point of artifacts is that they are content-addressed and you don’t care about their location, so I don’t think that’s what you want here since you’re asking for “artifacts but addressed by local path”. What I think is wanted here is a straightforward way to refer to local data that is relocatable. This is similar to the way that include is always relative to the source file in which the include call occurs. Using @__DIR__ allows that but makes code non-relocatable, unfortunately.
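To make "content-addressed" concrete, here is a small sketch using Pkg.Artifacts. The helper name make_artifact and the file name are hypothetical; the point is only that the hash, and hence the storage directory, is determined by what is inside the artifact rather than by where it came from.

```julia
using Pkg.Artifacts

# Hypothetical helper: build an artifact holding a single made-up file.
function make_artifact(text)
    return create_artifact() do dir
        write(joinpath(dir, "myfile.txt"), text)
    end
end

h1 = make_artifact("same content")
h2 = make_artifact("same content")
h3 = make_artifact("different content")

# Identical content gives the identical hash and therefore the same directory.
println(h1 == h2)
println(artifact_path(h1) == artifact_path(h2))

# Changing the content changes the hash.
println(h1 == h3)
```

I would expect the first two comparisons to hold and the third to fail, which is why an entry keyed on a local path does not fit the artifact model well.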

I don’t think this needs to be an artifact unless you want to take advantage of the properties of the latter (one of which is that you don’t need to care much about “location”).

Instead, I would do something like

module MyPackage
pkg_path(parts...) = normpath(@__DIR__, "..", parts...)
end

and then MyPackage.pkg_path("data", "myfile") should always refer to the correct file.

@oxinabox Thank you for the suggestion; I will look into this option.

@Tamas_Papp Using @__DIR__ makes the app non-relocatable.

I don’t care about the location; I only need the file. In the case of XLSX.jl, we need the Excel file blank.xlsx. To achieve this, it would be great if I could write the following in the artifact file, for example:


[blank_xlsx]
git-tree-sha1 = "43563e7631a7eafae1f9f8d9d332e3de44ad7239"

    [[blank_xlsx.get]]
    relpath = "data/blank.xlsx"
    sha256 = "e65d2f13f2085f2c279830e863292312a72930fee5ba3c792b14c33ce5c5cc58"

This would make it very easy to make e.g. XLSX.jl relocatable.

I think relocatable apps can be a great selling point for Julia. Coming from Python, I have always greatly appreciated the simplicity of handing users a binary that is generated e.g. by C++. Great progress has already been made towards achieving this in Julia with Artifacts and PackageCompiler. A required next step is to provide relocatable alternatives to @__DIR__ (such as proposed above) and then discourage (or warn about, deprecate, or remove?) @__DIR__.


@sdewaele, at the moment, the Artifacts.toml file can only specify downloads of gzipped tarballs. I too have found myself wanting to include and track files as Artifacts. Perhaps deps/build.jl can download files and handle various aspects of reproducibility for you?

Have a look at ArtifactHelpers.jl (CiaranOMara/ArtifactHelpers.jl on GitHub, “Bind and initialise reproducible Artifacts”) to see my current workflow. Maybe there is something there that’ll help.
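For comparison with the relpath idea above, this is roughly what the download-based route looks like with the plain Pkg.Artifacts API. It is a sketch under assumptions: the URL is hypothetical, the file content is a placeholder rather than a real workbook, and uploading the tarball is left out.

```julia
using Pkg.Artifacts

workdir = mktempdir()
artifacts_toml = joinpath(workdir, "Artifacts.toml")
tarball = joinpath(workdir, "blank_xlsx.tar.gz")

# Placeholder content; a real template would be copied into `dir` instead.
tree_hash = create_artifact() do dir
    write(joinpath(dir, "blank.xlsx"), "placeholder bytes, not a real workbook")
end

# archive_artifact packs the artifact into a gzipped tarball
# and returns the SHA-256 of that tarball as a hex string.
tarball_sha256 = archive_artifact(tree_hash, tarball)

# Hypothetical URL: the tarball still has to be uploaded there by hand or by CI.
url = "https://example.com/downloads/blank_xlsx.tar.gz"

# download_info is a vector of (url, sha256) pairs written into Artifacts.toml.
bind_artifact!(artifacts_toml, "blank_xlsx", tree_hash;
               download_info=[(url, tarball_sha256)], force=true)

print(read(artifacts_toml, String))
```

The printed file should contain a download entry with url and sha256 keys next to the git-tree-sha1, which is the part a local-path variant would have to replace.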

Thank you for sharing this. I’ll look into that as well.

Sorry to revive the old thread, but I’ve tried following the ObjectDetector method of using GitHub releases to store the tarball artifacts, and I’m running into cert issues on GitHub Actions CI:

Downloading artifact: tessdata_eng
Downloading artifact: tessdata_eng
curl: /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libcurl.so.4: no version information available (required by curl)
curl: /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libcurl.so.4: no version information available (required by curl)

curl: (60) Cert verify failed: BADCERT_NOT_TRUSTED
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

It works fine locally for me, however. The CI run is for commit 76f3723 of ericphanson/SearchablePDFs.jl (“Clean up readme; add test; fix CI”), and the Artifacts.toml is the one at that same commit (76f3723eb07abe03868b479207bec0321e2d4f69).