Workflow for hosting "data" packages with artifacts

pdeffebach · November 27, 2020, 3:17pm

There seems to be broad consensus that using artifacts is the preferred way of hosting data for a package. Say, hypothetically, I were to make a package which makes it easier for users to download datasets from Hayashi’s econometrics textbook.

The data is posted online here. But what if

It weren’t hosted online at all, or
I didn’t trust whoever maintains that website, so I couldn’t be sure that the URLs are stable or that the website will exist indefinitely?

It seems like the best practice would be to

Find an OSS-friendly service that can host data indefinitely with a stable URL
Host the data there in a convenient format (say, .csv)
Use the artifacts infrastructure to download the data from that website as needed.

Do any packages actually do this? Does anyone have experience solving this exact problem?

Please let me know any thoughts.

dmbates · November 27, 2020, 3:27pm

In MixedModels.jl we use osf.io to host datasets in Arrow format for use in an artifact.

pdeffebach · November 27, 2020, 3:56pm

Thanks for this! This definitely seems like what I’m looking for.

Two questions:

How do you manage updates to new versions of MixedModels? Let’s say you add an example dataset and release a new version of the package. It looks like you just make a new .tar.gz and then upload it to the osf link with a new version?
If a package is truly “just” for data, presumably you don’t want an Arrow.jl dependency. The user can load Arrow.jl themselves to read the file.

I would imagine the user’s experience would be something like

using Arrow
# get_arrow_filepath (hypothetical) returns the path of the file inside the artifact
t = MyDataPackage.get_arrow_filepath(dataname)
Arrow.Table(t)

Does this seem reasonable?

dmbates · November 27, 2020, 4:47pm

For question number 1, yes: whenever we add or modify datasets we create a new .tar.gz file and modify Artifacts.toml as described in the comments (thanks to @palday for documenting that process).
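
For readers who have not done this before, the "new .tar.gz plus updated Artifacts.toml" step could be sketched with the Pkg.Artifacts standard library roughly as below. This is only an illustration under assumptions, not the actual MixedModels.jl procedure: the artifact name "hayashi_data", the example.org URL, the temporary paths, and the tiny CSV are all placeholders.

```julia
using Pkg.Artifacts

# Placeholders (assumptions): in a real package the TOML file lives next to Project.toml
# and the URL points at the hosting service where the tarball is uploaded.
artifacts_toml = joinpath(mktempdir(), "Artifacts.toml")
tarball_url = "https://example.org/hayashi_data.tar.gz"

# Everything written into `dir` becomes the content of the artifact.
tree_hash = create_artifact() do dir
    write(joinpath(dir, "example.csv"), "x,y\n1,2\n3,4\n")  # stand-in for a real dataset
end

# Pack the artifact into a tarball for upload; the return value is the tarball's SHA-256.
tarball_path = joinpath(mktempdir(), "hayashi_data.tar.gz")
tarball_sha256 = archive_artifact(tree_hash, tarball_path)

# Record the name, content hash, and download location in Artifacts.toml.
# `force = true` overwrites an existing entry of the same name, as when releasing new data.
bind_artifact!(artifacts_toml, "hayashi_data", tree_hash;
               download_info = [(tarball_url, tarball_sha256)],
               force = true)
```

The tarball at `tarball_path` would then be uploaded by hand to the host, and the URL in Artifacts.toml has to match wherever it ends up.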

Regarding question 2, we sort-of committed to using Arrow, so the dependency on Arrow.jl makes sense for us. We have two functions, datasets to list the available dataset names and dataset to retrieve a dataset by name, neither of which we export. The intention is that if other packages provide non-exported versions of those functions, then the pattern could be

PackageName.datasets()  # returns names of datasets available in Package
foo = PackageName.dataset(:foo)  # retrieve dataset "foo" from Package
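
For a data-only package that wants to avoid the Arrow.jl dependency, the same two-function pattern might be sketched as follows. Everything here is hypothetical: the module name MyDataPackage, the DATADIR reference, and the choice to have dataset return a file path rather than a loaded table are assumptions for illustration, and a temporary directory stands in for the artifact directory.

```julia
module MyDataPackage  # hypothetical package name

# Stand-in for the artifact directory. In a real package this would typically be
# artifact"hayashi_data" (with `using Artifacts` and an Artifacts.toml in the package root).
const DATADIR = Ref("")

# Names of the available datasets, taken from the .csv files in the data directory.
datasets() = sort([Symbol(first(splitext(f))) for f in readdir(DATADIR[]) if endswith(f, ".csv")])

# Path of the file for dataset `name`; the caller chooses the reader (CSV.jl, Arrow.jl, ...).
function dataset(name::Symbol)
    path = joinpath(DATADIR[], string(name, ".csv"))
    isfile(path) || throw(ArgumentError("no dataset named $name; see datasets()"))
    return path
end

end # module

# Usage, with a temporary directory playing the role of the downloaded artifact
MyDataPackage.DATADIR[] = mktempdir()
write(joinpath(MyDataPackage.DATADIR[], "foo.csv"), "x,y\n1,2\n")

names = MyDataPackage.datasets()      # names of the datasets available in the package
foo_path = MyDataPackage.dataset(:foo)  # path to the file for dataset "foo"
```

Neither function is exported, so callers always qualify them with the package name, as in the pattern above.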

@quinnj recently added functions for setting and getting metadata from an Arrow table or column and these could be used for documentation of the datasets.
