Workflow for hosting "data" packages with artifacts

There seems to be broad consensus that using artifacts is the preferred way of hosting data for a package. Say, hypothetically, I were to make a package which makes it easier for users to download datasets from Hayashi’s econometrics textbook.

The data is posted online here. But what if

  1. It wasn’t hosted online at all
  2. I don’t trust whoever maintains that website: I don’t know whether the URLs are stable or whether the website will exist indefinitely.

It seems like best practice would be to

  1. Find an OSS-friendly service that can host data indefinitely with a stable URL
  2. Host the data there in a convenient format (say, .csv)
  3. Use the artifacts infrastructure to download the data from that website as needed.
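The end state of step 3 is an Artifacts.toml entry in the package that records where the tarball lives and how to verify it. A hypothetical entry might look something like this (the artifact name and URL are made up, and the hash values are placeholders):

```toml
# Hypothetical Artifacts.toml entry; name, URL, and hashes are placeholders.
[hayashi_data]
git-tree-sha1 = "<git-tree-sha1 of the unpacked data directory>"
lazy = true  # download only when the artifact is first used

    [[hayashi_data.download]]
    url = "https://example.org/hayashi_data.tar.gz"  # the stable hosting URL
    sha256 = "<sha256 of the tarball>"
```

With `lazy = true`, users who never touch the data never download it.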

Do any packages actually do this? Does anyone have experience solving this exact problem?

Please let me know any thoughts.

In MixedModels.jl we use osf.io to host datasets in Arrow format for use in an artifact.

Thanks for this! This definitely seems like what I’m looking for.

Two questions:

  1. How do you manage updates to new versions of MixedModels? Let’s say you add an example dataset and release a new version of the package. It looks like you just make a new .tar.gz and then upload it to the osf link with a new version?
  2. If a package is truly “just” for data, presumably you don’t want an Arrow.jl dependency; the user can handle reading the format themselves.

I would imagine the user’s experience would be something like

using Arrow
t = MyDataPackage.get_arrow_filepath(dataname)  # get the file path from the artifact
Arrow.Table(t)

Does this seem reasonable?

For question 1: yes, whenever we add or modify datasets, we create a new .tar.gz file and modify Artifacts.toml as described in the comments (thanks to @palday for documenting that process).
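For anyone following along, that create-and-bind step can be scripted with Pkg.Artifacts. A minimal sketch, with placeholder file contents and a hypothetical upload URL:

```julia
# Sketch of the release workflow (artifact name, file, and URL are
# hypothetical): build the artifact locally, archive it as the .tar.gz
# to upload, and record it in Artifacts.toml.
using Pkg.Artifacts

# 1. Create the artifact tree; in practice, copy the real dataset files here.
hash = create_artifact() do dir
    write(joinpath(dir, "example.csv"), "x,y\n1,2\n")
end

# 2. Archive it as the tarball that gets uploaded to the hosting service;
#    archive_artifact returns the tarball's sha256.
tarball_sha = archive_artifact(hash, "hayashi_data.tar.gz")

# 3. Bind it in Artifacts.toml; the URL should point at the uploaded tarball.
bind_artifact!("Artifacts.toml", "hayashi_data", hash;
               download_info=[("https://example.org/hayashi_data.tar.gz", tarball_sha)],
               lazy=true, force=true)
```

Re-running this with `force=true` after adding datasets overwrites the old binding, which matches the "new tarball, new entry" workflow described above.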

Regarding question 2, we more or less committed to using Arrow, so the dependency on Arrow.jl makes sense for us. We have two unexported functions: datasets, which lists the available dataset names, and dataset, which retrieves a dataset by name. The intention is that if other packages provide non-exported versions of those functions, then the pattern could be

PackageName.datasets()  # returns names of datasets available in Package
foo = PackageName.dataset(:foo)  # retrieve dataset "foo" from Package

@quinnj recently added functions for setting and getting metadata from an Arrow table or column and these could be used for documentation of the datasets.
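For a data-only package that wants to avoid the Arrow.jl dependency (question 2 above), the same pattern can return file paths instead of parsed tables. A minimal sketch, assuming an artifact named "data" and hypothetical dataset names:

```julia
# Hypothetical accessors for a data-only package: they return file paths,
# so reading with Arrow.jl is left to the caller and this package carries
# no Arrow dependency. The artifact name "data" and the dataset names are
# assumptions for illustration.
using Artifacts: artifact_hash, artifact_path, find_artifacts_toml

const DATASETS = (:foo, :bar)  # hypothetical bundled dataset names

"List the names of the datasets bundled with this package."
datasets() = collect(DATASETS)

"Return the on-disk path of dataset `name`; read it with Arrow.Table(path)."
function get_arrow_filepath(name::Symbol)
    name in DATASETS || throw(ArgumentError("unknown dataset: $name"))
    toml = find_artifacts_toml(@__DIR__)  # locate this package's Artifacts.toml
    hash = artifact_hash("data", toml)    # assumes the artifact is named "data"
    return joinpath(artifact_path(hash), "$name.arrow")
end
```

The caller then does `Arrow.Table(MyDataPackage.get_arrow_filepath(:foo))`, keeping the format-specific dependency on their side.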
