There seems to be broad consensus that using artifacts is the preferred way of hosting data for a package. Say, hypothetically, I were to make a package which makes it easier for users to download datasets from Hayashi’s econometrics textbook.
Thanks for this! This definitely seems like what I’m looking for.
Two questions:
How do you manage updates across new versions of MixedModels? Let’s say you add an example dataset and release a new version of the package. It looks like you just make a new .tar.gz and then upload it to the OSF link under a new version?
If a package is truly “just” for data, presumably you don’t want to take on an Arrow.jl dependency; the user can sort that part out themselves.
I would imagine the user’s experience would be something like
```julia
using Arrow
t = MyDataPackage.get_arrow_filepath(dataname)  # get the file path from the artifact
Arrow.Table(t)
```
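For what it’s worth, here is one way such an accessor could look under the hood — a minimal sketch assuming the package ships an artifact named `hayashi_data` containing one `.arrow` file per dataset (the package name, artifact name, and layout are all invented for illustration):

```julia
# Hypothetical internals for MyDataPackage. Real packages more commonly use
# the artifact"..." string macro, which needs an Artifacts.toml next to the
# source file; the runtime API below does the same lookup explicitly.
module MyDataPackage

using Artifacts: find_artifacts_toml, artifact_hash, artifact_path

# Return the on-disk path of a dataset's .arrow file inside the artifact.
function get_arrow_filepath(dataname)
    toml = find_artifacts_toml(@__DIR__)
    toml === nothing && error("no Artifacts.toml found for this package")
    hash = artifact_hash("hayashi_data", toml)
    hash === nothing && error("artifact 'hayashi_data' is not bound in $toml")
    path = joinpath(artifact_path(hash), string(dataname, ".arrow"))
    isfile(path) || error("no dataset named $dataname")
    return path
end

end # module
```

The artifact is only downloaded when first accessed (if bound as lazy), so the package itself stays tiny.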
For question 1: yes, whenever we add or modify datasets we create a new .tar.gz file and update Artifacts.toml, as described in the comments (thanks to @palday for documenting that process).
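For anyone following along, that release step can be sketched with the `Pkg.Artifacts` API. The directory name, artifact name, and URL below are placeholders, not the actual MixedModels setup:

```julia
using Pkg.Artifacts

data_dir = "data"  # placeholder: directory containing the dataset files
url = "https://osf.io/example/download"  # placeholder for the real OSF URL

mkpath(data_dir)  # for this sketch only; normally the data already exists

# Create the artifact tree by copying the data files in; returns the
# git-tree-sha1 that identifies the artifact.
hash = create_artifact() do dir
    for f in readdir(data_dir)
        cp(joinpath(data_dir, f), joinpath(dir, f))
    end
end

# Write the .tar.gz to upload (e.g. to OSF); returns its sha256.
tarball_sha256 = archive_artifact(hash, "datasets.tar.gz")

# Record (or overwrite) the entry in Artifacts.toml with both hashes.
bind_artifact!("Artifacts.toml", "hayashi_data", hash;
               download_info = [(url, tarball_sha256)],
               lazy = true, force = true)
```

After uploading the tarball, committing the updated Artifacts.toml and tagging a new package release is all that’s left.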
Regarding question 2, we more or less committed to using Arrow, so the dependency on Arrow.jl makes sense for us. We have two unexported functions: datasets, which lists the available dataset names, and dataset, which retrieves a dataset by name. The intention is that if other packages provide non-exported versions of these functions, the pattern could be
```julia
PackageName.datasets()           # returns names of datasets available in PackageName
foo = PackageName.dataset(:foo)  # retrieve dataset "foo" from PackageName
```
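A data-only package could provide that pair of functions along these lines. This is a sketch assuming a single artifact of `.arrow` files, with an invented artifact name, and is not MixedModels’ actual implementation:

```julia
module PackageName

using Arrow
using Artifacts: find_artifacts_toml, artifact_hash, artifact_path

# Path to the artifact directory; "example_data" is an invented artifact name.
function _datadir()
    toml = find_artifacts_toml(@__DIR__)
    artifact_path(artifact_hash("example_data", toml))
end

"Return the names of the datasets available in the package."
datasets() = [Symbol(first(splitext(f))) for f in readdir(_datadir())
              if endswith(f, ".arrow")]

"Retrieve dataset `name` as an Arrow.Table."
function dataset(name)
    path = joinpath(_datadir(), string(name, ".arrow"))
    isfile(path) || throw(ArgumentError("no dataset named $name"))
    Arrow.Table(path)
end

end # module
```

Keeping both functions unexported means any number of data packages can follow the same pattern without name clashes.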
@quinnj recently added functions for setting and getting metadata on an Arrow table or column, and these could be used to document the datasets.
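To illustrate the idea (the metadata API in Arrow.jl has evolved across versions, so treat the exact calls here as a sketch): a description can be stored when the file is written and read back alongside the table.

```julia
using Arrow

tbl = (x = [1.0, 2.0, 3.0], y = ["a", "b", "c"])  # any Tables.jl-compatible table

# Attach documentation as custom metadata when writing the file.
Arrow.write("demo.arrow", tbl;
            metadata = ["description" => "toy dataset for this example",
                        "source" => "synthetic"])

# Reading the table back, the metadata travels with the file.
t = Arrow.Table("demo.arrow")
meta = Arrow.getmetadata(t)  # mapping of the stored key => value pairs
```

That way the documentation lives inside the data file itself rather than only in docstrings.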