Providing datasets in a package

Hi,

I want to include a dataset in my package, in the form of an object (say of type Mytype from my package), so that import Mypackage will provide a Mypackage.myobject binding for this dataset.

I might have to do this for several datasets, and these objects are a bit complicated to instantiate. For the sake of argument, suppose they are constructed from a .csv file through a potentially long-running function build_my_object("this_particular_file.csv").

The issue is that I cannot serialize the object itself, since the definition of Mytype and of build_my_object might change from one package version to another.

Likewise, the .csv files might change from one version to the other… at least during early development.

How can I set up a system that builds the objects at precompilation time?

I’m a bit lost between DataDeps.jl, Pkg.Artifacts, ArtifactUtils.jl… Is there a guide on this particular problem that I could follow, and/or an example of a package that already does this?

You don’t have to do anything special to get it built at precompilation. Just

module MyPackage
const data = long_running_data_preparation()
end

will serialize MyPackage.data as part of the precompilation and import MyPackage will deserialize it.

You may or may not want to use some of the mentioned packages to manage the source of your data, but that is independent of the precompilation question.
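As a rough sketch of how this could look for the question above: the module below defines a stand-in Mytype and build_my_object (both hypothetical here, the real ones live in your package) and builds the object in a const. To keep the snippet runnable on its own, the CSV content is an inline string rather than a file; in a real package you would pass a file path instead.

```julia
module MyPackage

# Hypothetical stand-in for the package's real type.
struct Mytype
    names::Vector{String}
    values::Vector{Float64}
end

# Hypothetical builder: accepts a file path or an IO, since eachline handles both.
function build_my_object(source)
    names = String[]
    values = Float64[]
    for line in Iterators.drop(eachline(source), 1)  # skip the header row
        isempty(strip(line)) && continue
        name, value = split(line, ',')
        push!(names, String(strip(name)))
        push!(values, parse(Float64, value))
    end
    return Mytype(names, values)
end

# Inline stand-in for a CSV file shipped with the package (assumption for this sketch).
const example_csv = """
name,value
a,1.0
b,2.5
"""

# Evaluated when the module is precompiled; the result is stored in the cache.
const myobject = build_my_object(IOBuffer(example_csv))

end # module
```

Because the object is rebuilt whenever the package is precompiled, it is always constructed with the definitions of Mytype and build_my_object from the installed version.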

Thanks for pointing this out! It’s fantastic that it will be serialized automatically at precompilation, and now that you say it, it makes sense.

So then I could use something fancy to store my .csv files, and only these files. Do you have a recommendation for this? I think it would make more sense to have them live in the package’s repository.

I couldn’t remember exactly how precompiling worked, so I was playing around with it and didn’t get my reply in before @GunnarFarneback did, but maybe it’s useful to have a self-contained example in this thread:

$ julia -e "using Pkg; Pkg.generate(\"PrecompileExample\")"
  Generating  project PrecompileExample:
    PrecompileExample/Project.toml
    PrecompileExample/src/PrecompileExample.jl
$ echo "module PrecompileExample
f() = (sleep(5); 3)
const x = f()
end # module" > PrecompileExample/src/PrecompileExample.jl
$ julia --project=PrecompileExample/ -e "@time using PrecompileExample"
  5.526768 seconds (141.25 k allocations: 10.659 MiB, 2.78% compilation time)
$ julia --project=PrecompileExample/ -e "@time using PrecompileExample"
  0.007801 seconds (5.61 k allocations: 374.812 KiB)

It depends on how large the files are and what kind of storage options you have for them. However, if they are small enough that you can conveniently store them in the package repository, that’s likely to be the easiest option.

Okay, so I’ll simply do that; these files are not huge. Do I need to take any care when accessing them? Should I put them in the src/ folder or somewhere else?

You can place them wherever you want inside your package. You want to start from @__DIR__ when you construct the path to them, e.g.

csv_path = joinpath(@__DIR__, "..", "data", "data.csv")

if the code is in a file in src and you store the csv in a sibling directory data.
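Since the .csv files may change between versions, one option worth considering is Base's include_dependency, which declares a file as a precompilation dependency so that the module is recompiled when the file is modified. A minimal sketch, assuming the data/ layout described above; data_path is a hypothetical helper name, and the isfile guard is only there so the snippet runs outside a real package:

```julia
# Hypothetical helper: resolve a file in the package's data/ directory,
# assuming this code lives in a file under src/.
data_path(name) = joinpath(@__DIR__, "..", "data", name)

const csv_path = data_path("data.csv")

# Register the file as a precompilation dependency.
# The isfile guard is an assumption for this standalone sketch; in a package
# you would probably prefer a missing file to raise an error instead.
isfile(csv_path) && include_dependency(csv_path)
```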

Perfect, then I’ll use that. Thanks a lot for the guidance!