Static data in a package

What is the best place for static data in a package? The data in question is the Waasmaier-Kirfel table of atomic X-ray form factors. The current solution is to place the data file in the “data” subfolder of the “src” folder and load it at compile time as follows:

function load_waaskirf()
    waaskirf_filepath = joinpath(@__DIR__, "data", "f0_WaasKirf.dat")
    t=readlines(waaskirf_filepath)
    # More code...
end

const wktab=load_waaskirf()

Is this the right way of doing things? In particular, is it OK to place the data in a subfolder of “src”?


Depending on how big the dataset is, you might consider DataDeps.jl instead. If it’s really big, you might also want to use JLD or some other way of storing it directly as a Julia data structure, so you don’t have the overhead of re-parsing it each time. (DataDeps can do that too, so you don’t have to store the large file in your repo.)

But if it’s a relatively small file I think that’s a fine solution.
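For the larger-file case, here is a minimal sketch of the "parse once, then reload a binary copy" idea. It uses the Serialization standard library rather than JLD, purely to stay dependency-free. `parse_text`, `load_cached`, and the demo file names are hypothetical. Note that the Serialization format is not guaranteed to be stable across Julia versions, so a cache like this should be treated as disposable.

```julia
using Serialization

# Hypothetical parser standing in for the real (possibly slow) text parsing.
parse_text(path) = [parse.(Float64, split(l)) for l in eachline(path)]

# Return the cached binary copy if it exists; otherwise parse the text file
# and write the cache for next time.
function load_cached(text_path, cache_path)
    isfile(cache_path) && return deserialize(cache_path)
    data = parse_text(text_path)
    serialize(cache_path, data)
    return data
end

# Demo with a throwaway file containing made-up numbers.
dir = mktempdir()
text_path = joinpath(dir, "demo.dat")
write(text_path, "1.0 2.0\n3.0 4.0\n")
cache_path = joinpath(dir, "demo.jls")

first_load = load_cached(text_path, cache_path)   # parses the text, writes the cache
second_load = load_cached(text_path, cache_path)  # reads the cache
```

For a file of a few hundred lines the parsing cost is probably negligible, so this only pays off for genuinely large datasets.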


Thank you for the reference to DataDeps.jl; it seems to be the right way to distribute code examples that use lots of data. However, in this case the data file is about 200 lines long, and its contents have not changed since 1995. In this sense the data is as close to fundamental physical constants as it gets, and distributing it with the package seems to be the right thing. The question is rather about the package layout: should we keep the data in a subfolder of “src” or elsewhere?

Incidentally, I wonder whether loading the data at compile time is OK. An alternative would be to use the __init__() function, but it looks like the latter is considered a fallback for cases where compile-time loading is not possible (e.g. when the package uses third-party libraries that need to be initialized at runtime).
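To make the comparison concrete, here is a minimal sketch of the two options side by side. The module name, `parse_table`, and the temporary demo file are all hypothetical. In a real package `DATA_PATH` would be something like joinpath(@__DIR__, "data", "f0_WaasKirf.dat"). `include_dependency` is a Base function that asks Julia to recompile the package when the listed file changes; it has no effect outside precompilation.

```julia
module LoadDemo

# Stand-in for the real data file: a temporary file with made-up contents
# keeps the sketch self-contained.
const DATA_PATH = let p = joinpath(mktempdir(), "demo.dat")
    write(p, "# comment\n1.0 2.0\n3.0 4.0\n")
    p
end

# Parse every non-comment line into a vector of Float64.
parse_table(path) =
    [parse.(Float64, split(l)) for l in eachline(path) if !startswith(l, "#")]

# Option 1: evaluated while the module is compiled. In a precompiled package
# the parsed value is stored in the cache, so the file is not re-read on load.
include_dependency(DATA_PATH)   # trigger recompilation if the file changes
const TABLE_STATIC = parse_table(DATA_PATH)

# Option 2: filled in each time the module is loaded.
const TABLE_RUNTIME = Ref{Vector{Vector{Float64}}}()
function __init__()
    TABLE_RUNTIME[] = parse_table(DATA_PATH)
end

end # module
```

With option 1 the data file only has to be readable when the package is precompiled; with option 2 it is read on every load.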

Given that it’s a small file, an alternative would be to just declare it as a constant in a Julia file, say waaskirf.jl, with

const WAASKIRF = """
#F f0_WaasKirf.dat                                                           
#UT  Elastic Photon-Atom Scattering, relativistic form factors.
#UF0TYPE PARAMETRIZATION ; TABLE OR PARAMETRIZATION?
#UIDL xf0
(rest of the lines here)
""" # end

and then just include that in your package:

module Package
# ...
include("waaskirf.jl")
# ...
end

wherever before you had something like const WAASKIRF = read("path/to/waaskirf.dat", String).
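Once the text lives in a string constant, it can be parsed like a file by wrapping it in an IOBuffer. A minimal sketch, assuming whitespace-separated numeric rows and "#" comment lines. The demo contents below are made up and do not reproduce the real layout of f0_WaasKirf.dat, and `parse_embedded` is a hypothetical helper:

```julia
# Hypothetical, shortened stand-in for the embedded file contents.
const WAASKIRF_DEMO = """
#F demo.dat
#UT  made-up header line
1.0 2.0 3.0
4.0 5.0 6.0
"""

# Read the string line by line, skipping comments and blank lines.
function parse_embedded(text::AbstractString)
    rows = Vector{Vector{Float64}}()
    for line in eachline(IOBuffer(text))
        (isempty(strip(line)) || startswith(line, "#")) && continue
        push!(rows, parse.(Float64, split(line)))
    end
    return rows
end

const DEMO_ROWS = parse_embedded(WAASKIRF_DEMO)
```

The same parser would also accept the contents of an actual file, e.g. `parse_embedded(read(path, String))`, so switching between the two layouts later should be cheap.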


I would make a data/ directory and put it there. Also, declare a

data_path() = abspath(joinpath(@__DIR__, "..", "data", "constants.dat"))

function in the source.

Alternatively, depending on the format, putting it in a Matrix or Dict in the source code could be fine, too.
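A rough sketch of what the Dict variant could look like. The field names, the accessor, and above all the numbers are placeholders chosen for illustration; they are NOT Waasmaier-Kirfel coefficients.

```julia
# Placeholder numbers only: these are NOT real form-factor coefficients.
const DEMO_COEFFS = Dict(
    "H"  => (a = [0.1, 0.2], b = [1.0, 2.0], c = 0.0),
    "He" => (a = [0.3, 0.4], b = [3.0, 4.0], c = 0.0),
)

# Hypothetical accessor that fails with a clear message for unknown elements.
function demo_coeffs(symbol::AbstractString)
    haskey(DEMO_COEFFS, symbol) ||
        throw(ArgumentError("no entry for element $symbol"))
    return DEMO_COEFFS[symbol]
end

h = demo_coeffs("H")   # fields are accessed as h.a, h.b, h.c
```

The trade-off is that the original data file no longer ships unmodified with the package.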


Along the lines of what @tlienart proposed, in PeriodicTable.jl and PhysicalConstants.jl we hard coded the (small) datasets.


I would suggest that the correct location is:

/deps/data/Constants/constants.dat

i.e. from /src/

joinpath(@__DIR__, "..", "deps", "data", "Constants", "constants.dat")

This is where DataDeps looks for package-specific data: <CURRENTPKG>/deps/data/ is always on the DataDeps load path.
One can reference data there with a ManualDataDep, but there is little need to.

Alternatively, and especially if it is plain text, tlienart’s suggestion is pretty solid, since the file is so small.

I suggest using a raw string macro:

const WAASKIRF = raw"""
 Why use a raw string?
 $ is not interpolated in raw strings
 backslashes (\) are not treated as escapes -- no \n to newline
(rest of the lines here)
""" # end
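A small illustration of the difference, using made-up strings. The one exception to "nothing is escaped" in a raw string is a backslash directly before a quotation mark, which is still needed to embed the quote itself.

```julia
x = 42

regular  = "cost: $x\tdone"      # `$x` is interpolated, `\t` becomes a tab
verbatim = raw"cost: $x\tdone"   # every character is kept exactly as typed

println(regular)
println(verbatim)
```

For a data file that might contain `$` or `\` characters, the raw form avoids silently altering the contents.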

The main reason for not putting the data directly in a source file is to keep the data file “pristine”, as it should be for a third-party component.

It seems that the common practice is to put static data either in a “data” or in a “deps/data” directory in the main package folder. I thought at first that “deps/data” was the better choice (“deps” is already a de facto standard place for third-party stuff), but it seems that “deps” is mostly used to hold the build.jl script and downloaded third-party binary libraries. Thus a “data” subdirectory of the main folder seems to be the right place.

AFAIK there is no strong convention, so anything goes at this point. data/ is a sensible choice.

If these things get formalized later you can always change it, I would not worry too much about it.