I’ve already looked at several tutorials on code layout, which are a bit long, but I can’t apply them to my case because I can’t see my mistake. I am using modules.
Here’s the problem. I have code that contains a lot of data that I would like to separate from the calculation code.
Let’s say my data are in Data.jl and my computation code is in Calc.jl.
Data.jl contains:
> module Data
> # Here are data
> end
What exactly do I need to do to make Calc.jl aware of the data contained in the Data module? Where should the “include” and “using” instructions appear?
Thank you for your answer, and sorry for the disturbance; I’m sure this is a very basic question that has already been answered in extenso.
Are Data and Calc in the same big repo? Do they belong together or do they make sense independently?
What do you mean by “code that contains data”?
Before we discuss modules, you need to know that include has nothing to do with modules (unlike Python’s import). Conceptually it just copy-pastes code from one file to another, which means it’s just a way to have smaller files but it does not influence the logical structure of your code. Thus you can reason as if your whole code were in one single file.
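As a minimal sketch of that copy-paste behavior (the file contents and the names `g` and `weight` are made up for illustration), the following writes a small file to a temporary directory and includes it. The definitions land in whatever module the `include` call sits in, here the top level:

```julia
# Create a throwaway file standing in for a data file (hypothetical contents).
path = joinpath(mktempdir(), "data.jl")
write(path, "const g = 9.81  # gravitational acceleration, m/s^2\n")

# include evaluates the file's contents in the current module,
# as if its lines had been typed at this spot.
include(path)

# Code that comes after the include can use g directly; no `using` is needed.
weight(m) = m * g
println(weight(2.0))
```

No module was involved at any point, which is why `include` alone never requires a matching `using`.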
Yes: files Data.jl and Calc.jl are in the same repository.
Calc.jl must have access to all the data contained in Data.jl.
In its original version, my code contains a lot of physical data, followed by a second part that solves a differential equation and does some other computations.
I would like to group the data in a separate file (say, Data.jl) so that the main code can access them. That way I could change the data in Data.jl without modifying the computation part.
Perhaps I’m misunderstanding here, but I wouldn’t use a package to encode data — instead I’d use the raw/original data file(s) that make sense for your application, in whatever raw format you have them in (be it CSV or HDF5 or Parquet or whatever). And if your data are bigger than a handful of MBs or change frequently, then I wouldn’t commit them into a git repository at all.
You could have a package that helps you read in data files or a submodule that helps you pre-process them, though! There are indeed several ways to arrange two modules together, and what makes the most sense would depend upon the particulars of your use-case.
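One such arrangement, if you do want to keep two separate modules as in the original question, might look like the sketch below. The parent module `MyProject` and the names `g`, `mass` and `weight` are hypothetical; in a real project each inner module would sit in its own file (Data.jl, Calc.jl) and be pulled into the parent with `include`, with Data.jl included first.

```julia
module MyProject            # hypothetical parent module

module Data                 # would live in Data.jl
export g, mass              # names that `using` makes visible elsewhere
const g = 9.81              # m/s^2
const mass = 2.0            # kg
end

module Calc                 # would live in Calc.jl
using ..Data                # the two dots mean "Data as defined in the parent module"
weight() = mass * g         # exported names from Data are available unqualified
end

end # module MyProject

println(MyProject.Calc.weight())
```

Without a parent module, the same idea applies with a single dot: a script would call `include("Data.jl")` and then `using .Data`.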
The simplest way would be two different files within the same module.
To create the right file structure, open a Julia REPL in Pkg mode (with ]) and then run
pkg> generate MyPackage
Then you can add two files data.jl and calc.jl to the src folder, before editing your main file src/MyPackage.jl like so:
module MyPackage
include("data.jl")
include("calc.jl")
end
As long as you include data.jl before calc.jl, the objects defined in the latter will have access to the objects defined in the former.
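To make the ordering concrete, here is a sketch that recreates this three-file layout in a temporary directory and loads it. The file contents (`g`, `mass`, `weight`) are invented for illustration; only the file names and the `MyPackage` module come from the steps above.

```julia
src = mktempdir()   # stand-in for the MyPackage/src folder

# data.jl holds only the data (hypothetical values).
write(joinpath(src, "data.jl"), """
const g = 9.81      # m/s^2
const mass = 2.0    # kg
""")

# calc.jl holds only the computation and refers to names from data.jl.
write(joinpath(src, "calc.jl"), """
weight() = mass * g
""")

# The main file glues both together inside a single module.
write(joinpath(src, "MyPackage.jl"), """
module MyPackage
include("data.jl")
include("calc.jl")
end
""")

# Relative paths in include are resolved against the including file's folder.
include(joinpath(src, "MyPackage.jl"))
println(MyPackage.weight())
```

Because both files end up in the same module, calc.jl needs no `using` statement to see the data; editing data.jl changes the values without touching the computation code.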
To use your brand new package, just stay in the Julia REPL, activate the environment corresponding to MyPackage and then you can do
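The command that followed is not shown here. As an assumption about what was meant, the sketch below uses the function forms of the Pkg REPL commands and generates the package in a temporary directory so that it is self-contained. `greet` is the placeholder function that, as far as I know, the generated template defines.

```julia
using Pkg

cd(mktempdir())               # work in a throwaway directory
Pkg.generate("MyPackage")     # function form of `pkg> generate MyPackage`
Pkg.activate("MyPackage")     # function form of `pkg> activate MyPackage`

using MyPackage               # loadable because its environment is now active
MyPackage.greet()             # placeholder function from the generated template
```

In an interactive session the same steps would be `pkg> activate MyPackage` followed by `using MyPackage` at the `julia>` prompt.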
I don’t want to derail the discussion, but I’d be interested in what your advice here would be instead.
Personally, I like to keep even moderately sized data files in Git LFS because it is nice to be able to clone the repo containing the data and my evaluation scripts. But I admit it can take quite a long time to run git commands.
I used to keep the data files in a separate location from the git folder, but I didn’t like that so much because it is less reproducible and I often work on several machines, depending on which one is more convenient…
I most frequently use JuliaHub’s DataSets these days (as you might expect). When executing a batch job, it records the exact version of both the code and the data accessed.