Recording simulation data

When doing research, I like setting up experiments in Julia to verify my theoretical findings or simply to make sense of the concepts I’m dealing with. The point is, my experiments are usually small and simple, and I’d like to set them up and get data quickly.

Now, my problem is that as my experiments grow in complexity, it becomes cumbersome to store and handle the data generated by these simulations in a systematic way. I couldn’t find any information on how to store data “properly” here or on other Julia forums.

As an example, I have a function run_test(params...) that takes around 6 parameters and outputs a vector of variable length specified by one of the parameters (the entries in the vector are 0, 1 or nothing, if it makes a difference). Say I want to run this test 1000 times for each combination of 2 parameters that I’m varying (while keeping the other parameters fixed).

Should I create a DataFrame in which each row represents one run and contains the values of all parameters? Should my output vector be stored in a single cell of the DataFrame? Or is it better to define a lot of columns and store each entry of the vector separately (provided I know how big it can ever get)?

Or should I store each set of 1000 runs as a vector of vectors and put it in a Dict whose key encodes the setup? I feel this is more efficient but harder to manipulate.
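
For concreteness, the two layouts I’m weighing look roughly like this (a minimal sketch; run_test here is just a stand-in, and the parameter names are made up):

using DataFrames

run_test(n, p1, p2) = rand([0, 1, nothing], n) #stand-in for the real simulation

#Option 1: one row per run, with the output vector in a single cell
df = DataFrame(p1=Float64[], p2=Float64[], run=Int[], output=Vector[])
for p1 in (0.5, 1.0), p2 in (1.0, 4.0), run in 1:3
	push!(df, (p1, p2, run, run_test(10, p1, p2)))
end

#Option 2: a Dict keyed by the parameter setup, with a vector of vectors as value
results = Dict((p1, p2) => [run_test(10, p1, p2) for _ in 1:3]
	for p1 in (0.5, 1.0), p2 in (1.0, 4.0))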

In general, is it worth creating a struct to hold the parameters or the output? Is it a good idea to store structs in a DataFrame or other data structures?

What if I want to keep a log of all past experiments? Should I load/save CSVs with DataFrames? JLD2 the Dicts?
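
For instance (assuming the df and results objects from the sketch above):

using CSV, JLD2

CSV.write("runs.csv", df) #log the runs as a CSV table
jldsave("runs.jld2"; results) #or dump the Dict to a JLD2 file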

I hope these questions give an idea of what I’m struggling with. I feel like there must be good practices for working with data, and I’d be very grateful for any pointers in that direction. Perhaps this is a more general question, but I wonder if there is any advice specific to Julia.

Thanks a lot in advance!

Stas

1 Like

It sounds like you might be interested in DrWatson.jl, which I think helps with this stuff (I have never tried it myself, though).

10 Likes

You might want to look at https://github.com/JuliaDynamics/DrWatson.jl
As far as I understand, this package is made exactly for this purpose.

7 Likes

DrWatson.jl

From what I could gather from the website/JuliaCon video:

DrWatson is mostly designed to ensure you have a reproducible experimental environment that you can share with colleagues.

  • So yes, DrWatson might be a good idea for keeping a log of all past experiments.

But I’m not sure that it directly addresses your issue of storing the data itself.

That said, DrWatson does auto-generate file names from your input parameters, so you can store individual simulation results more easily. From there, you can probably choose whatever tool you want to save those sim runs to disk (e.g. write out DataFrames as CSV files, or something similar).
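
A minimal sketch of the file-naming part (the parameter names here are made up):

using DrWatson

params = (n=100, p1=0.5, p2=4) #hypothetical parameter set
fname = savename(params, "jld2") #-> "n=100_p1=0.5_p2=4.jld2"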

HDF5.jl

If, on the other hand, you would like to store all your data to a single file, I would suggest a hierarchical data format or one that can store multi-dimensional arrays. HDF5 does both.

GitHub - JuliaIO/HDF5.jl: Save and load data in the HDF5 file format from Julia

From what I can gather, you will probably be generating a fairly large amount of data, and HDF5 is a good format for that as well. It apparently has its issues, especially when deleting data, but I have found it to work very well for my applications.

  • If the data you are saving has a hyper-rectangular shape (table of Ni x Nj x Nk x … elements), you can simply stuff all of it in a single N-dimensional array, and write that directly to HDF5.
  • If, on the other hand, some of the simulation runs output vectors of different lengths, then I suggest creating a separate hierarchical “group” (like a directory) for each run. Alternatively, you could simply write out a separate dataset for each run and embed the parameter values in the dataset’s name (see the sketch below).
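
A minimal sketch of both options with HDF5.jl (file, group and dataset names are made up):

using HDF5

#Option 1: hyper-rectangular results -> one N-dimensional dataset
results = rand(0:1, 50, 10, 1000) #e.g. vector length x parameter grid x runs
h5open("results.h5", "w") do f
	f["runs"] = results #write the whole array at once
	attributes(f["runs"])["param1"] = 0.5 #store parameter values as attributes
end

#Option 2: variable-length outputs -> one group per parameter combination
h5open("results_var.h5", "w") do f
	g = create_group(f, "p1=0.5_p2=4")
	for i in 1:3
		g["run_$i"] = rand(0:1, rand(10:20)) #each run gets its own dataset
	end
end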

One note: you might want to swap out nothing values for NaN (i.e. use Floats instead of Ints or Bools). This gives you a way to represent nothing (as NaN) while still using HDF5’s built-in support for multi-dimensional arrays.
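
A one-liner for that conversion (assuming v holds one run’s output):

v = [0, 1, nothing, 1]
vf = Float64[x === nothing ? NaN : x for x in v] #nothing -> NaN, Ints -> Floats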

3 Likes

CMDimData.jl (using HDF5.jl)

GitHub - ma-laforge/CMDimData.jl: Parametric analysis/visualization +continuous-f(x) interpolation

If you want a solution that already deals with input parameter sweeps for:

  • running your math simultaneously on all swept values
  • storing the results to a single HDF5 file
  • plotting the entire dataset (or a subset of it) as if it were a single x,y vector

then you might want to give CMDimData.jl a try.

I created an example below, assuming those 1000 runs were Monte Carlo iterations (though I only did 100):

using MDDatasets #Multi-dimensional functionality
using CMDimData #Environment to save/plot data from MDDatasets
using CMDimData.EasyPlot #Generic plotting facilities

#Use InspectDR as plotting backend:
CMDimData.@includepkg EasyPlotInspect; pdisp = EasyPlotInspect.PlotDisplay()

#Use EasyData (and load HDF5 library):
CMDimData.@includepkg EasyData #Save multi-dimensional dataset to hdf5 file


#==Input data
====================================================#
npts = 200
samples_per_cycle = 50
xrng = 1:npts #Julia range
x = DataF1(xrng) #x-values as function of 1 argument (vec of {x,y} pairs)


#==Simulation itself
====================================================#
function run_simulation(x, param1, param2)
	noise_floor = DataF1(x.x, randn(length(x))*.5)
	y = param1*cos(x*(2*pi/samples_per_cycle))+param2
	return y+noise_floor
end


#==Run simulation for all parameters (incl. monte-carlo iterations)
====================================================#
#Create/populate multi-dimensional "signal" object:
signal = fill(DataRS, PSweep("MC", collect(1:100))) do i_mc
	fill(DataRS, PSweep("param1", [0.5, 1, 1.5])) do param1
	#Inner-most sweep: need to specify element type (DataF1):
	#(Other (scalar) element types: DataInt/DataFloat/DataComplex)
	fill(DataRS{DataF1}, PSweep("param2", [1, 4, 16])) do param2
		sig = run_simulation(x, param1, param2)
		return sig
end; end; end


#==Generate plot
====================================================#
_title = "Monte-Carlo simulation"
axrange = cons(:a, xyaxes=set(ymin=-5, ymax=20))
line_attr = cons(:a, line = set(style=:solid, width=2))
plot = cons(:plot, axrange, nstrips=1, title=_title,
	ystrip1 = set(axislabel="Amplitude", striplabel=""),
	xaxis = set(label="Time (s)"),
)
push!(plot,
	cons(:wfrm, signal, line_attr, label="result", strip=1),
)


#==Display results in pcoll
====================================================#
#Need a plot collection (multi-plot window), to use plots:
pcoll = push!(cons(:plotcoll, title=""), plot)
gplot = display(pdisp, pcoll)


#==Save data to HDF5 (could also have saved plot)
====================================================#
EasyData.openwriter("output.hdf5") do w
	write(w, signal, "simresult")
end

EasyData.openreader("output.hdf5") do r
	signal_from_file = EasyData.readdata(r, "simresult")
end

println("Do something to keep last value from being dumped to REPL")
3 Likes

Thanks, guys, these are some fantastic suggestions. I will probably try out all of them and see what works best.

The questions you raise are exactly what I am constantly thinking about in my own PhD project right now. If a best-practice guide exists, that would be awesome! I haven’t found one yet.

Maybe a few questions first:

  • how long does a simulation run take?
  • how much will the simulation change? Is it more or less final, or do you want to extend the model in the future?

I haven’t found the perfect solution. But here are a few thoughts for my setup (fast runtime for single simulation, many changes of the model expected):

  • As long as the simulation core is evolving, no format will be perfect: if you add new parameters/outputs, old simulations need to be recomputed. I gave up on trying to find a format for the output that remains valid all the time. Instead, only the stored parameters should be in a format that might survive when I add new features.
  • I too started out by recording all simulation data, but after each change you have to modify a lot of code to make reading/writing work again. So now I only store the output of certain “goal functions” (fewer than 10) instead of saving all the degrees of freedom of my simulation (more than 300).

My current packages are:

  • JSON3.jl to write/read parameters as .json (I have nested parameters; see the sketch after this list).
    My parameters are stored as structs with Base.@kwdef to allow nice initialisation and defaults.
    With JSON3.jl it is quite easy to store complex data types in JSON as well.
    With JSON3.jl this can also be converted into a (nested) Dict, and if you want a list, you can flatten the Dict and store it in a table.
    • If a parameter value is not present in the file, you can use a predefined default value. This makes old parameter files compatible with new ones.
    • If you only have a few parameters, you could also use NamedTuple. It is faster than Dict and you don’t need to define what it contains before initialisation.
  • Pluto.jl (or Jupyter) to store experiments/modifications of the core simulation in notebooks. This changes a lot about how you experiment: you can just store a notebook with some crazy ideas instead of always changing the core simulation. Also, it’s interactive, which is great for finding good parameters!
    • In my case, I also wanted to vary several parameters and find the best values. Adding a few sliders for them, changing them, and looking at the results in real time gives a good intuition of what is going on.
    • I guess Pluto.jl could be an alternative to CMDimData.jl, with more flexibility but less out-of-the-box functions.
  • BlackBoxOptim.jl offers multi-criteria optimisation algorithms. If you know what the goals of your simulations are, then you could find optimal parameters with the borg_moea method. This also comes with a structure to store all parameters and the values of the goal functions.
  • DrWatson.jl to keep track of the source code version.
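
A minimal sketch of the JSON3.jl point above (the struct and field names are made up):

using JSON3, StructTypes

Base.@kwdef struct NoiseParams #hypothetical nested parameter struct
	sigma::Float64 = 0.5
end
Base.@kwdef struct SimParams
	n::Int = 100
	p1::Float64 = 0.5
	noise::NoiseParams = NoiseParams()
end

#Tell JSON3 how to (de)serialize the structs:
StructTypes.StructType(::Type{NoiseParams}) = StructTypes.Struct()
StructTypes.StructType(::Type{SimParams}) = StructTypes.Struct()

params = SimParams(p1=1.5) #Base.@kwdef gives keyword init + defaults
open("params.json", "w") do io
	JSON3.write(io, params) #write nested parameters as JSON
end
loaded = JSON3.read(read("params.json", String), SimParams)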

2 Likes

This is a somewhat difficult problem in general, and I am not sure that the approach of DrWatson.jl is the right one as the code in a project is also an input. Ideally a repo would record hashes of everything, including the code, and have a way of generating data as a node on a graph. But this may make interactive workflows tricky.

1 Like

I use the jsonlines format for this type of stuff. Just append json objects + newline for each experiment. Something like

{"testid":1, "parameters": [1,2,3], "result":[23.1,34.4,...]}
{"testid":1,...}

I usually use JSON3 for writing, and https://github.com/danielw2904/JSONLines.jl for reading.
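
For completeness, a minimal writing sketch (field names taken from the example above):

using JSON3

#Append one record per line (JSON Lines):
open("experiments.jsonl", "a") do io
	JSON3.write(io, (testid=1, parameters=[1, 2, 3], result=[23.1, 34.4]))
	println(io) #the newline terminates the record
end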

I don’t know if that is the best way to do it but it works for me when scraping data sequentially.

1 Like

Thanks for a fantastic reply! To answer your questions: my simulations take anywhere from a few seconds up to a day, depending on the problem. The models tend to evolve gradually as I implement new things to test.

I’m using Jupyter as a scratchpad and for testing, and then save the functions I intend to reuse to a .jl file. I was thinking of creating a package for each project, but it seems like that would slow down the process. I also liked the look of Pluto at JuliaCon; I’m just a bit reluctant to adapt to a new workflow.

I agree that it would be great to have a guide to good practices, or at least examples of well-written code :)