Computational experiments: organising different algorithms, their parameters and results

Hi everyone,

I am trying to run a bunch of computational experiments: different algorithms on different problems, each with its own set of parameters. Pretty simple, right? Well, I am wondering what the best pipeline would be to store interesting statistics and data for each experiment run in a way that is easy to store and process later on. My limited SQL knowledge is pushing me towards relational databases, with a table for algorithms, one table per algorithm for its set of parameters linked to the algorithm record, a table for the problems, and finally a table for the runs and their shared summary statistics and results, linked to the algorithm, parameters and problem records. Any algorithm-specific results can also be written to a per-algorithm table and linked into the experiment runs table.

So I wonder, what is the best way to do this programmatically? I am guessing it has to do with DataFrames.jl, but is there a more suitable package for organising this mess that you are aware of? I am really not familiar with the data ecosystem in Julia, but I know there are too many options out there, and trying all of them is currently not an option for me, so I would appreciate a nice holistic summary of what’s available from an expert.
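Roughly, the relational layout I am imagining looks something like this (sketched with SQLite.jl; all table and column names are just placeholders, not a finished design):

```julia
# Rough, illustrative schema sketch; names are placeholders, not a final design.
using SQLite, DBInterface

db = SQLite.DB("experiments.db")

DBInterface.execute(db, """
    CREATE TABLE IF NOT EXISTS algorithms (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )""")

DBInterface.execute(db, """
    CREATE TABLE IF NOT EXISTS problems (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )""")

DBInterface.execute(db, """
    CREATE TABLE IF NOT EXISTS parameter_sets (
        id           INTEGER PRIMARY KEY,
        algorithm_id INTEGER REFERENCES algorithms(id),
        params_json  TEXT                -- serialised parameter values
    )""")

DBInterface.execute(db, """
    CREATE TABLE IF NOT EXISTS runs (
        id               INTEGER PRIMARY KEY,
        algorithm_id     INTEGER REFERENCES algorithms(id),
        parameter_set_id INTEGER REFERENCES parameter_sets(id),
        problem_id       INTEGER REFERENCES problems(id),
        runtime_seconds  REAL,
        objective_value  REAL           -- shared summary statistics go here
    )""")
```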

I am also interested in any generic advice you may have. Thanks in advance!

1 Like

I think your best bet is to at least start off with some custom types and use them as containers for your data. I can’t offer much more than that (nor am I anywhere close to being an “expert”) because it sounds like you’re at the very beginning of your task. My advice is to start off with something (be it DataFrames or SQLite or anything in https://github.com/JuliaDatabases or https://github.com/JuliaData) and please come back and ask as soon as you have some specific questions/problems that are holding you back.
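For example, something as simple as this (a minimal sketch with made-up field names) can carry you quite far at the start:

```julia
# Minimal sketch of a container type for one experiment run; fields are made up.
struct RunResult
    algorithm::String
    problem::String
    params::Dict{Symbol,Any}      # whatever parameters the algorithm takes
    stats::Dict{Symbol,Float64}   # summary statistics of the run
end

runs = RunResult[]   # collect results here, then post-process or save later
push!(runs, RunResult("simulated-annealing", "tsp-50",
                      Dict(:temperature => 100.0),
                      Dict(:objective => 123.4, :runtime => 3.2)))
```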
Good luck!

Hi, thanks for your comment. This, however, will soon get hard to manage once I start mixing different types of results and data from different algorithms. Also, to make it easier to post-process later, and to separate the data generation stage from the post-processing stage, I believe I have to dump these into files anyway. So I was trying to get advice on the best way to organise these files. Naively, I could have one txt file for each run and perhaps organise the files in folders by algorithm, by problem, or both. I can adopt some file naming standard and all that, but I feel I will only be strangling myself the more of these files I make, since I will have to read them all again to post-process. Also, querying the results will be a nightmare and feels like reinventing the wheel, since all the beautiful database machinery is already out there.

I guess DataFrames.jl and Query.jl might be what I am looking for.

Yeah, that sounds like a great place to start. For the files, you can either structure them yourself as JSON(2) or use JLD.
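For the JSON route, a minimal sketch with JSON.jl (field names made up) would look like:

```julia
# Dump one run's results to a JSON file and read it back; names are illustrative.
import JSON

results = Dict("algorithm" => "A", "seed" => 42, "objective" => 1.23)

open("run-001.json", "w") do io
    JSON.print(io, results, 2)   # pretty-print with 2-space indent
end

loaded = JSON.parsefile("run-001.json")
```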

1 Like

I am assuming that the experiments are costly (in terms of runtime) relative to the data they generate. I usually do this the following way:

  1. Have each experiment / parameter / run combination mapped to a unique filename in a directory. This can be deterministic, e.g. experiment-01-params-97-run-07.dump, or “next available” according to some scheme (you need to consider race conditions when running experiments in parallel; I just check for file existence).

  2. Each run is in a separate Julia process and dumps the results I need to the corresponding file. I usually use JLD2.jl for this, but lately I have been experimenting with BSON.jl. I try to overdo metadata (as it is costly to recover later on), so I dump the random seed, runtime, outcome, etc. I also minimize custom types; basically, tuples of tuples are fine. In 0.7, named tuples are especially nice for this.

  3. When done, ingest all the data dumps in a loop and format them as “tidy data” for analysis.

This is less sophisticated than SQL, but it works nicely in a minimal server setup. A rough sketch of steps 2 and 3 is below.
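Something along these lines (not my actual code; file names, fields and statistics are all illustrative):

```julia
# Step 2 (inside each worker process): dump results plus metadata to a unique file.
using JLD2

filename = "experiment-01-params-97-run-07.jld2"
isfile(filename) && error("refusing to overwrite $filename")   # crude race check

jldopen(filename, "w") do f
    f["seed"]    = 42
    f["runtime"] = 12.7
    f["results"] = (objective = 1.23, iterations = 517)   # named tuples are handy
end

# Step 3 (after all runs are done): ingest every dump into one tidy table.
using DataFrames

df = DataFrame()
for file in filter(f -> endswith(f, ".jld2"), readdir("."; join = true))
    jldopen(file, "r") do f
        push!(df, (file = file, seed = f["seed"], runtime = f["runtime"],
                   objective = f["results"].objective))
    end
end
```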

6 Likes

Thanks for your comment. I like your approach, so I think I will go with it if the database thing doesn’t work out. I like the database approach for its flexibility once everything is in place. So far, I like JuliaDB.jl and DataFrames.jl for playing with tables, Query.jl for querying the tables in SQL-like syntax, and StatPlots.jl for convenient plotting. I just need to get my relational database design fixed, then perhaps make an AlgorithmExperiments.jl package or something that makes this use case easier in the future. Let’s see!

There is a Python package called Sumatra that does this kind of thing, which may be useful directly or as inspiration:

https://pythonhosted.org/Sumatra/

3 Likes

Generic advice on reproducibility: when I’m running computational experiments, I find that the code often needs to change over the long run, and this may break compatibility with how results were saved a couple of weeks back, when I was interested in a different observation.

My primitive technique is to have the Julia code simply copy itself (*.jl) into the results folder, along with all input parameters, so that at any time I can go back and re-run that specific simulation. Another post-doc I worked with also had the clever idea of programmatically querying the git repo and saving off the commit hash, as a means of making the work reproducible.
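A sketch of both ideas (assuming the script lives inside a git working copy and `git` is on the PATH; the folder name is made up):

```julia
# Snapshot the running script and the current git commit into the results folder.
resultsdir = "results/run-007"
mkpath(resultsdir)

# Copy the script itself so this exact run can be repeated later.
cp(@__FILE__, joinpath(resultsdir, basename(@__FILE__)); force = true)

# Record the commit hash (and whether the working tree was dirty).
commit = strip(read(`git rev-parse HEAD`, String))
dirty  = !isempty(strip(read(`git status --porcelain`, String)))
write(joinpath(resultsdir, "commit.txt"), commit * (dirty ? " (dirty)" : "") * "\n")
```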

I haven’t worked with Sumatra yet; I’ll have to go and check that out as well.

10 Likes

Great point!

Great solution!

1 Like

Sumatra looks cool. I think it has all the components there, so I can just make a more Julian version of it that is possibly even easier to work with when post-processing.

1 Like

I’ve been looking at Sumatra for a while now, but I never took the step of using it with Julia.
It would be great if someone ported it to Julia. I’d love to help somehow, but I’m afraid my Julia programming skills are not good enough.

I recently wrote a blog post (Responsible research: configuration version-controlled – Wenjie Zheng – Statistical Learning Solution Expert) about this. It has somewhat the flavor of what @Alan_Bahm said.

It might be overkill for your problem, but I’m sure you can find a balance between my solution and your needs.

I might write some examples of how to use the practice I advocate once I finish looking for a job.

This is a nice article, but it lacks ease of post-processing, since one would presumably have to hunt down the commits for each parameter configuration, which I imagine would be nothing short of a nightmare, if I understood correctly. This is exactly why I am implementing a database to manage the mess.

You can use it in a hierarchical way: use the database or anything similar to store a group of predefined experiments, and use my solution to manage the breaking changes that would destroy the structure of your database.

Hmm, that’s possible, except that a good database design should not be destructible :wink:

I’m not an expert on databases, but how would you design a database so that its design won’t become outdated as new experiment ideas keep coming up during the research?

Every experiment compares a number of implementations. If you change the type of results or parameters, just make a new implementation. Parameters and results are data, not columns. The following is a rough design I prepared two days back. It has changed slightly since, but it still shows the idea.
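To give a toy illustration of what I mean by parameters being data rather than columns (made-up names, not the actual design):

```julia
# Each parameter of a run is one row, so adding a new parameter later
# never requires altering the table schema.
using DataFrames

run_parameters = DataFrame(
    run_id = [1, 1, 2, 2],
    name   = ["step_size", "max_iters", "step_size", "max_iters"],
    value  = ["0.1", "1000", "0.05", "2000"],   # stored as strings; parse on read
)
```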

1 Like

I see you are trying to make a very flexible database design. As a brainstorming exercise, you may want to consider the following scenario:

Day 1. You design an algorithm A, and you make a table for it.
Day 2. You add a normalization step A.a to it, and you make another table for it.
Day 3. You add another normalization step A.b to it, and you make a third table for it.
Day 4. You want to change the normalization of A.a from 0 mean to 1 mean (a weird idea, but you get what I’m trying to say), and you make a fourth table.

At the end of the month, you find you keep making small tables, which makes the table structure less relevant.

The problem here is that an algorithm can have its own parameters and sub-algorithms, and those sub-algorithms can in turn have their own parameters and sub-sub-algorithms. This is a recursive process. How can a database design handle it without endlessly creating small tables?

Edit: change “configuration” to “parameter”.

Hmm, I am assuming that every implementation has a function which accepts a set of parameters. You can try different combinations of parameters in every run. This is what the “Run implementation parameters” table stores: the values of all the parameters of a certain run. If you want to try a different normalization step, then make it a parameter and try as many variants as you want in different runs. The implementation stays the same; it just receives different parameter values when its function is called. Adding more parameters after the fact should also be possible, but it will require setting a value for the newly added parameters for the old runs, which should not be difficult. I think the best way to fully understand what I am doing is to wait for the debut of this package; it is not far from its initial release, I hope!
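For instance, back-filling a parameter added after the fact is just a matter of adding rows with a default value (a toy sketch with made-up names, not the package itself):

```julia
# Hypothetical back-fill: old runs get a row with a default value for a parameter
# ("tolerance") introduced after those runs were recorded; no schema change needed.
using DataFrames

run_parameters = DataFrame(run_id = [1, 2],
                           name   = ["step_size", "step_size"],
                           value  = ["0.1", "0.05"])

for id in unique(run_parameters.run_id)
    push!(run_parameters, (run_id = id, name = "tolerance", value = "1e-6"))
end
```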

On the sub-algorithms point: it is a valid one, but I am assuming the algorithm is flattened, so that each combination of sub-algorithms defines an algorithm of its own. This may be expanded in the future. Alternatively, you can make the choices of sub-algorithms themselves parameters, and that way you have a more generic algorithm.

After a bit more thinking, I find that your package can be one implementation of my solution. Since my solution doesn’t impose any particular implementation tools, you can go with whatever tools you like. In particular, your database implementation may help with navigation and visualization and remove some of the burden from the research diary.

Good luck with your package and keep me posted!