Exploratory research project workflow

To people doing academic research at scale: what is your workflow in Julia?
I mean,

  • how do you develop your project,
  • at which point do you move functions into a module,
  • when do you write tests,
  • how do you scale,
  • how do you pass parameters between different parts,
  • how do you save an intermediate representation?

Recently I watched a great talk by Gaël Varoquaux at Python Summit. I was surprised that his approach was similar to what I eventually arrived at after years of struggling. It seems that the problem of evolving code is common in many fields. I am curious to hear how people deal with it.

4 Likes

I found a similar post with a lot of useful info.

Let’s narrow the scope. There are things that are a must:

  • an interactive environment with inline evaluation (VS Code)
  • version control (Git)
  • heavy data files stored separately
  • a lot of plots

What I usually do is create a folder for my new project, then a file in there, and start coding: load the data, make some plots, write some functions, etc. Then, once my file is getting big, I want to reuse code in another file, and my functions don’t change much anymore, I create a “lib” folder and move my functions into files that I include in my scripts. Usually I also have a “plot” folder for plots, a “data” one for saving stuff (CSV and JLD2 mostly), and a “config” one for parameters and constant data. For tests, at that point I sometimes just use some @assert in my files and look at the plots.
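A minimal sketch of what one of those scripts might look like under that layout; the file names and helper functions below are hypothetical:

```julia
# script.jl -- sketch only; lib/*.jl and the functions they define are placeholders
include(joinpath(@__DIR__, "lib", "loading.jl"))    # defines load_data
include(joinpath(@__DIR__, "lib", "plotting.jl"))   # defines plot_results

df = load_data(joinpath(@__DIR__, "data", "raw.csv"))
@assert !isempty(df)                                 # quick sanity check in lieu of tests
plot_results(df; outdir = joinpath(@__DIR__, "plot"))
```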

Then if I figure out that some of the code I wrote will be useful for other projects in the future and contains a coherent set of concepts, I move it to a package and clean it up, but that doesn’t happen very often.

Essentially, I start in the laziest way possible and make things more solid only when needed. That said, it depends a bit on the project; that workflow is really for exploratory, one-shot analysis. If I have a good view of where the project will go, I might be more organised. I usually put the folder under git, but I do not always include a Project file (more out of laziness than anything).

1 Like

You might be interested in DrWatson.jl which basically automates this setup process and has some utility functions for navigating and activating a project.
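For context, a minimal sketch of the DrWatson workflow (the project name and paths are just placeholders):

```julia
using DrWatson

initialize_project("MyProject"; authors = "Me")  # creates the folder layout and a git repo

# At the top of any script inside the project:
@quickactivate "MyProject"                       # activates the project environment
csv_path  = datadir("exp_raw", "run1.csv")       # <project>/data/exp_raw/run1.csv
plot_path = plotsdir("figure1.png")              # <project>/plots/figure1.png
```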

4 Likes

I see, great, it also looks similar to what I do. However, I always create a module to have a namespace and a nice way to load all the functions, plus the standard folder structure (src, test, scripts).
How do you load your library without a module, by include?

The other thing:
I am currently struggling to find a way to run a whole analysis that is a chain of different scripts with intermediate outputs. Any experience?

DrWatson helped me realize how important it is to activate the local environment before running the code. Now that VS Code does it automatically, I stopped using it. For generating packages, I use PkgTemplates, which also generates settings for GitHub CI.
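For reference, a sketch of what that looks like; the user name and plugin selection are hypothetical:

```julia
using PkgTemplates

t = Template(;
    user = "mygithubuser",    # placeholder GitHub user name
    plugins = [GitHubActions(), Codecov(), Documenter{GitHubActions}()],
)
t("MyNewPackage")             # generates src/, test/, Project.toml, CI workflow, git repo
```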

An exploratory workflow with a main.jl file which calls different scripts with include is a very good / acceptable workflow.

If you are only working with a single file, then you want to use Revise.jl’s includet, which means you don’t have to run include all the time.

But you should really be trying to put everything in functions.

Here’s what I do.

  1. Start out with main() at the bottom of your script.
  2. Use includet on that file and call main() when you want to run everything.
  3. Write everything in main; when it gets too big, break things into smaller functions that are called within main.
  4. Use Exfiltrator.jl extensively when debugging: @exfiltrate; error("") when you want to break and send everything in that function’s local scope to global scope for inspection at the REPL.
  5. Use dictionaries, named tuples, and custom structs to store information. Pass large objects to the functions and use @unpack to work with just the things you need. This means you don’t have to agonize about what you are and are not passing to your sub-functions.
  6. Just run main() all the time. If you use includet then everything will always be up to date. It’s very freeing. (A minimal sketch of this pattern follows below.)
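A small sketch of this workflow; the struct, fields, and functions are made up for illustration:

```julia
# analysis.jl -- sketch of a script driven by main() and includet
using UnPack   # provides @unpack

struct Params
    n::Int
    threshold::Float64
end

function summarise(p::Params)
    @unpack n, threshold = p            # pull out only the fields this function needs
    xs = rand(n)
    return count(>(threshold), xs) / n
end

function main()
    p = Params(10_000, 0.5)
    frac = summarise(p)
    @show frac
    return frac
end

# In the REPL:
#   using Revise
#   includet("analysis.jl")   # tracked include; edits to this file are picked up
#   main()                    # just re-run main() after each change
```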
2 Likes

Bold of you to assume I only have one workflow, XD.

It depends.

  1. If it is just a test/prototype that I can end up throwing in the trash by the end of the day, I just create a folder, activate an environment, install the necessary packages, and use a single .jl file with everything inside main. If it is small enough, I do not even start with git.
  2. An intermediate case is data analysis for a paper. In this case the git repository is my PhD repository, where I store all the papers I am working on. I just create a subfolder in the LaTeX folder of the paper and use Jupyter.
  3. If it is something I will dedicate some time to (weeks, months, …) and it is not just data analysis, then I start by using PkgTemplates to create a package (I do not remember if this already makes it a git repository, but I make sure it is a git repo). It does not matter whether I intend to register it afterwards or not; I do it because being a package allows me to dev it in the general environment and import it in short scripts I may write.

If (1) grows it becomes (3), and if (3) grows I start dividing it into submodules. Basically, I use CamelCase for the .jl files that are submodules (separation of namespace) and snake_case for the ones that are not submodules (i.e., that are just part of a module/namespace but separated for ease of search and reading). If a submodule is not a single file (i.e., it is broken into sub-submodules or has code separated into multiple snake_case.jl files), then it gets its own folder (inside src), recursively. snake_case.jl files are included a single time inside their respective modules. CamelCase.jl files are usually also included a single time, but at the start of the main package module; if they depend on each other, they import by means of the parent module (e.g., MyPkg has submodules A and B, which are included just after the module MyPkg line, and if B needs something from A, then inside module B I do import ..A: name_of_function_needed). A sketch of this layout is below.
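Roughly, the layout described above looks like this; all module and function names are placeholders:

```julia
# src/MyPkg.jl
module MyPkg

include("A.jl")             # CamelCase file: submodule A
include("B.jl")             # CamelCase file: submodule B
include("shared_utils.jl")  # snake_case file: plain code in the MyPkg namespace

end

# src/B.jl
module B

import ..A: name_of_function_needed   # reach the sibling submodule through the parent

end
```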

This is a constant struggle for me too. I’m becoming more and more a fan of creating a package straight away and building everything from the REPL. Then, tools like LaTeX or Pandoc can pick the output up. For me, scripts always lead to lots of code duplication and the problem of naming the scripts. Simply put: the benefit of just having a package is that all the logic is split into functions instead of files, so it is more versatile. Added benefits are the included tools for

  • testing
  • package version control (via [compat] or Manifest.toml)
  • documentation generation

Also, packages work well with the REPL and Revise.jl. With scripts, I’m always having to manually ensure that the state of the loaded code is the same as the state of the running code. Finally, working in a package has standard solutions for most of your questions too, because package developers also have to deal with writing tests, scaling, and passing parameters between different parts.
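As an illustration, the REPL loop looks something like this; the package and function names are hypothetical:

```julia
# REPL session sketch; MyAnalysis is a package dev'ed into the active environment
using Revise        # load Revise first so subsequent package loads are tracked
using MyAnalysis

MyAnalysis.run_experiment(; n = 100)   # edit src/, then just call again, no restart needed
```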

For me, running time is not really a thing, so I would just ensure that I can reproduce it by going back in Git’s history.

But, as I said, I’m also struggling with this so there might be better approaches.

3 Likes

That sounds interesting, I would like to understand it better. Do you have any public examples to look at?
Do files that are include-d do some work or only define functions?

Cases (1) and (2) are clear to me.
Case (3) is what I am most interested in. My projects are several months long.

What you do with modules and submodules sounds interesting. What is the use case for the submodules?
I often struggle with placing functions of different generality in modules. For most of my functions, I have a general version and a less general one that takes some const object. Maybe that is exactly the use case for a submodule … (?)

We use dvc.org in our tech stack. This is a tool built on top of git that versions your data, controls data provenance, ensures reproducibility, and maintains your computational pipelines.

Generally, our file directory has a scripts/ folder which contains scripts that call functions from a module defined in src/. The outputs of those scripts get stored in subfolders of a data/ folder, and those outputs are tracked by dvc (as well as the commands used to run the script, and any inputs). Data is stored in any number of formats: it could be .csv or .sqlite or a binary format that Julia can directly read, like BSON.

In dvc, we mark the inputs and outputs of each particular stage, and dvc then builds up a DAG which charts the computational pipeline. If some intermediate stage has changed, dvc detects that change and re-runs anything that depends on it (as you specify).
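A hypothetical dvc.yaml fragment to illustrate how stages, dependencies, and outputs might be declared; all paths and script names are made up:

```yaml
stages:
  preprocess:
    cmd: julia --project scripts/preprocess.jl
    deps:
      - scripts/preprocess.jl
      - data/raw/measurements.csv
    outs:
      - data/processed/clean.csv
  analyze:
    cmd: julia --project scripts/analyze.jl
    deps:
      - scripts/analyze.jl
      - data/processed/clean.csv
    outs:
      - data/results/summary.csv
```

With something like this in place, dvc repro re-runs only the stages whose inputs have changed.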

4 Likes

I’ve summarised my own workflow here in the past. It basically comes down to minimizing performance problems and maximizing convenience/adaptability for me. Version management is still done by hand, but all the fancy stuff your editor does for you should work out of the box. I definitely use git repositories for every module, though.

I don’t usually use submodules. If it’s important enough to factor out into a submodule, it’s usually important enough to make it its own thing.

1 Like

Interesting! How does it work precisely?

Could you tell me which parts are (the most) unclear? Then I’ll elaborate on those.

Wow, that sounds just fantastic! Maybe that is what I am looking for.
Thanks! Going to try it on the weekend :slight_smile:

I was just confused about how LaTeX or Pandoc can pick up the output.
What kind of output?
Are you creating plots with PGFPlots? Or maybe generating the text of your papers with Julia :slight_smile: ?

1 Like

Dvc sounds cool. What do you think about it? I dislike it when tools work for the most part but don’t allow me to do fine-tuning (for example, when a tool manages citations but sometimes gets them wrong). Does that happen with dvc?

Edit: And is it difficult to set up / maintain? I ask because I see mentions of databases, config files, and cloud storage.
Edit2: Do you also know how it differs from Pachyderm?

Ah, yes. Plots and tables. Latexify.jl has some nice conversions from DataFrame to other formats.
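For example, a sketch assuming Latexify’s table environment accepts a DataFrame, as suggested above; the data are made up:

```julia
using DataFrames, Latexify

df = DataFrame(method = ["A", "B"], rmse = [0.12, 0.08])
latexify(df; env = :table) |> print   # LaTeX table a paper can \input or paste in
```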