Exploratory research project workflow

An exploratory workflow with a main.jl file that calls other scripts via include is a perfectly acceptable workflow.

If you are only working with a single file, use Revise.jl’s includet so you don’t have to re-run include after every change.

But you should really be trying to put everything in functions.

Here’s what I do.

  1. Start out with main() at the bottom of your script.
  2. Use includet on that file and call main() when you want to run everything.
  3. Write everything in main. When it gets too big, break things into smaller functions that are called within main.
  4. Use Exfiltrator.jl extensively in debugging, as @exfiltrate; error("") when you want to break and send everything in that function’s local scope to global scope for inspection at the REPL
  5. Use dictionaries, named tuples, and custom structs to store information. Pass large objects to the functions and use @unpack to work with just the things you need. This means you don’t have to agonize about what you are and are not passing to your sub-functions.
  6. Just run main() all the time. If you use includet then everything will always be up to date. It’s very freeing.
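
A minimal sketch of the steps above, assuming a file called script.jl. The file name and the functions load_inputs and summarize are made up for illustration. To stay dependency-free it uses Base's property destructuring, (; xs, scale) = inputs (Julia 1.7 and later), as a stand-in for @unpack:

```julia
# script.jl: a sketch of the "everything lives in main()" layout.

# Hypothetical stand-in for reading real data; returns one large
# named tuple that gets passed around as a whole.
function load_inputs()
    return (xs = [1.0, 2.0, 3.0, 4.0], scale = 2.0, label = "demo")
end

# A sub-function receives the whole object and pulls out only what it needs.
function summarize(inputs)
    (; xs, scale) = inputs   # same spirit as @unpack xs, scale = inputs
    return sum(xs) * scale
end

# main() ties everything together and returns the results for inspection.
function main()
    inputs = load_inputs()
    total = summarize(inputs)
    return (label = inputs.label, total = total)
end
```

At the REPL you would run using Revise and includet("script.jl") once, then call main() after each edit; Revise picks up the changed function definitions without another include.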

Bold of you to assume I only have one workflow, XD.

It depends.

  1. If it is just a test/prototype that could end up in the trash by the end of the day, I just create a folder, activate an environment, install the necessary packages, and use a single .jl file with everything inside main. If it is small enough I do not even start with git.
  2. An intermediary case is data analysis for a paper. In this case the git repository is my PhD repository where I store all papers I am working on. I just create a subfolder in the latex folder of the paper and use Jupyter.
  3. If it is something I will dedicate some time to (weeks, months, …) and it is not just data analysis, then I start by using PkgTemplates to create a package (I do not remember if this already makes it a git repository, but I make sure it is a git repo). It does not matter whether I intend to register it later; I do this because being a package allows me to dev it in the general environment and import it in short scripts I may write.

If (1) grows it becomes (3), and if (3) grows I start dividing it into submodules. Basically, I use CamelCase for the .jl files that are submodules (separation of namespace) and snake_case for the ones that are not (i.e., that are just part of a module/namespace but separated for ease of search and reading). If a submodule is not a single file (i.e., it is broken into sub-submodules or has code separated into multiple snake_case.jl files), then it gets its own folder (inside src), recursively. snake_case.jl files are included a single time inside their respective modules. CamelCase.jl files are also included a single time, usually all at the start of the main package module, and if they depend on each other they import through the parent module (e.g., MyPkg has submodules A and B, which are included just after the module MyPkg line, and if B needs something from A, then inside module B I do import ..A: name_of_function_needed).
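
A sketch of the submodule pattern described above, written inline so that it runs as a single file. In a real package each inner module would presumably live in its own CamelCase file under src/ and be pulled in with include; the function names double and quadruple are hypothetical:

```julia
module MyPkg

# Would be include("A.jl") in a real package.
module A
double(x) = 2x
end

# Would be include("B.jl"); B reaches its sibling A through the parent
# module using a relative import (the two leading dots mean "one level up").
module B
import ..A: double
quadruple(x) = double(double(x))
end

end # module MyPkg
```

With this in place, MyPkg.B.quadruple(3) calls A's double twice, and neither submodule has to know where the other's file lives.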

This is a constant struggle for me too. I’m starting to become more and more a fan of starting with creating a package straight away and create everything from the REPL. Then, tools like LaTeX or Pandoc can pick the output up. For me, scripts will always lead to lots of code duplication and the problem of naming the scripts. Simply put: the benefit of just having a package is that all the logic is split into functions instead of files, so it is more versatile. Added benefits are the included tools for

  • testing
  • package version control (via [compat] or Manifest.toml)
  • documentation generation

Also, packages work well with the REPL and Revise.jl. With scripts, I’m always having to manually ensure that the state of the loaded code is the same as the state of the running code. Finally, working in a package has standard solutions for most of your suggestions too, because package developers also have to deal with writing tests, scaling, and passing parameters between different parts.

For me, running time is not really a thing, so I would just ensure that I can reproduce it by going back in Git’s history.

But, as I said, I’m also struggling with this so there might be better approaches.


That sounds interesting, I would like to understand it better. Do you have any public examples to look at?
Do files that are include-d do some work or only define functions?

It is all clear about (1) and (2).
Case (3) is what I am most interested in. My projects run for several months.

What you do with modules and submodules sounds interesting. What is the use case for the submodules?
I often struggle with placing functions of different generality in modules. For most of the functions, I have a general version and one, less general that takes some const object in. Maybe it is exactly the use case for the submodule … (?)

We use dvc.org in our tech stack. This is a tool built on top of git that versions your data, controls data provenance, ensures reproducibility, and maintains your computational pipelines.

Generally, our file directory has a scripts/ folder which contains scripts that call functions from a module defined in src/. The outputs of those scripts get stored in subfolders in a data/ folder, and those outputs are tracked by dvc (as well as the commands used to run the script, and any inputs). Data is stored in any number of formats: it could be .csv or .sqlite or a binary format that Julia can directly read, like BSON.
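
To make the layout concrete, here is a rough sketch of what one such script could look like. Everything named here is an assumption for illustration: the Analysis module stands in for the module under src/, and run_stage and squares.csv are made-up names. It writes CSV by hand so that it needs no packages:

```julia
# Stand-in for the module that would normally be defined in src/.
module Analysis
process(xs) = xs .^ 2
end

# Body of a hypothetical scripts/ entry: call into the module and
# write the result into an output folder that dvc could then track.
function run_stage(outdir)
    mkpath(outdir)                          # create data/... if missing
    outfile = joinpath(outdir, "squares.csv")
    xs = 1:3
    open(outfile, "w") do io
        println(io, "x,x_squared")          # header row
        for (x, y) in zip(xs, Analysis.process(xs))
            println(io, x, ",", y)
        end
    end
    return outfile
end
```

Keeping the logic in the module and the file handling in the script means the same function can also be called from the REPL or from tests.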

In dvc, we mark the inputs and outputs of each particular stage, and dvc then builds up a DAG which charts the computational pipeline. If some intermediate stage has changed, dvc detects that change and re-runs anything that depends on it (as you specify).


I’ve summarised my own workflow here in the past. It basically comes down to minimizing performance problems & maximizing convenience/adaptability for me. Version management is still done by hand though, but all the fancy stuff your editor does for you anyway should work out of the box. I definitely use git repositories for every module though.

I don’t usually use submodules. If it’s important enough to factor out into a submodule, it’s usually important enough to make it its own thing.

Interesting! How does it work precisely?

Could you tell me which parts are (the most) unclear? Then, I’ll elaborate on those

Wow, sounds just fantastic! Maybe that is what I am looking for.
Thanks! Going to try it on the weekend.

I was just confused about how LaTeX or Pandoc can pick up the output?
What kind of output?
Are you creating plots with PGFPlots? Or maybe generating the text of your papers with Julia?


Dvc sounds cool. What do you think about it? I dislike it when tools work for the most part but don’t allow me to do fine-tuning (for example, when a tool manages citations but sometimes does it wrong). Does that happen with dvc?

Edit: And is it difficult to set up / maintain? I ask because I see mentions of databases, config files and cloud storage.
Edit2: Do you also know how it differs from Pachyderm?

Ah, yes. Plots and tables. Latexify.jl has some nice conversions from DataFrame to other formats

I find it easy to use because it’s built on top of git and very agnostic to your programming language, tools, etc. If you are currently running scripts which read some files and write to others, and that chain of computations can be described by a DAG, then it will work for you.

How you structure experiments (by different git branches, or by commits, or just in the file structure) is up to you, although they provide some tooling for tracking metrics and comparing different branches/experiments. 90% of use cases will be covered just by pull/push/add/run/repro, kind of like how 90% of version control needs are covered by a few git commands.

It contrasts with other ML pipeline tools because it is less opinionated and less tied to a particular workflow/programming language. This was the case when I surveyed the options about 1 year ago, and others may have developed since then. I think some people may even use dvc in conjunction with other workflow tools, because it is relatively non-intrusive.

I don’t find there to be any strongly negative aspects to it, but it does keep you honest about any dependencies you mark and changes to those dependencies. Some people also find the one experiment/one branch idea to be too heavy, but I find that if you’re precise about what an “experiment” is then it’s not a problem.


Indeed, dvc looks great and easy to use.
I read the documentation and watched the tutorials yesterday. I noted a few potential issues, though there might be simple solutions to them:

  • it is not clear how to deal with advanced plots produced as results of the script; should one treat them as models and data tracked by dvc?
  • I had the impression that dvc assumes one pipeline per project. I wonder how one deals with several (separate folders with dvc init? how to navigate between them?)
  • it seems that repro and the pipeline require starting Julia for every step. Sometimes with Julia, it is good to keep the session running.

@platawiec, could you please share your experience?

Sorry, some of the questions were too simple

An idea in CML is to reproduce them in the push pipeline

dvc repro takes the name of the pipeline as an argument, so it is all scalable

I would love to see some examples already working with Julia. Is there anything public?

If you have time, could you also give a bit of a high-level description?

I’m looking at a tutorial too and it seems like it’s just Git but with metadata files pointing to the data. Then, I don’t understand why the DAG is needed.

Ah, because DVC assumes the files to be too big to diff, so you cannot merge branches back into the main branch.

Yes, this is more of a script-based workflow for long computations that need to be tracked as part of a report or other deliverable. For rapid prototyping I use a Pluto notebook to try out new ideas/plots etc, then migrate it into the dvc pipeline as it finalizes.

Something like DaemonMode.jl (dmolina/DaemonMode.jl on GitHub, a client-daemon workflow to run scripts faster in Julia) would probably work to maintain a Julia session across runs within this framework, though I haven’t tried it myself.

When I say “DAG”, I’m referring to the computational pipeline that the user specifies. A common pattern I follow is something like:

[data/initial-data]
-> 
(data_processing_script.jl)
->
[data/processed-data]
-> 
(data_plotting_script.jl)
->
[data/plots]

Where [] denotes some folder containing files (input/output) and () denotes a computational step. In general a script may take input from multiple stages, but the pipeline always satisfies the properties of a directed acyclic graph (DAG). In this case, we just have a two-stage computation. Now, if I change the code in data_plotting_script.jl, dvc will recognize that the code is different and re-run only that stage. In contrast, if I change data_processing_script.jl or the contents of the data/initial-data folder, then it will re-run the chain from the point where I made the change onward.
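
The re-run rule can be modelled in a few lines of Julia. This is only an illustrative sketch of the idea, not how dvc is implemented: the stage names "process" and "plot" and the function stale_stages are made up, and real change detection (hashing files) is replaced by simply passing in the list of changed paths:

```julia
# Each stage lists what it reads (deps) and what it writes (outs).
stages = [
    (name = "process",
     deps = ["data_processing_script.jl", "data/initial-data"],
     outs = ["data/processed-data"]),
    (name = "plot",
     deps = ["data_plotting_script.jl", "data/processed-data"],
     outs = ["data/plots"]),
]

# Return the names of the stages that must re-run, given the changed paths.
# A stage is stale if any of its deps changed; its outputs then count as
# changed too, which propagates staleness to downstream stages.
function stale_stages(stages, changed)
    dirty = Set(changed)
    stale = String[]
    progress = true
    while progress                      # repeat until nothing new is marked
        progress = false
        for s in stages
            if !(s.name in stale) && any(in(dirty), s.deps)
                push!(stale, s.name)
                union!(dirty, s.outs)
                progress = true
            end
        end
    end
    return stale
end
```

Calling stale_stages(stages, ["data_plotting_script.jl"]) marks only the plotting stage, while stale_stages(stages, ["data/initial-data"]) marks both stages, matching the two cases described above.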

There is some overhead to this process! I need to specify the stages, make sure the scripts run, pay the Julia start-up time cost, and think about the staging and logical boundaries of the computation. That’s why (as I mentioned above) I usually prototype it out in a Pluto notebook or through a test-driven development process in my IDE. But, if you just need it to store your data with your git commit, I find it works for that too.


How do you deal with the config files?

  • TOML, YAML, or Julia scripts?
  • Several nested files, or a single one with const hyperparameters?