An exploratory workflow with a main.jl file that calls different scripts via include is a perfectly acceptable workflow.
If you are only working with a single file, then you want to use Revise.jl’s includet, which means you don’t have to run include all the time.
But you should really be trying to put everything in functions.
Here’s what I do.
Start out with main() at the bottom of your script.
Use includet on that file and call main() when you want to run everything.
Write everything in main; when it gets too big, break things into smaller functions that are called within main.
Use Infiltrator.jl extensively in debugging: @exfiltrate; error("") when you want to break and send everything in that function’s local scope to global scope for inspection at the REPL.
Use dictionaries, named tuples, and custom structs to store information. Pass large objects to the functions and use @unpack to work with just the things you need. This means you don’t have to agonize about what you are and are not passing to your sub-functions.
Just run main() all the time. If you use includet then everything will always be up to date. It’s very freeing.
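As a minimal sketch of that layout (all names here — process, params, n, scale — are made up for illustration):

```julia
# script.jl -- load once with `using Revise; includet("script.jl")`,
# then just re-run `main()` at the REPL after every edit.

function process(params)
    # Julia 1.7+ property destructuring; UnPack.jl's @unpack does the same thing.
    (; n, scale) = params
    return scale .* (1:n)
end

function main()
    # One named tuple carries everything; sub-functions pull out only what
    # they need, so you never agonize over argument lists.
    params = (n = 5, scale = 2.0)
    result = process(params)
    # While debugging, Infiltrator.jl's `@exfiltrate; error("")` placed here
    # would send this local scope to global scope for REPL inspection.
    return result
end

main()
```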
Bold of you to assume I only have one workflow, XD.
If it is just a test/prototype that I can throw in the trash by the end of the day, I just create a folder, activate an environment, install the necessary packages, and use a single .jl file with everything inside main. If it is small enough, I do not even start with git.
An intermediate case is data analysis for a paper. In this case the git repository is my PhD repository, where I store all the papers I am working on. I just create a subfolder in the paper’s latex folder and use Jupyter.
If it is something I will dedicate some time to (weeks, months, …) and it is not just data analysis, then I start right away by using PkgTemplates to create a package (I do not remember if this already makes it a git repository, but I make sure it is one). Whether I intend to register it afterwards does not matter; I do it anyway, because being a package allows me to dev it in the general environment and import it in short scripts I may write.
If (1) grows it becomes (3), and if (3) grows I start dividing it into submodules. Basically, I use CamelCase for .jl files that are submodules (separation of namespace) and snake_case for ones that are not submodules (i.e., that are just part of a module/namespace but separated for ease of search and reading). If a submodule is not a single file (i.e., it is broken into sub-submodules or has code separated across multiple snake_case.jl files), then it gets its own folder inside src, recursively. snake_case.jl files are included a single time inside their respective modules. CamelCase.jl files are also usually included a single time, at the start of the main package module, and if they depend on each other they import through the parent module (e.g., MyPkg has submodules A and B, included just after the module MyPkg line, and if B needs something from A, then inside module B I do import ..A: name_of_function_needed).
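Concretely, the submodule wiring described above looks like this (collapsed into one file for brevity; in a real package, A and B would live in their own CamelCase.jl files included right after the module MyPkg line, and the function names are hypothetical):

```julia
module MyPkg

module A
# Some functionality that a sibling submodule will need.
name_of_function_needed(x) = 2x
end # module A

module B
# A sibling submodule is reached through the parent module with `..`:
import ..A: name_of_function_needed
double_it(x) = name_of_function_needed(x)
end # module B

end # module MyPkg
```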
This is a constant struggle for me too. I’m becoming more and more a fan of creating a package straight away and doing everything from the REPL. Then, tools like LaTeX or Pandoc can pick the output up. For me, scripts always lead to lots of code duplication and the problem of naming the scripts. Simply put: the benefit of just having a package is that all the logic is split into functions instead of files, so it is more versatile. An added benefit is the included tooling for package version control (via [compat] or Manifest.toml).
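For reference, the [compat] section lives in the package’s Project.toml; the package names and bounds here are just illustrative:

```toml
[compat]
julia = "1.6"
DataFrames = "1"      # any 1.x release
CSV = "0.10"          # 0.10.x only (0.y versions are treated as breaking)
```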
Also, packages work well with the REPL and Revise.jl. With scripts, I’m always having to manually ensure that the state of the loaded code is the same as the state of the running code. Finally, working in a package has standard solutions for most of your suggestions too, because package developers also have to deal with writing tests, scaling, and passing parameters between different parts.
For me, running time is not really a thing, so I would just ensure that I can reproduce it by going back in Git’s history.
But, as I said, I’m also struggling with this so there might be better approaches.
It is all clear regarding (1) and (2).
Case (3) is what I am interested in the most. My projects are several months long.
What you do with modules and submodules sounds interesting. What is the use case for the submodules?
I often struggle with placing functions of different generality in modules. For most of the functions, I have a general version and a less general one that takes some const object. Maybe that is exactly the use case for submodules … (?)
We use dvc.org in our tech stack. This is a tool built on top of git that versions your data, controls data provenance, ensures reproducibility, and maintains your computational pipelines.
Generally, our file directory has a scripts/ folder containing scripts that call functions from a module defined in src/. The outputs of those scripts get stored in subfolders of a data/ folder, and those outputs are tracked by dvc (as well as the commands used to run the script, and any inputs). Data is stored in any number of formats - it could be .csv or .sqlite or a binary format that Julia can read directly, like BSON.
In dvc, we mark the inputs and outputs to each particular stage, and dvc then builds up a DAG which charts the computational pipeline. If some intermediate stage has changed, dvc detects that change and re-runs anything that it depends on (as you specify).
I’ve summarised my own workflow here in the past. It basically comes down to minimizing performance problems & maximizing convenience/adaptability for me. Version management is still done by hand though, but all the fancy stuff your editor does for you anyway should work out of the box. I definitely use git repositories for every module though.
I don’t usually use submodules. If it’s important enough to factor out into a submodule, it’s usually important enough to make it its own thing.
Dvc sounds cool. What do you think about it? I dislike it when tools work for the most part but don’t allow me to do fine-tuning (for example, when a tool manages citations but sometimes does it wrong). Does that happen with dvc?
Edit: And is it difficult to set up / maintain? I ask because I see mentions of databases, config files, and cloud storage.
Edit2: Do you also know how it differs from Pachyderm?
I find it easy to use because it’s built on top of git and very agnostic to your programming language, tools, etc. If you are currently running scripts which read some files and write to others, and that chain of computations can be described by a DAG, then it will work for you.
How you structure experiments (by different git branches, or by commits, or just in the file structure) is up to you, although they provide some tooling for tracking metrics and comparing different branches/experiments. 90% of use cases will be covered just by pull/push/add/run/repro, kind of like how 90% of version control needs are covered by a few git commands.
It contrasts with other ML pipeline tools because it is less opinionated and less tied to a particular workflow/programming language. This was the case when I surveyed the options about a year ago, and others may have developed since then. I think some people even use dvc in conjunction with other workflow tools, because it is relatively non-intrusive.
I don’t find there to be any strongly negative aspects to it, but it does keep you honest about any dependencies you mark and changes to those dependencies. Some people also find the one experiment/one branch idea to be too heavy, but I find that if you’re precise about what an “experiment” is then it’s not a problem.
Yes, this is more of a script-based workflow for long computations that need to be tracked as part of a report or other deliverable. For rapid prototyping I use a Pluto notebook to try out new ideas/plots etc, then migrate it into the dvc pipeline as it finalizes.
Here, each folder denotes files (input/output) and each script denotes a computational step. In general a script may take input from multiple stages, but the pipeline always satisfies the properties of a directed acyclic graph (DAG). In this case, we just have a two-stage computation. Now, if I change the code in data_plotting_script.jl, dvc will recognize that the code is different and re-run only that stage. In contrast, if I change data_processing_script.jl or the contents of the data/initial-data folder, then it will re-run the chain from the point where I made the change onwards.
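That two-stage pipeline could be declared in a dvc.yaml along these lines (the intermediate data/processed and data/plots folders are assumptions for the sketch; the paths for the scripts and initial data come from the example):

```yaml
stages:
  process:
    cmd: julia data_processing_script.jl
    deps:
      - data_processing_script.jl
      - data/initial-data
    outs:
      - data/processed
  plot:
    cmd: julia data_plotting_script.jl
    deps:
      - data_plotting_script.jl
      - data/processed
    outs:
      - data/plots
```

Running `dvc repro` then re-executes only the stages whose declared deps have changed.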
There is some overhead to this process! I need to specify the stages, make sure the scripts run, pay the Julia start-up cost, and think about the staging and logical boundaries of the computation. That’s why (as I mentioned above) I usually prototype it in a Pluto notebook or through a test-driven development process in my IDE. But, if you just need it to store your data with your git commit, I find it works for that too.