Hi all,
I’m writing data analysis scripts. Their form is usually like:
using Plots, XLSX, DataFrames
# load some data
# drop mangled parts
# calculate some statistics
# plot some stuff
plot(x, y,
     title="....",
     xlabel="...",
     ylabel="...")
png("savefig.png")
I’m wondering what people do as they’re developing analysis scripts. What workflows work for you?
Workflow 1
I currently intentionally do everything in Main. My scripts and their outputs are meant to be read by the humans on my teams; they’re not components in larger apps, and the extra nesting would be distracting. And doing everything in Main plays nice with Literate.jl (I don’t think Literate even works without working in Main, since it needs to intercept all the implicit display() calls, right?) when I’m ready to start publishing results.
I’m coming from Python and there I make heavy use of python -i. There, I also write everything in the global scope, run it with python -i, and then I have my datasets in the REPL to explore. When I write a line I like, I copy it into my script and re-run it.
A big advantage of this workflow in my experience is that the code is precisely what it says. I can hand my script to someone else and they will be able to run it. There are no tricky IDE-specific dependencies that can creep in because there’s only one entry point.
Julia also has -i, which is great. My workflow translates directly.
But this workflow is kind of slow because it has to recompile everything from scratch every time I run my scripts. Python has slow startup too, but Julia quickly gallops ahead the more libraries I pull in, and the wait between iterations breaks my flow.
Workflow 2
I know about Revise but it doesn’t update variables, which is a big problem when my work is analysing data. Naively, that would force me to change the way I write my analyses, maybe pushing me towards a functional style where literally every step has a function. That would be clean, but sort of unnatural for the sorts of semi-interactive explainer/exploratory work I’m trying to do (again: with working teams. I can’t code anything far out that needs an appreciation for the y-combinator).
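A minimal sketch of what that function-per-step style could look like. The function names (load_data, clean, summarize) and the toy vector are hypothetical stand-ins for a real load from XLSX; only the Statistics standard library is assumed.

```julia
using Statistics  # standard library, no extra packages needed

# Hypothetical stand-in for the real data load (e.g. reading a spreadsheet).
load_data() = [1.0, 2.0, NaN, 4.0, 5.0]

# Drop mangled entries; in this toy example, "mangled" just means NaN.
clean(xs) = filter(!isnan, xs)

# Calculate some statistics and return them as a named tuple.
summarize(xs) = (n = length(xs), avg = mean(xs), sd = std(xs))

# Every step is a function, so Revise can pick up edits to the definitions.
# The variables below are still only refreshed when these lines are re-run.
raw = load_data()
data = clean(raw)
stats = summarize(data)
```

The last three lines are the part Revise does not redo on its own: after editing clean or summarize, they have to be called again to refresh raw, data, and stats.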
The best I’ve found is this tip (I don’t have link rights yet, but this is the citation: hxxps://discourse.julialang.org/t/critique-my-workflow-for-small-models-in-julia/105089):
# projet.jl
using XLSX, DataFrames

function main()
    # evaluate the block at Main's top level so the variables become globals
    @eval Main begin
        df = DataFrame(XLSX.readtable("data.xlsx", "pages"))
        HIVER = [1, 2, 3, 4]
        ÉTÉ = [5, 6, 7, 8]
        AUTOMNE = [9, 10, 11, 12]
    end
end
main()
This, cleverly, runs the code in main() in the Main scope instead of in its own local scope.
To work with it I have to launch it like:
julia> using Revise
julia> includet("projet.jl")
julia> HIVER
4-element Vector{Int64}:
 1
 2
 3
 4
I don’t get fully automatic updates when I change code, but I can rerun main() without losing my Julia session. For example, suppose I decide December should count as winter; I just edit it in my editor:
# projet.jl
using XLSX, DataFrames

function main()
    # evaluate the block at Main's top level so the variables become globals
    @eval Main begin
        df = DataFrame(XLSX.readtable("data.xlsx", "pages"))
        HIVER = [12, 1, 2, 3]
        ÉTÉ = [4, 5, 6, 7, 8]
        AUTOMNE = [9, 10, 11]
    end
end
main()
And it’s a snap to rerun it because I save all the time spent compiling library functions.
julia> main()
julia> HIVER
4-element Vector{Int64}:
 12
  1
  2
  3
I like this because:
- it’s fast
I don’t like:
- that I have to adjust how I launch the code. It’s extra cognitive load for me and for anyone I teach
- that I have to adulterate the code
- that it doesn’t work with Literate.jl
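The scoping behaviour that the @eval Main wrapper in this workflow relies on can be sketched in isolation. The names below (local_version, global_version, demo_local, demo_global) are hypothetical and exist only for this illustration.

```julia
# Plain function body: `demo_local` lives in the function's local scope
# and is gone once the call returns.
function local_version()
    demo_local = [12, 1, 2, 3]
    return nothing
end

# Wrapped in `@eval Main`: the assignment is evaluated at Main's top level,
# so `demo_global` becomes a global that stays visible in the REPL.
function global_version()
    @eval Main begin
        demo_global = [12, 1, 2, 3]
    end
    return nothing
end

local_version()
global_version()

local_defined = isdefined(Main, :demo_local)    # was the local leaked into Main?
global_defined = isdefined(Main, :demo_global)  # did the @eval version define a global?
println("demo_local defined in Main: ", local_defined)
println("demo_global defined in Main: ", global_defined)
```

Calling global_version() again after editing the values rebinds the same global, which is what makes rerunning main() enough to refresh the data.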
Workflow 3
Literate.jl suggests this build harness when working with Revise:
using Revise
using Literate
entr(["analyse.jl"]) do
    try
        Literate.markdown("analyse.jl", "build", credit=false, execute=true, flavor=Literate.CommonMarkFlavor())
    catch e
        @warn "build failed:" exception = (e, catch_backtrace())
    end
end
This is fine but it’s getting pretty far from the simplicity of julia -i.
I like this because:
- it’s fast
I don’t like:
- it uses a different entrypoint than normal, which makes me worry about the risk of a dev/prod gap
  - though in the case of Literate, there already is a different entrypoint
- and if I need to examine variables I can @bp to drop in.
  - but it feels a bit awkward to me.
Workflow 4
I experimented with PackageCompiler.jl:
julia -e 'using PackageCompiler; create_sysimage(["Plots", "XLSX", "Distributions", "Statistics", "DataFrames", "Literate", "MarkdownTables"], sysimage_path="sys.so", precompile_execution_file="precompile.jl")'
Then iterate with:
julia -J sys.so -i projet.jl
I like this because:
- I use the same, unadulterated entrypoint as I would by default. There’s no need to adjust my code at all to play nice with the dev environment and it should run identically for anyone I share it with.
I don’t like this because:
- it (potentially) requires a new sysimage for each project
- building a sysimage is REALLY heavy; it takes something like 2GB of RAM and over half an hour on my machine. That makes it especially hard to iterate on, even though:
- getting precompile.jl right is tricky
This is something I would invest in for deploying to a cluster to crunch some numbers, or maybe to work on a bunch of related projects. It’s just, the time it takes to compile puts me off exploring it.
Workflow 5
I read (hxxps://discourse.julialang.org/t/output-doesnt-show-plots/111092/14) that the vscode plugin has a magic REPL that allows highlight-to-run code.
I imagine I’d struggle to be comfortable with this:
- it depends on a specific IDE; I use vscode too, but I also like to be able to work with notepad / gedit / vi in a pinch.
- it allows running code out of order, which just confuses me and leads to unreproducible results and/or bugs
Workflow 6
I guess people use Jupyter? Which holds the Julia session open as its “kernel”?
This also allows (even encourages) running code out of order.
What does everyone do to develop their scripts? I can’t be the only one struggling with analysis scripts. There are a lot of people doing science and engineering in Julia and I wonder what you all do with your “I’ll just whip it up in MATLAB” instincts.
Thanks for sharing your tips!