A Julia DataAnalysis Sysimage from PackageCompiler: It's so easy you should do it too!

If you want to use Julia for the kind of thing where you might fire up R and read a couple CSV files into some DataFrames, maybe grab some data from a SQLite file, manipulate the data a little, make a few plots, and be done… Then it’d be nice to have a quick-load sysimage so you don’t have too much “time to first plot”. It turns out that PackageCompiler has gotten to the point where this is fairly trivial. Here are two scripts I’m using to build my own sysimage that can do all these things, including RCall.

Here is the script I’m using as the precompile_execution_file. This file does some “example” tasks to help PackageCompiler figure out what needs precompiling.

using StatsPlots, CSV, DataFrames, DataFramesMeta, SQLite, GLM, Optim, Dates, RCall

df = DataFrame(x=rand(100),y=rand(100),z=Date(2000,01,01) .+ Dates.Day.(rand(Int,100).% 100))

CSV.write("testfile.csv",df)

df2 = CSV.File("testfile.csv")

p = @df df plot(:x,:y)
@df df plot!(:y,:x)
display(p)

p2 = @df df scatter(:z,:y)
@df df scatter!(:z,:x)
display(p2)

h1 = @df df histogram(:x)
@df df histogram!(:y)
display(h1)

d1 = @df df density(:x)
@df df density!(:y)
display(d1)

ols = lm(@formula(y~x),df)
display(ols)


@chain df begin
    @subset :x .> .5
    @subset :y .< .5
    @orderby :x
    @transform :p = 2 * :x
end



@rput df
R"library(ggplot2); p = ggplot(df) + geom_point(aes(x,y)); print(p)"

db = SQLite.DB("foo.db")
SQLite.load!(df,db,"foo")
df3 = DBInterface.execute(db,"select * from foo where x > ?", (.25,)) |> DataFrame
df4 = DBInterface.execute(db,"select * from foo") |> DataFrame


And to build the sysimage:

using PackageCompiler
ENV["PYTHON"]="/home/dlakelan/miniconda3/bin/python"
using Pkg
Pkg.build("PyCall")

create_sysimage([:StatsPlots,:CSV,:DataFrames,:DataFramesMeta,:SQLite,:GLM,:Optim,:RCall],sysimage_path="sys_dataanalys.so",precompile_execution_file="dataanalys.jl")

After running all that, and a minute or two later… I’ve got sys_dataanalys.so, so I can do:

julia --sysimage sys_dataanalys.so

and then doing data analysis is quick!
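
To confirm that a session really started from the custom image, one option is to ask Julia which sysimage it loaded and which packages are already present. This is only a minimal sketch: `Base.JLOptions()` and `Base.loaded_modules` are internal, undocumented APIs, so their behavior is an assumption that may change between Julia versions.

```julia
# Path of the sysimage this session was started with.
# Base.JLOptions() is internal API, used here only as a quick check.
image_path = unsafe_string(Base.JLOptions().image_file)
println("Running from sysimage: ", image_path)

# Packages baked into the image should already be loaded at startup,
# so they are expected to show up here without any `using` statement.
loaded = sort([pkg.name for pkg in keys(Base.loaded_modules)])
println(join(loaded, ", "))
```

When started with `julia --sysimage sys_dataanalys.so`, the printed path should point at that file rather than at the default image shipped with Julia.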


Also vscode

https://www.julia-vscode.org/docs/dev/userguide/compilesysimage/


I have made a sysimage with vscode, and it’s dead easy, and works fine, even with a lot of packages in the project. Does anyone know of a way of providing a custom precompile file to the vscode image build process?

How do you deal with the fact that you can’t install or update anything after compilation? Also, do you type the long command with the sysimage path every time or do you make some alias to it so it’s more convenient?

If you make the sysimage with PackageCompiler, you can still save it in the folder with your Project.toml as JuliaSysimage.so (or JuliaSysimage.dll on Windows) and vscode will detect it when launching its special REPL. Of course that will only work for the project(s) where you’ve saved the sysimage.
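
Since the shared-library extension differs by platform, one way to avoid hard-coding it is to build the file name from `Libdl.dlext`. A small sketch; the variable name `sysimage_name` is just for illustration:

```julia
using Libdl

# Libdl.dlext is the platform's shared-library extension without the dot:
# "so" on Linux, "dll" on Windows, "dylib" on macOS.
sysimage_name = "JuliaSysimage." * Libdl.dlext
println(sysimage_name)
```

The resulting string can then be passed as `sysimage_path` when building the image inside the project folder.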

I’m not sure if that’s possible at the moment. @davidanthoff ?

I don’t think that’s right. You can absolutely Pkg.add(“Stuff”) and using Stuff just like with vanilla Julia, but it won’t be part of the precompiled things, so you pay the precompilation time.

If you decide you want some additional packages precompiled into the image, adjust the two scripts to include those packages and some code that exercises the functionality, and rebuild.
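
Because the package list appears in both scripts, it is easy for the two to drift apart. One possible way to keep them in sync (a sketch only; `PACKAGES` and `using_line` are hypothetical names, not part of PackageCompiler) is to keep a single list and derive the `using` line from it:

```julia
# Hypothetical single source of truth for the packages to bake into the image.
PACKAGES = [:StatsPlots, :CSV, :DataFrames, :DataFramesMeta, :SQLite, :GLM, :Optim, :RCall]

# Build the `using` line that should open the exercise script.
using_line(pkgs) = "using " * join(string.(pkgs), ", ")

println(using_line(PACKAGES))
# The same vector can be passed as the first argument to create_sysimage.
```

Standard libraries used by the exercise script, such as Dates, still need their own `using` line.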


Just that adding usually updates dependencies if you don’t pass some flag. I guess I need to try it out :slight_smile:

This is what I thought too, and it was easy enough that I posted this to encourage people. I’ve been pretty frustrated with the time to first plot when you need to read / write some data and do some manipulation before getting the plot in particular (since you pay compilation time for CSV, DataFrames, SQLite, Plots, KernelDensity, GLM etc etc). The thing is at least for me, it’s often very similar types of quick analyses I want to do, so a precompilation execution script that simulates that kind of process is all you need to make sure you get most things prebuilt.
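
The latency being described can be seen in miniature with any function that has not been compiled yet. In this sketch, `mean_square` is a hypothetical stand-in for an analysis step; the first call pays for compilation and the second reuses the compiled code, which is the cost the exercise script moves into the sysimage build.

```julia
# Hypothetical stand-in for "some analysis step".
mean_square(xs) = sum(abs2, xs) / length(xs)

data = rand(1_000)
first_call  = @elapsed mean_square(data)   # includes JIT compilation
second_call = @elapsed mean_square(data)   # compiled code is reused

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```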

I just use a script called “juliadata”

#!/bin/sh

julia --sysimage /home/..../mysysimage.so

Thanks for the useful example.

What’s the purpose of

Pkg.build("PyCall") ?

I think I got an error telling me to run that :joy: I don’t remember exactly.

I can’t emphasize enough how much of a boon this has been to my data analysis projects. I run “juliadata” instead of “julia” and when I do any kind of typical “using” statement such as “using StatsPlots” or “using DataFrames” or any of that… it just immediately finishes! Saved me many many minutes and much frustration!


Since “It’s so easy you should do it too!”, I tried it, and I think this should be included in the default Julia install (see below). Way back I tried PackageCompiler as an experiment; it worked, but it was so slow that I’ve avoided it since, as I don’t really need it. I thought it was faster for you.

Heads up: PackageCompiler doesn’t work in Julia 1.8 (a known bug; I went down a rabbit hole investigating, see the other thread), but it does work in 1.7-rc1.

Are you serious? For me it took 14.8 min (though it was worth the wait):

(@v1.7) pkg> activate --temp
  Activating new project at `/tmp/jl_kes6Xk`

(jl_kes6Xk) pkg> add StatsPlots, DataFrames, SQLite, GLM, Optim, CSV, DataFramesMeta

julia> @time create_sysimage([:StatsPlots,:CSV,:DataFrames,:DataFramesMeta,:SQLite,:GLM,:Optim,:RCall],sysimage_path="sys_dataanalysis.so",precompile_execution_file="dataanalysis.jl")
[ Info: PackageCompiler: Executing /home/pharaldsson_sym/discretionary_dash/dataanalysis.jl => /tmp/jl_packagecompiler_poaCpQ/jl_14W9XW
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
[..]
[ Info: PackageCompiler: Done
[ Info: PackageCompiler: creating system image object file, this might take a while...
/usr/bin/ld: warning: /home/pharaldsson_sym/julia-1.7.0-rc1/lib/julia/libstdc++.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/pharaldsson_sym/julia-1.7.0-rc1/lib/julia/libstdc++.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
/usr/bin/ld: warning: /home/pharaldsson_sym/julia-1.7.0-rc1/lib/julia/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/pharaldsson_sym/julia-1.7.0-rc1/lib/julia/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
/usr/bin/ld: warning: /home/pharaldsson_sym/julia-1.7.0-rc1/lib/julia/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/pharaldsson_sym/julia-1.7.0-rc1/lib/julia/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
888.670988 seconds (2.88 M allocations: 165.100 MiB, 0.01% gc time, 0.05% compilation time)

$ ~/julia-1.7.0-rc1/bin/julia --sysimage=sys_dataanalysis.so

 Downloading artifact: MKL
    Downloading [==========================>              ]  63.5 %

Did you get similar output including MKL…?

There’s no purpose in that context. It’s not even built into the sysimage. I would also consider PythonCall.jl instead (or in addition), which has pros and cons. I haven’t used it, but I believe the major con is the slow startup, which a sysimage would eliminate.

I’m not suggesting a sysimage with these packages should be distributed with Julia or as a separate alternative one. That would make the download larger. But a script to make one is tiny and could be included. Preferably with a hint on startup how to make it and use it. That seems even better than just documenting somewhere where people may not read. Most users are also going to miss this thread on discourse…

It might be controversial what to add to a sysimage, e.g. RCall. I’m not really opposed to it, it seems like a nice package, but ideally all dependencies should be taken care of. I’m not sure we want to suggest that R and Python, or their packages and pip (which is the real benefit), are supported. Those and some other packages could be included in the script, commented out.

What’s your reasoning for including RCall? Just helpful in general or just for ggplot2? I understand it’s a nice package, but is it better than some Julia alternative, Plots.jl or e.g. Makie.jl? I’m guessing you just use it out of habit.

For your script, all of it worked the first time around in Julia 1.8 (though slowly at first), just not for making a sysimage. In 1.7 I had to comment out the last two lines, and I can’t close the plot now (I’m pretty sure I could before, and I assume this is independent of the sysimage):

julia> R"library(ggplot2); p = ggplot(df) + geom_point(aes(x,y)); print(p)"
┌ Warning: RCall.jl: Warning in (function (display = "", width, height, pointsize, gamma, bg,  :
│   locale not supported by Xlib: some X ops will operate in C locale
│ Warning in (function (display = "", width, height, pointsize, gamma, bg,  :
│   X cannot set locale modifiers
└ @ RCall ~/.julia/packages/RCall/iMDW2/src/io.jl:160
RObject{VecSxp}

I think everyone should make their own sysimage with the packages they want. I actually haven’t used RCall or PyCall even once since using this, but I had a project where I thought I would need to use some R functions (making Kaplan Meier plots, for which I couldn’t find a Julia package) so I threw it in there. I may remove it.

What has been a huge boon is CSV, DataFrames, DataFramesMeta, StatsPlots, Turing, and some others. That’s like bread and butter for many of my projects.

How long does it take? I don’t remember really, but it’s short enough that I didn’t worry too much. I don’t think 15 minutes, but then I had a lot of stuff installed already in my global environment, so it didn’t need to download or install anything.

I’d appreciate it if someone could share a script to create a sysimage containing the Makie.jl stack!

Just use this script except add GLMakie to this part:

create_sysimage([:StatsPlots,:CSV,:DataFrames,:DataFramesMeta,:SQLite,:GLM,:Optim,:RCall,:GLMakie],sysimage_path="sys_dataanalys.so",precompile_execution_file="dataanalys.jl")

and then in the “exercise” script,

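using StatsPlots, CSV, DataFrames, DataFramesMeta, SQLite, GLM, Optim, Dates, RCall
import GLMakie  # import rather than using: Plots and Makie both export names such as plot and scatter, so Makie calls are qualified as GLMakie.scatter etc.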

...

## do some makie plotting here to exercise the commonly used plots you often use

Everything else should work.
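
As a rough idea of what the Makie part of the exercise script could look like, here is a minimal sketch, assuming GLMakie is installed and a display is available. It uses `import GLMakie` and qualified calls because Plots and Makie both export names such as `plot` and `scatter`, which would otherwise clash with the StatsPlots calls earlier in the script. The particular plot types are placeholders; substitute the ones you actually use.

```julia
# `import` rather than `using`, so Makie's exports do not clash with StatsPlots names.
import GLMakie

xs = range(0, 10; length=100)

fig = GLMakie.Figure()
ax = GLMakie.Axis(fig[1, 1]; xlabel="x", ylabel="y")
GLMakie.lines!(ax, xs, sin.(xs))       # line plot
GLMakie.scatter!(ax, xs, cos.(xs))     # scatter plot on the same axis

ax2 = GLMakie.Axis(fig[1, 2])
GLMakie.hist!(ax2, randn(1_000))       # histogram in a second panel

display(fig)   # opening the window exercises the rendering path as well
```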


I think I tried it before with a similar approach and it didn’t speed up the plots. If someone can share a script that is working on a different machine, I can retry here.

I haven’t delved into Makie yet, but it should work. Make sure the exerciser script runs the plot types that you plan to use most.

To make this even easier for people, I’ve got a repo set up where you can start with a working project.

Just clone the repo:
https://github.com/dlakelan/JuliaDataSysimageMaker

then in buildimg.jl go in and change the output path for the sysimage to match where you want it on your machine…

then

julia buildimg.jl

and go to lunch or get some coffee. It takes like 10 mins to build on my machine.

If you have additional packages you’d like to include and additional functionality you’d like to exercise before the build, then add the packages to the buildimg.jl file package list, and edit the dataanalys.jl “exerciser” file.

You can install the command juliadata, which is a shell script. Edit the path to your sysimage in that file before copying it to your ~/bin/ directory; then you can run

juliadata ....

and it’ll work just like julia except with your precompiled packages.

If you don’t have the packages installed in the “root environment” then you’ll get warnings that seem to be inconsequential when you start it up. It all seems to work fine though.
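
To see in advance which of the image's packages are missing from the environment you start Julia in, something like the following might help. It is a sketch that assumes the active environment is the default one and uses `Pkg.project()`, which the Pkg documentation marks as experimental; the `wanted` list simply mirrors the packages from the build script.

```julia
using Pkg

# Packages baked into the sysimage (copied from the build script).
wanted = ["StatsPlots", "CSV", "DataFrames", "DataFramesMeta",
          "SQLite", "GLM", "Optim", "RCall"]

# Direct dependencies of the currently active environment.
installed = keys(Pkg.project().dependencies)

missing_pkgs = setdiff(wanted, installed)
if isempty(missing_pkgs)
    println("All sysimage packages are in the active environment.")
else
    println("Not in the active environment: ", join(missing_pkgs, ", "))
    # Pkg.add(missing_pkgs) would install them.
end
```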

All this tested on Linux, but not Windows or Mac.


@juliohm, to get the full speedup you have to create a precompilation script that imports Makie and runs a plot command. This ensures the plotting code you exercise gets compiled into the sysimage. Let me see if I can find my script.
