Can Julia really be used as a scripting language? (Performance)

I’ve been reading the docs and getting pretty excited about Julia’s potential as a replacement for Ruby in my workflows. However, I ran into a brick wall very quickly once I tried to actually use it: it’s incredibly slow.

This is somewhat surprising to me, as I’m only using versions greater than 1.0 and have now switched to 1.4.1 (the latest on the downloads page). At first the REPL loading time was a problem, but 1.4.1 fixes that. My current test case is a two-line script: import CSV, then a for loop over CSV.File that does nothing in the loop body. I’m testing with a 4-line CSV file. The script takes almost 30 seconds to run on my desktop computer.

Is this just a case of my use case not being aligned with the purpose of Julia? I know the history here is long-running “data science” type operations that take days, etc. The REPL speed increase from ~1.0 to 1.4.1 gives me hope that the scripting use case is in fact interesting to someone other than me, but I’m basically wondering how much. Could I start using Julia today and hope that in a few years it’ll be usable, or is it so antithetical to the main purposes/uses of the environment that I should be looking elsewhere?

5 Likes

It’s well known that Julia needs to compile functions at startup. For scripting, the best that can be done is to use the lowest optimization level, julia -O0, and see if that helps. But I do know people who use Julia for scripting, so they may have other ways to improve the experience as well.

The other way is to compile a binary using PackageCompiler.jl, but that may not be possible in your case.

2 Likes

It might be worth trying to precompile your packages.
In the REPL, type ]precompile, which will take a minute or two but might speed things up afterwards.

1 Like

Well, here is some data:

pkonl@TeenyTiny MINGW64 ~/Documents
$ time ~/AppData/Local/Programs/Julia/Julia-1.5.0-DEV/bin/julia.exe -O0 do.jl
a=missing, b=1, c=1.0
a=missing, b=2, c=2.0
a=missing, b=3, c=3.0

real    0m13.356s
user    0m0.015s
sys     0m0.015s

pkonl@TeenyTiny MINGW64 ~/Documents
$ cat do.jl data.csv
using CSV

for row in CSV.File("data.csv")
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
a,b,c,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true

With Julia 1.5, precompiled.

2 Likes

Thanks for that. So it’s not just me or my settings: even doing everything “right” on the newest version, it’s very slow. Which is fine for many use cases, of course; I’m not trying to be judgemental here. I’m just curious whether this is considered an issue that should be worked on eventually, or just “the cost of the way we do things”.

Or maybe the CSV library is just very big and this isn’t often an issue? Hello world isn’t too slow (about the same speed as Ruby, and faster than runhaskell in my tests).

1 Like

If you are not doing anything fancy and just need to read CSVs, you can use DelimitedFiles.readdlm, which lives in the standard library (i.e., you do not need to install any package, just import it). And yes, other people using CSV.jl have recently run into the same startup problem.
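For the data.csv shown earlier in the thread, a readdlm version might look like this (column positions assume the same layout as in that example; note that readdlm returns empty fields as empty strings rather than missing):

```julia
using DelimitedFiles

# readdlm with header=true returns a (data_matrix, header_row) pair.
# Columns a, b, c are the first three columns of the file.
data, header = readdlm("data.csv", ',', Any; header=true)

for i in 1:size(data, 1)
    # Empty cells in column a print as nothing here, not as "missing".
    println("a=$(data[i, 1]), b=$(data[i, 2]), c=$(data[i, 3])")
end
```

Since DelimitedFiles ships with Julia, this avoids loading (and compiling) CSV.jl entirely, which is where most of the startup cost was going.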

Also, I run my Julia scripts with the options -O0 --compile=min; you can even use --compile=no. The flag --help-hidden also shows a --trace-compile option to help you find out why startup takes so long. However, even with all of this, “first time to plot” is a known issue that the Julia developer team is actively fighting. My workaround for exploratory data analysis is a Jupyter notebook with the Revise package imported, left open all the time, so after the first call to each function the code runs blazing fast.
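For reference, those invocations look like this on the command line (script.jl is a placeholder name):

```shell
julia -O0 --compile=min script.jl       # lowest optimization, minimal JIT compilation
julia --compile=no script.jl            # skip JIT compilation where possible
julia --trace-compile=stderr script.jl  # log each method as it gets compiled
```

The --trace-compile output is also the usual starting point for building precompile statements later, if you go the sysimage route.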

7 Likes

Holy shit, thank you! This switch alone cuts my time down from 30s to 2.5s. It’s not great, but it’s definitely viable.


Switching from CSV to DelimitedFiles is probably OK for my case, and gets me to sub-second, so it seems worth it for now. Thanks for the tip.

12 Likes

For a related use case (call a simple Julia script repeatedly), where the start-up and compilation time would dominate the actual runtime, I’ve thought about using an implicit client/server approach to reuse a process.

This is inspired by Emacs server, which lets you run emacsclient <file> in a shell but actually opens the file in a long-running Emacs instance. It works by starting an Emacs server when Emacs starts up, then using emacsclient instead of emacs for later calls.

I imagine the following (simpler) workflow with Julia: there is a bash script juliaserver that can be used as juliaserver script.jl. The first time it is called, it starts a separate background process and uses it to execute the file (via include). On subsequent calls, the existing process is detected (e.g. by locating a specific file) and a short message with the call arguments is sent to it for execution.

There are still some open questions, such as: will the execution of later calls be influenced by previous calls, because of remaining variables in scope or invalid redefinition of types? How long should the process be kept running idle? Which environment should be active? So it’s not quite obvious how to decide on the details.
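One way to get fresh-session behavior in such a server would be to evaluate each script inside a newly created module, so top-level variables from one run can’t leak into the next. A rough sketch using the Sockets standard library (the port number and line-based protocol are made up for illustration, not an existing tool):

```julia
using Sockets

# Toy "juliaserver" loop: one script path per connection, each script run
# in a fresh anonymous module. Dependencies loaded with `using` stay
# compiled in the process, which is the whole point of keeping it alive.
function serve(port::Int=8765)
    server = listen(port)
    while true
        sock = accept(server)
        path = strip(readline(sock))
        sandbox = Module()              # fresh namespace per request
        try
            Base.include(sandbox, String(path))
        catch err
            println(sock, "error: ", err)
        end
        close(sock)
    end
end
```

This doesn’t answer the harder questions above (type redefinitions across dependency versions, idle lifetime, active environment), but it shows that per-call variable isolation at least is cheap.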

5 Likes

I think it’s both. It is a known cost of how Julia works, with its JIT compiler and aggressive type specialization. But it is also a very actively discussed issue that has a great deal of the core devs’ attention. I believe compiler performance is close to the top of their list of priorities.

2 Likes

Recently @kristoffer.carlsson gave a nice webinar on PackageCompiler; I’m not sure if it was recorded and is still available. But it showed how the startup issues can be reduced significantly.
If you know you need CSV.jl in a scripting environment, it might be worthwhile to add it to your Julia sysimage and launch that sysimage as your scripting environment. That should get the launch of Julia and the using CSV step down to sub-second.

The downside is some setup work and more manual version management of the packages, but for a scripting environment you probably want stable versions and don’t want things to change every few days.

Further speedup can be obtained by compiling the specific functions you need, but depending on how much code gets JIT-compiled, that might not even be necessary.
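Assuming PackageCompiler.jl is installed, the sysimage route might look roughly like this (the output file name and package list are just examples):

```julia
using PackageCompiler

# Build a custom sysimage with CSV baked in, so `using CSV` becomes
# essentially free. This takes several minutes and writes a shared
# library to disk; it only needs to be redone when packages are updated.
create_sysimage(["CSV"]; sysimage_path="csv_sysimage.so")

# Then run scripts against it from the shell:
#   julia -Jcsv_sysimage.so script.jl
```

Since the sysimage pins the compiled package versions, this fits the stable-versions trade-off mentioned above.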

3 Likes

You know you are basically describing a CLI for Jupyter notebooks, no?

3 Likes

That’s a good point. Maybe the Jupyter infrastructure can be reused for this, rather than building something from scratch with Sockets etc.

But it should be more implicit, with the kernel started automatically, and each new call using a fresh session, if possible. So we would only want to avoid recompilation, not keep any state between calls.

Well, starting the kernel automatically can be done by a wrapper script that checks whether it is running and starts it if it is not. The call does not need to be in a fresh session; it just needs to be wrapped in a function or some other scope so that it does not leak variables, unless you are really worried about some kind of global state.

I wonder if more of the compilation could be cached. If a precompiled sysimage can speed things up, couldn’t those parts be created and cached automatically for each module? I believe Guile does something like this.

In any case, I think --compile=min, while not fast, is fast enough for now, especially if there’s hope for it to get even better with time.

On the better with time point, see here. If you want to read lots more about this, search “time to first plot,” mentioned above, which is a shorthand for this issue (plotting packages are notorious for long compilation times).

But I would also encourage you to explore other workflows. I come from bioinformatics, where everything is a script, so I totally get the inertia. But now I use the REPL and Atom for most development, and only write scripts for long-running processes where compile time is a tiny fraction of the total. Think of it this way: when you’re coding interactively, use the interactive tools.

A couple of workflows have been mentioned here, and there are also great ways to use Atom or VS Code as development environments. I usually start Julia first thing, run a script that loads my packages (especially Revise.jl), get a cup of coffee, and then don’t worry about compilation for the rest of the day.

4 Likes

I totally understand that this is reasonable for the “data science” use case Julia is advertised for. If I want to ship a script to users, though, it needs to work as a script :slight_smile:

2 Likes

Perhaps automatically wrap each script in a module?

1 Like

I can wrap my code in a module. What would that accomplish? Most of the code is in dependencies, which are modules already.

I was suggesting that as a way to give each script its own namespace if run on a hypothetical long-running julia-server.

That server approach would have the advantage of only needing to compile the dependencies once. As long as the process keeps running in the background, subsequent runs of the script won’t have to recompile.

That won’t help you if the folks you’re sending it to only run it once.

Maybe somebody could distribute a “batteries included” Julia binary that uses PackageCompiler to let a number of the most popular packages start up instantly.

(This would be aimed at people who simply want to run scripts that they get sent to them, and who are not bothered by not having the very latest versions of packages.)