Help to get my slow Julia code to run as fast as Rust/Java/Lisp

Hi there!

I had included some Julia code in a benchmark I’ve run to see how it performs compared to a few other languages. Given what other people say, Julia should be very fast, but the program I wrote is much slower than e.g. Rust, Java and Lisp, so I supposed I’ve made a big mistake somewhere :smiley: .

The program is a port of the Common Lisp version which you can find here.

My Julia port is here.

Notice this is my first ever Julia code! So I am sure I made lots of mistakes. But I would very grateful if anyone could show me what can be improved in order to make the code run as fast or faster than other programming languages.

I tried a few optimisations like replacing the arrays in charToDigit with tuples but that didn’t help much.

If you’re curious, this problem has been the subject of two blog posts I wrote (links in the README on GitHub as I can’t post more links here it seems):

  1. Revisiting Prechelt’s paper and follow-ups comparing Java, Lisp, C/C++ and scripting languages

  2. How to write really slow Rust code

I only briefly mention Julia in those because I didn’t have time to verify and optimise the code yet, hence why I am here now. It would be fun to write a new blog post “showing how to write slow Julia code” similar to the Rust one, where the improvements to my initial version of the Julia code can be explained, and hopefully Julia programmers can benefit from learning where they need to be careful regarding performance. As in the Rust post, I obviously will give full credits to anyone helping!

Thanks for any assistance!

EDIT: a few contraints of the problem for those who don’t have time to read the full requirements:

  • input phone numbers are encoded in ASCII but may contain non-digits that can be discarded.
  • dictionary words also are encoded in ASCII and again, may contain non-alphabetic chars that can be discarded.
  • both numbers and words MUST be printed as they are in the original files (cannot discard non-valid chars when printing them).
  • maximum phone number lenght is 50 (don’t replace BigInt with Int64 for example, it won’t fit).
  • sysout can be buffered but the program must print all solutions.

Please ask if anything else is unclear.

1 Like

I’ve only cursory looked at the implementation when your followup post hit lobste.rs a few days ago, but what I noticed was that your benchmarking script repeatedly starts julia, leading to recompilation of all code. That’s included in the benchmarking time, as far as I can tell. I know that you’ve put julia under “scripting languages”, but you should know that (although it can be used like one), julia really is a (just ahead of time) compiled language.

Another point is that strings in julia are immutable, so creating a lot of them is pretty much guaranteed to slow down performance dramatically. A quick glance also tells me that you should maybe use modifying operations for your BigInt operations, and only copy when really necessary. +, * etc. by default copy, though there is an in-place API in Base.GMP.MPZ (I think that’s what it’s called).

4 Likes

I am afraid it might by difficult to help as we do not have the files tests/words.txt and tests/numbers.txt. Please make sure that the file with which you need some help can indeed be run by others.

3 Likes

Why don’t you have those? They are in the same repo.

EDIT: https://github.com/renatoathaydes/prechelt-phone-number-encoding/tree/julia/tests

Also, please check ./benchmarks.sh to know how i run the actual benchmarks if you really want to go that far.

@Sukera I couldn’t find out how to compile Julia code to binary (or whatever it compiles to). Can you post a link that can help me with that?

I see. At first I was just focused on the code in that single file (I just copy&pasted the code directly) and it did not occur to me that I must get the whole repo to make it run.

am I mistaken that benchmark_runner would run julia script.jl multiple times? Which means you’re measuring compile time every time?

I’ve done that yesterday :slight_smile:

This eventually calls

"julia src/julia/phone_encoder.jl"

via your benchmark_runner rust program, which just spawns the julia process and lets it run from scratch.


It’s a minor point to keep in mind once the pure algorithm has gotten faster, but it’s probably not insignificant nonetheless.

@jling yes, that’s part of the benchmark: it must run from the CLI.
If I can compile Julia first, please let me know!

not worth the work, just benchmark:

If you really want:

tbh if your work load is kind of light and this is the requirement, I don’t see why you would want to include Julia… this is like including the time to start MATLAB or something

5 Likes

In that case it’s going to be slow, no matter what.

There are also a lot of problems with the code itself, since you are new to the language, but as long as you are running it from the CLI, you have to spin up a Julia process (slow), compile the code (slow) and run it (fast).

As for your code itself, you can read the Performance Tips · The Julia Language

In particular, avoid global variables (like dict), and avoid creating a lot of arrays needlessly. Also BigInt is slow.

Is the end-product here just printing the numbers? In that case, can’t you just print digit-by-digit, instead of putting stuff in arrays?

2 Likes

In addition to other suggestions above
https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/julia/src/julia/phone_encoder.jl#L44-L54

julia> using BenchmarkTools

julia> function charToDigit(ch::Char)::UInt
           ch = lowercase(ch)
           ch == 'e' && return 0
           ch in ['j', 'n', 'q'] && return 1
           ch in ['r', 'w', 'x'] && return 2
           ch in ['d', 's', 'y'] && return 3
           ch in ['f', 't'] && return 4
           ch in ['a', 'm'] && return 5
           ch in ['c', 'i', 'v'] && return 6
           ch in ['b', 'k', 'u'] && return 7
           ch in ['l', 'o', 'p'] && return 8
           ch in ['g', 'h', 'z'] && return 9
           throw(DomainError(ch, "Not a letter"))
       end
charToDigit (generic function with 1 method)

julia> @benchmark charToDigit(c) setup = (c = rand(['a':'z'; 'A':'Z']))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):    5.214 ns …   1.915 μs  ┊ GC (min … max): 0.00% … 81.68%
 Time  (median):     131.793 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   146.142 ns ± 145.949 ns  ┊ GC (mean ± σ):  8.14% ±  8.71%

   █▇▆ ▂▃▆▁▃▆▆                                                   
  ▇███▆███████▇▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  5.21 ns          Histogram: frequency by time         1.17 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

If you replace arrays (which are mutable, heap-allocated) with tuples (which are immutable, stack-allocated) median time of this function goes down by an order of magnitude:

julia> function charToDigit(ch::Char)::UInt
           ch = lowercase(ch)
           ch == 'e' && return 0
           ch in ('j', 'n', 'q') && return 1
           ch in ('r', 'w', 'x') && return 2
           ch in ('d', 's', 'y') && return 3
           ch in ('f', 't') && return 4
           ch in ('a', 'm') && return 5
           ch in ('c', 'i', 'v') && return 6
           ch in ('b', 'k', 'u') && return 7
           ch in ('l', 'o', 'p') && return 8
           ch in ('g', 'h', 'z') && return 9
           throw(DomainError(ch, "Not a letter"))
       end
charToDigit (generic function with 1 method)

julia> @benchmark charToDigit(c) setup = (c = rand(['a':'z'; 'A':'Z']))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):   5.727 ns … 52.873 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.525 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.922 ns ±  5.315 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▆▄   █▃   ▆▄   ▁▅   ▃▄    ▃  ▁▁▃                        
  ██▂▁▂██▇▁▂██▆▁▇██▃▇▅██▆▃████▅▄███▆███▄▅▆█▅▄▃▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁ ▄
  5.73 ns         Histogram: frequency by time        29.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

As others have said, this kind of optimisation is completely dwarfed by Julia’s startup time, if you’re going to include it.

6 Likes

Please don’t repeat the julia foo.jl style execution many times.

If you stick to the julia foo.jl style execution, please include the values of the resources required for compilation in the results of benchmark tests for other languages.

In order to run benchmark tests of Julia, it is better to use BenchmarkTools.jl package of Julia.

Point: Do not measure all languages in the same way, but measure each language in a separate and appropriate way.

2 Likes

I think it’s ok to test how Julia performs as a CLI tool, but since we know there is a significant overhead, it’s important to highlight this, and try to show what is runtime and what is overhead.

Julia is kind of in a special place with its JAOT/JIT execution, so flat out comparisons without some explanation often feel a bit unfair, given how the language is normally used.

2 Likes

Here’s a version that is a bit faster (about 10x), and hopefully still correct (haven’t checked)

https://gist.github.com/jonathanBieler/de37a190d590297fc6b5d0ffee3c18dc

You just need to change the path in main() and then include in the REPL.

Biggest change was using in-place operations for BigInts, I haven’t found one for addition of small integers so I had to wrap it myself (was pretty easy though).

3 Likes

If it needs to run from the CLI, you might want to try something like DaemonMode.jl,
which will allow you to avoid restarting Julia each time you want to run your script.

https://github.com/dmolina/DaemonMode.jl

4 Likes

I posted a somewhat optimized version to Reddit (under the name viralinstruction) - IIRC it was about 6x faster, but there is still more to be done.

One important question when doing these optimizations is how much we are allowed to optimize, and it can get a little tricky. Can we change the algorithm, for example? What about small changes to the algorithm, like pushing and popping to the words vector as you describe in the blog post?

And how important is readability? Can we use external packages? Without these constraints, optimization games tend to go a bit off the rails and stop looking like good idiomatic Rust/Julia/whatever code.

Edit: And can we implement a custom type which is faster than BigInt, i.e. like a small static array that can be interpreted as a phone number? That would be faster, but… See, that’s the problem. In normal coding, at some point there is a voice that says “no, this is insane, and not worth the performance”. But when doing benchmark games, you get in this weird situation where everything is speed, with no other priorities.

3 Likes

Hey!

Thanks for all the answers, now I have a better understanding of why Julia was so slow in the benchmarks.

Notice that I also ran Java from the CLI while I could have used a daemon or even compiled it to a binary.

The Common Lisp code was executed via sbcl --script main.lisp which as in Julia, includes even the compilation step. I also compiled it to binary though, but the compilation time was so small it didn’t affect the results so I didn’t include that in the benchmark.

I agree though that if it’s possible to compile Julia first, I should try to do that… but still, the benchmark I want needs to run from the CLI, not from some kind of REPL, as I am interested in the performance from an end-user point of view, where my users don’t know or care which language my app is written in (maybe that’s not a target audience for Julia apps?).

Jakob, I think it’s important to maintain the “main” algorithm. Using a mutable list for solutions is fine as that doesn’t (in my opinion) change the “main” algorithm (let’s define the main algorithm as using some kind of lookup table to index words, then looking at each phone number digit by digit trying to locate matching words in the lookup table - everything else is not relevant). Keeping readability is desirable… the idea is to know how much pain you must go through to get code in your language to perform well! With Lisp, it was basically none: first solution is incredibly performant. With Rust and Java it did take effort.

Will find out soon about Julia (I suspect most people are right, the first version had very obvious mistakes that within a month of experience with Julia I wouldn’t have made, but if we need to get into obscure territory to get relatively good performance, then of course that’s a problem when choosing Julia, I think anyone will agree).

Thanks.

2 Likes

That’s right. The target for Julia in my opinion is primarily scientific computing, of course it can be used for lots of things, but when there are two competing interests and one choice is to do the thing that most people in scientific computing want, and one is to do the thing that say web app programmers or desktop app programmers want… the scientific computing case will be the one taken

The target audience is basically similar to the audience for Matlab, or R, or Fortran. In this context often we’re talking about individual computations taking minutes to months, so the startup time is of concern but less so than the run time of large tasks. Still there’s effort to reduce start-up time, just not if it will have a significant cost to the “real” runtime.

2 Likes

I should have mentioned: yeah, I made a mistake of running small inputs before… but my idea is to run the programs for around 1 minute or more so I really think the startup time overhead, as with Java , should disappear. If it doesn’t it must be really huge :smiley: .