Help to get my slow Julia code to run as fast as Rust/Java/Lisp

jzr · August 5, 2021, 2:29am

While we’re on the subject, where do I go to follow the “quick-startup small standalone executable” developments? Is that conversation happening in PackageCompiler, StaticCompiler, GPUCompiler, elsewhere?

StefanKarpinski · August 5, 2021, 2:31am

PackageCompiler is the most standard way to go. Nobody said anything about “small” here — you snuck that requirement in. Every time I press someone on why they need a small executable I get unconvincing reasons.

jzr · August 5, 2021, 2:59am

I’m not personally hankering for it at the moment, but maybe worth mentioning some of the reasons that are out there:

Overall, it depends how small.

Downloading 10s or 100s of MB over the network is slow. E.g. if the program is running from an NFS filesystem, or downloading in a browser, or downloading in an OS package manager.

At smaller scale, I’ve heard (but haven’t measured myself) that programs can get a speed boost if they fit in the L1 cache. Then there’s embedded systems with limited memory.

Ronis_BR · August 5, 2021, 3:19am

In my case, I have a simulator with a complete satellite attitude control software written in Julia. The simulator is divided in “world simulation” and “embedded software”. The latter contains all the functions that we need to program into the satellite computer. It would be awesome to generate compiled code from that part to be used as the firmware, with only minor modifications in the interfaces. The problem is that rad-hard memory is expensive. One famous computer used a lot has only 256MB to store everything.

I know it is a super specialized example but I hope I could provide one application that will benefit if Julia can create small binaries

dlakelan · August 5, 2021, 3:59am

The main reason I can think of for small is embedded systems. Of course as of now I don’t think you can run Julia on something like a MIPS based router at all? But these days even routers are moving to ARM with 512MB or more of RAM. My core router is an RPi4 with 4GB. If I needed to run a Julia program to analyze statistics of say latency or bandwidth usage I could do it no problem on that platform.

renatoathaydes · August 5, 2021, 5:38am

I see that some people are trying to help by running benchmarks using the wrong input files. The default files are used to manually verify your program is printing something that makes sense, not for benchmarking. The benchmark script, if you look at it, generates input files of several sizes and measures programs against those (I needed a custom tool because I wanted to measure not just speed but also memory usage). The new benchmark even downloads another dictionary to run programs against.

Also, as I’ve already mentioned, the last blog post was based on another benchmark which ran the programs for minutes. I will not run those benchmarks with Julia because we found out that much of the time on very large inputs was being spent on printing out solutions rather than finding them. That’s why I said I will first try to reduce the number of solutions found by modifying the algorithms before I attempt measuring anything again.

It is highly surprising to me that a lot of people here are against using Julia outside its own REPL (even other people doing it, apparently, even if that’s not how they want to do). As if the only users of Julia programs are Julia programmers. This is making me consider moving away from anything related to Julia because I, as mentioned already, have a desire to write applications for anyone, not for other programmers.

Maybe my problem is not worthy of the greatness of Julia. Sorry for taking your time.

jzr · August 5, 2021, 6:20am

I thought that was a joke (or at least tongue-in-cheek)?

Anyway, people are evidently willing to help. But many of us don’t have java installed. Is there another way to obtain the relevant benchmark data file?

renatoathaydes · August 5, 2021, 6:34am

@jzr why you think this is a joke? It seems people are very serious I should not run Julia outside the shell.

To run the benchmarks you need to checkout the repo. The input files are generated by a Java program so you need JDK16 installed.

EDIT: Actually , the large benchmark doesn 't require Java. You can run /benchmark-english-dict.sh after uncommenting-out Julia, and comment out other languages if you want… that one will run for a long time and in Julia it never ends from what I see.

You can comment out all other languages.

I just ran the “small” benchmark (despite warnings I need to improve the benchmarks!!), here’s what I got running the highly optimised Julia program posted above when compared to Rust:

Generating inputs
Checking all programs for correctness
Checking: ./phone_encoder
OK
Checking: julia src/julia/phone_encoder.jl
OK
Benchmarking...
===> ./phone_encoder
Proc,Memory(bytes),Time(ms)

./phone_encoder,23547904,101
Proc,Memory(bytes),Time(ms)

./phone_encoder,23531520,100
Proc,Memory(bytes),Time(ms)

./phone_encoder,23564288,141
Proc,Memory(bytes),Time(ms)

./phone_encoder,23576576,315
Proc,Memory(bytes),Time(ms)

./phone_encoder,23572480,538
===> julia src/julia/phone_encoder.jl
Proc,Memory(bytes),Time(ms)

julia,221229056,1074
Proc,Memory(bytes),Time(ms)

julia,230445056,1113
Proc,Memory(bytes),Time(ms)

julia,300412928,1605
Proc,Memory(bytes),Time(ms)

julia,405135360,3834
Proc,Memory(bytes),Time(ms)

julia,451584000,6331

The Rust program runs in 0.5 seconds , but Julia takes over 6 seconds on the larger one.

I understand Julia may have a high startup overhead, but this is running for several seconds… it shoud start to catch up like Java does? The times are exponential so running my large benchmark on Julia will never end if the trend continues…

renatoathaydes · August 5, 2021, 6:36am

If anyone wants to try it:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/tree/julia-optimisations

Run ./benchmarks.sh

If you have a few hours to spare, run also ./benchmark-english-dict.sh after adding Julia to it.

DNF · August 5, 2021, 7:11am

Is this really an appropriate tone after several people have put in quite a bit of time trying to help you?

Some people have warned against the overhead of running from the CLI, but several have also said this is a reasonable thing to do, especially for long-running tasks.

As far as I can tell, there are two distinct discussions going on here. One is about speeding up your benchmarks. And then there’s a second one that is tangentially related, about the suitability of Julia for CLI use. This is pretty common, and there’s no reason to take it personally.

skleinbo · August 5, 2021, 7:34am

I believe it’s better to use a main function. Might not be a major performance point here, but by wrapping the code, the compiler should be able to infer that dict is a dictionary. Many performance issues in Julia stem from type inference issues.

74 function main()
 75     dict = open(isempty(ARGS) ? "tests/words.txt" : ARGS[begin]) do file
 76         loadDictionary(file)
 77     end
 78     io = IOBuffer()
 79     open(length(ARGS) < 2 ? "tests/numbers.txt" : ARGS[begin+1]) do file
 80         for num in eachline(file)
 81             printTranslations(io, dict, num, filter(isdigit, num))
 82             print(String(take!(io)))
 83         end
 84     end
 85     close(io)
 86 end
 87
 88 @time main()
 89 @time main()

I ran that file with time julia --startup-file=no src/julia/phone_encoder.jl dictionary.txt input.txt, i.e bypassing your benchmark_runner. Takes about 1.2s (on my machine) in total. Each execution of the main routine however takes about 0.2s. This should give you an idea of startup overhead.

~~Unfortunately I cannot generate the various dictionaries, because I can’t compile the Java classes since OpenJDK is only available in Version <=14 on Ubuntu 20.04.~~

genkuroki · August 5, 2021, 7:37am

The Julia counterpart to a UNIX-style shell is not a shell, but the REPL of Julia.

Just as UNIX users use a shell to combine multiple commands, Julia users use REPL to combine multiple functions.

I think something similar to this can be found in MATLAB, R, and Mathematica. (They are very different from Java and Rust.)

For many of these users, it is difficult to get around the UNIX-style command line interface, and it is easier to use REPL and its extensions.

Given this reality, I think the assumption that ordinary users can use the command line interface from the shell is not always realistic, needs to be reconsidered, and is not fair unless the use of REPL and its extensions is included in a realistic option.

However, it is reasonable enough to run a Julia scrip from a UNIX-style CLI to perform a calculation that takes more than one minute. I’ve written a Jupyter notebook to help someone optimize for that.

I’m sure Julia users would love such an optimization game.

renatoathaydes · August 5, 2021, 8:01am

You used the files used to verify the program runs correctly, not the benchmark files. The first run of my bash script uses files that are in the same order of magnitude and they show Julia taking 1.074 seconds, which is kind of in agreement with what you see on the total time including startup.

Notice that each run of my benchmark uses different inputs sizes, as I explain in the blog posts and you can see in the bash script. The sizes are 1,000, 10,000, 50,000 and 100,000. Julia goes from 3 seconds with 50,000 phone numbers to 6 seconds with 100,000 phone numbers. That cannot be explained by startup overhead.
You can use whatever timing tool you want, just make sure to use the different file sizes (I recommend porting the PhoneNumberGenerator.java file to Julia if you can’t run Java as it’s very easy). If anyone wants to install Java 16, it takes a second on Mac, use sdkman: https://sdkman.io/ then run:

sdk install java 16.0.2-open

PS. About my tone: didn’t mean to be snarky… just honest: people are suggesting I should only use Julia on problems that are somehow worth of it and that maybe my problem should just stick with Java. To those who are genuinely trying to help me get a fair view of Julia so I don’t go away with the wrong impression, I apologise.

renatoathaydes · August 5, 2021, 8:03am

Correction: sdkman (to install Java) works on Mac AND Linux, just not on Windows unless you use a POSIX compatible shell.

renatoathaydes · August 5, 2021, 8:08am

Also, the large benchmark which I referred to above uses a single phone number of length 23 and a larger dictionary with 100,000 words.
This makes all programs take minutes to run, I can’t get Julia to run on that benchmark though as it’s just taking too long.

One more thing: as I already mentioned, this big benchmark requires printing around 2BG of solutions, which makes printouts a bottleneck (unlike in the smaller benchmark which has only thousands of solutions).
Just wanted to try to make this completely clear. I will alter the algorithm to reduce the possible number of solutions in order to avoid this problem in the future.

Not sure I will include Julia in the study because of the issues we’re seeing here, unfortunately.

Others should feel free to run their own studies though if you don’t agree with my methodology.

Regards.

DNF · August 5, 2021, 8:35am

Not sure if this has already been mentioned, but I see that you do

[words; word]

in two places inside printTranslations. This could be quite costly, as it allocates a new array over and over. I think there is a way you could use push!(words, word) instead, which can be much faster. If I understand correctly, you do not want to keep word around inside words after you go back up out of printTranslations, but you could fix that by calling resize! at an appropriate time.

Would this work?

function printTranslations(io, dict, num, digits, start=1, words=String[])
    if start > length(digits)
       return println(io, num, ": ", join(words, " "))
    end
    foundWord = false
    n = BigInt(1)
    Nwords = length(words)  # remember this
    for i in start:length(digits)
        Base.GMP.MPZ.mul_si!(n, 10)
        Base.GMP.MPZ.add_ui!(n, nthDigit(digits, i))
        for word in get(dict, n, emptyStrings)
            foundWord = true
            printTranslations(io, dict, num, digits, i + 1, push!(words; word))
        end
    end
    if !foundWord &&
        !(!isempty(words) && length(words[end]) == 1 && isdigit(words[end][begin]))
        printTranslations(io, dict, num, digits, start + 1, push!(words, string(nthDigit(digits, start))))
    end
    resize!(words, Nwords)  # here we ditch all the appended words
end

This is not at all tested, I just wrote the code directly here.

Edit: Also worth noting is that push! is O(N) for N = number of found words, while [] is O(N^2), so it becomes increasingly significant for large numbers of words. This could be part of the poor scaling you are seeing, and maybe there are other things like this going on in other parts of the code.

mkitti · August 5, 2021, 8:37am

For what you are trying to do, it definitely would be appropriate to use PackageCompiler to do ahead----of-time compilation since you are not including compilation time for any of the other cases.

Here’s a direct link to the documentation on how to do that. This includes creating a binary executable by using a bit of C to embed Julia.
https://julialang.github.io/PackageCompiler.jl/dev/devdocs/binaries_part_2.html

Have you tried this yet?

renatoathaydes · August 5, 2021, 8:54am

@DNF you’re correct, we can just mutate the list if Julia is slow to create immutable lists as the original code tried to do (and it’s what was done in Lisp and it was very efficient there - could this code have used linked lists maybe to make it more comparable?).

You just need to pop the word after the recursion returns which seems that you’ve forgotten to do?

DNF · August 5, 2021, 8:56am

On the last line I do resize!(words, Nwords), because it looks like you shouldn’t pop each word every time, you want to propagate it into the recursive call. Or is that wrong?

Arrays aren’t immutable, btw. (Edit, oh you meant you tried to make immutable lists first?)

renatoathaydes · August 5, 2021, 8:57am

Hm… some people said above it would not be a good idea to use PackageCompiler, I am getting mixed messages! Anyway, I will try this after work because I am sure Julia can run fast, and I really want to understand why it’s being just so slow even on larger inputs and it’s linearly increasing with input size instead of exponentially getting a faster rate on larger inputs as you would expect if startup overhead was to blame!!

Also, notice that CommonLisp benchmarks ran from source as well and also had to compile first. And Java ran from bytecode , not native code either (which means it runs interpreted until it’s JIT’ed, hence the huge increase in performance on larger problems, which I expected to see also in Julia but failed so far).

Topic		Replies	Views
Funny Benchmark with Julia (no longer) at the bottom Performance benchmark	149	5972	November 4, 2023
Code slower than Rust equivalent New to Julia performance , rust	17	1323	October 8, 2021
Comparison of Rust to Julia for scientific computing? Offtopic question , rust	123	33775	July 17, 2023
Julia slower than Python to sort and reverse a list of integers Performance	40	2567	April 28, 2023
Why is julia slowest here? Performance performance	4	1132	March 6, 2022

Help to get my slow Julia code to run as fast as Rust/Java/Lisp

Related topics