Help to get my slow Julia code to run as fast as Rust/Java/Lisp

Your code isn’t very large, so I don’t see how startup/compilation should matter much for large lists of numbers.

I don’t know if your algorithm is inherently O(N) or O(N^2), but if it’s linear you should see a + b·N, with a quickly becoming insignificant. Right now though, [words; word] copies the whole accumulator on every call, so at least that part of your algorithm is O(N^2), which may or may not be a concern for the sizes you are looking at (see the sketch below).

So do the benchmarks indicate polynomial (e.g. quadratic) or exponential behaviour?
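
Here’s a toy sketch of the copying issue (hypothetical names, not the actual encoder):

use(words) = nothing                  # stand-in for the real per-call work

# Allocates a fresh copy of the whole accumulator on every call: O(length(words))
recurse_copy(words, word) = use([words; word])

# Amortized O(1) per call: mutate in place, then undo before returning
function recurse_push!(words, word)
    push!(words, word)
    use(words)
    pop!(words)                       # restore the caller's state
end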

1 Like

BTW, resize! is super efficient, it doesn’t really do any work, just changes the length property without modifying the underlying data.

I’m not sure, but this may also mean that the memory remains available, so that push! rarely needs to allocate anything later on. If this is not the case, you could experiment with sizehint! to keep that memory ‘on hand’ for later iterations.
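
For example (a minimal sketch):

words = Vector{String}()
sizehint!(words, 10_000)       # reserve capacity up front

for i in 1:10_000
    push!(words, string(i))    # rarely reallocates while within the hint
end

resize!(words, 0)              # shrink: just resets the length field
# if the capacity is kept around, later push!es won't allocate either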

Hmm, is this necessary?

@eval Base.GMP.MPZ begin
    add_ui!(x::BigInt, a::BigInt, b) = (ccall((:__gmpz_add_ui, :libgmp), Cvoid, (mpz_t, mpz_t, Clong), x, a, b); x)
    add_ui!(x::BigInt, b) = add_ui!(x, x, b)
end

These methods already exist, unless I’m blundering terribly.

:man_facepalming:

I’m kind of disappointed in the reaction this has gotten from our community over the past few hours. I feel a little bit responsible, since I mentioned the benchmarking tooling first thing, and even though I followed that up with constructive feedback on what could be responsible for the bad performance (beyond the high startup time), the thread still took a sour turn. I’m very sorry about that.

I honestly expected better here, and this isn’t (or SHOULDN’T be!) representative of how our community interacts with someone who’s clearly out of their depth when writing Julia code. I hope we as a community can do better in the future and respect that we are no authority on how other people want to run their code. Sometimes that’s due to external constraints, sometimes it’s simply how they prefer to do things. Either way, we shouldn’t harp on and on about how they’re doing it wrong when they’ve clearly stated why they’re doing it this way. If a point has been mentioned and addressed, there’s no need to pile on. If you don’t have anything constructive to add, please stay quiet.


I was asleep the past few hours, so I only now got to try running this benchmark for myself. I installed a recent Java version (openjdk-16-jdk), updated my Rust install (from 1.34.0 to 1.54.0), and took a closer look at the Julia code.

I ran into a little snag as soon as I tried to benchmark the original version from your repo (this is the julia-optimisations branch), to establish a baseline:

:prechelt-phone-number-encoding $ cd src/rust/benchmark_runner/                                       
:benchmark_runner $ cargo build --release                                                             
   Compiling benchmark_runner v0.1.0 (/d/Documents/Projects/prechelt-phone-number-encoding/src/rust/benchmark_runner)
error[E0432]: unresolved import `libproc::libproc::pid_rusage`                                                       
 --> src/main.rs:7:23                                                                                                
  |                                                                                                                  
7 | use libproc::libproc::pid_rusage::{pidrusage, RUsageInfoV4};                                                     
  |                       ^^^^^^^^^^ could not find `pid_rusage` in `libproc`                                        
                                                                                                                     
error: aborting due to previous error                                                                                
                                                                                                                     
For more information about this error, try `rustc --explain E0432`.                                                  
error: could not compile `benchmark_runner`                                                                          
                                                                                                                     
To learn more, run the command again with --verbose.

So I guess comparisons are on hold until I/we can fix this. I’m not a rusthead, mind you, so I’ll need a little help to resolve it.

Nonetheless, here are some more things I’ve noticed in the Julia code that probably don’t help performance:

  • join can take an io as its first argument. If you pass one, join doesn’t have to allocate a new String on invocation just to immediately write it to the io given to println.
  • print(String(take!(::IOBuffer))) is, in my eyes, an antipattern, because it has the same problem as join above: it allocates a new String just to print it out. A better approach is write(stdout, take!(::IOBuffer)), since that skips the new String and directly writes out whatever the IOBuffer holds. Benchmarking shows this still allocates (less than creating the String, though). Update: use write(stdout, seekstart(::IOBuffer)) instead (see the sketch after this list).
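
Putting both points together, something like this (a small sketch; words is a stand-in):

buf = IOBuffer()
words = ["an", "example", "encoding"]

join(buf, words, " ")           # join writes into the buffer, no temporary String
println(buf)                    # append the newline to the buffer as well
write(stdout, seekstart(buf))   # dump the buffer's bytes without a String copy
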
I've timed one run with 100_000 phone numbers in Julia with `@timev` and got the following result (Rust took ~0.46s).
  1.515357 seconds (4.79 M allocations: 175.866 MiB, 9.13% gc time, 0.55% compilation time)
elapsed time (ns): 1515357500                                                              
gc time (ns):      138356800                                                               
bytes allocated:   184409298                                                               
pool allocs:       2924124                                                                 
non-pool GC allocs:158                                                                     
malloc() calls:    785214                                                                  
realloc() calls:   1077855                                                                 
free() calls:      530567                                                                  
GC pauses:         4                                                                       
full collections:  1                                                                       

This is running

julia> versioninfo()                           
Julia Version 1.7.0-beta3                      
Commit e76c9dad42 (2021-07-07 08:12 UTC)       
Platform Info:                                 
  OS: Linux (x86_64-linux-gnu)                 
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64                                
  LIBM: libopenlibm                            
  LLVM: libLLVM-12.0.0 (ORCJIT, skylake)       
Environment:                                   
  JULIA_PKG_SERVER =                           
  JULIA_NUM_THREADS = 4

Fixing the `join` thing, I get this:
  1.509860 seconds (4.61 M allocations: 167.470 MiB, 9.10% gc time, 0.38% compilation time)
elapsed time (ns): 1509860100                                                              
gc time (ns):      137324100                                                               
bytes allocated:   175605512                                                               
pool allocs:       2744532                                                                 
non-pool GC allocs:156                                                                     
malloc() calls:    785214                                                                  
realloc() calls:   1077855                                                                 
free() calls:      562247                                                                  
GC pauses:         4                                                                       
full collections:  1                                                                       

At this point, my computer is running rather hot and starting to throttle, but I’ve done some more investigation and I think the fix by @DNF doesn’t work, because it only resize!s at the end of the function call, leading to wrong results (I checked with diff against the Rust version). Using pop! after each recursive call works, though.

timing from that last run, not representative though
vbogad@Taktikum:prechelt-phone-number-encoding $ diff <(sort out_jl_2.txt) <(sort out_rust.txt)
4516d4515                                                                                      
<   1.845223 seconds (8.10 M allocations: 280.115 MiB, 7.58% gc time, 0.21% compilation time)  
29827,29835d29825                                                                              
< bytes allocated:   293722264                                                                 
< elapsed time (ns): 1845223400                                                                
< free() calls:      631601                                                                    
< GC pauses:         6                                                                         
< gc time (ns):      139949100                                                                 
< malloc() calls:    785214                                                                    
< non-pool GC allocs:156                                                                       
< pool allocs:       6233025                                                                   
< realloc() calls:   1077855                                                                   

You can find the last code I ran in this gist. It includes other optimizations, like not recreating strings unnecessarily, and a call to @timev (which is where the output in that diff came from).

I’d recommend running on 1.7, since that alone seems to speed up your original code massively.

12 Likes

I would like to quickly comment on this, since I think that there is a fundamental misunderstanding.

It is not that people are “against using Julia outside its own REPL” but rather outside its already warmed up process.

There are many people working (devs, users) in the REPL, in VSCode with a running Julia process or Jupyter/Pluto notebooks with a kernel and that’s fine, it gives you a responsive environment. When it comes to large data processing on computing farms (PBS, Slurm, …), you need – with a few exceptions like MPI, Distributed.jl etc. – scripts and even containers (Docker, Singularity, …) etc. and there, Julia scripts are fairly common. I personally run Julia scripts on a daily basis :wink:

Why would you care about the execution time of a script when it’s just a few seconds? Because you probably run it thousands or millions of times, right?

I’d like to name just a simple and stupid example which I encounter often: let’s say you want to use a command line application to process a bunch of files. If the overhead to get the programme ready is large compared to the overall execution time, it makes sense to “iterate” within the script and not externally.

This means that instead of

$ for file in *.txt; do
    julia [options] "$file"
  done

You would do

$ julia [options] *.txt

or even

$ julia [options] /path/to/folder/containing/all/files

I am sorry if this is too obvious, but I have seen this in the wild many times: the actual parameter scanning happens in the shell, rather than being defined in a nice configuration file that is passed to a Julia programme, which then takes care of launching all the parameter combinations.

The same holds for doing

$ julia [options] $file > output.txt

instead of

$ julia [options] -o output.txt $file

and let Julia handle the output.txt generation in an effective way. It even gives you the ability to write it in a more abstract way, so that you can change the output file format support later with ease, if you e.g. realise that using a binary output (HDF5, Arrow…) is more efficient. The users can then simply do -o output.hdf5 and Julia picks the right routine to generate the output.
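
As a sketch of that last point (the extension handling and the write_hdf5 helper are made up):

# Hypothetical: pick the output routine from the file extension, so users
# can pass -o output.txt or -o output.hdf5 and get the right format.
function save_output(path::AbstractString, results)
    ext = lowercase(splitext(path)[2])
    if ext == ".txt"
        open(path, "w") do io
            foreach(r -> println(io, r), results)
        end
    elseif ext == ".hdf5"
        write_hdf5(path, results)   # made-up helper, e.g. backed by HDF5.jl
    else
        error("unsupported output format: $ext")
    end
end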

To start off with more sophisticated command line tools written in Julia, I recommend trying

4 Likes

Ah, yeah, I see now that we don’t want to keep growing the list for each iteration. I think there’s still some advantage to be had by push!/pop!, especially if the recursion depth becomes large.

Does pop! work the same as resize!, btw? It would be good if the memory isn’t reclaimed each time.

Right, I missed them in the source since they are defined with metaprogramming (and the library is mostly undocumented, right? I haven’t found any docs). So @renatoathaydes, I think you can remove this @eval Base.GMP.MPZ begin block and it should work the same.

1 Like

PackageCompiler is not how a large part of the Julia community uses Julia on a day-to-day basis, but for what you want to do - use Julia from the CLI with startup time included - it is an absolute necessity. If I were going to deploy code to production and didn’t need the flexibility of rapid iteration, I would do this.

There’s a lot to discuss on why that is needed and how it differs from static ahead-of-time compilation in other languages. The gist is that Julia’s combination of dynamism and native code generation creates unique situations such as code invalidation. By defining or overriding new methods, not only do you create new code to compile but you can also invalidate previously compiled code forcing recompilation of library code.
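
A toy illustration (my example, not from the thread):

area(x) = x * x               # generic method
total(v) = sum(area, v)       # specializes on area(::Int) when compiled

total([1, 2, 3])              # 14 -- total is compiled here

area(x::Int) = 2x             # new, more specific method: the compiled
total([1, 2, 3])              # total is invalidated, recompiled; returns 12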

If you want to get a “byte code” equivalent, there are steps you can take in that direction too, such as putting your code into a module and adding a Project.toml with a UUID. That would at least give Julia the chance to cache lowered and type-inferred code, which gets saved into .ji files in your Julia depot (JULIA_DEPOT_PATH). PackageCompiler will take care of some of this for you, but it’s still good form to do it.
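
For instance (PhoneEncoder is a made-up name):

# Creates PhoneEncoder/Project.toml (with name + UUID) and
# PhoneEncoder/src/PhoneEncoder.jl; loading it with `using PhoneEncoder`
# then lets Julia cache lowered/inferred code in .ji files in the depot.
using Pkg
Pkg.generate("PhoneEncoder")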

You should definitely consider keeping the code in a main function, or maybe a module’s __init__. You’re using non-constant globals, which creates issues for type inference and compilation: a non-constant global can change type at any time, so the compiler can’t generate fast, concretely-typed code for functions that read it.
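
A minimal sketch of the difference (names made up):

words = String[]              # non-constant global: its type could change anytime

slow_len() = length(words)    # compiler can't assume words stays a Vector{String}

function main(path)
    words = readlines(path)   # local with a concrete, inferable type
    println(length(words))
end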

Using @eval to “patch” Base.GMP.MPZ is probably not great, and it’s not clear to me that you need to do it that way. You could probably just import the functions into your new module and extend them there. By @eval-ing into Base.GMP.MPZ you’re probably invalidating a lot of precompiled code in the system image.

Here’s some more general tips:

https://docs.julialang.org/en/v1/manual/performance-tips/

Just to summarize:

  1. Use modules and add a Project.toml to facilitate caching of code lowering, type inference, and various precompilation steps.
  2. Use PackageCompiler to compile to native code to avoid dynamic (re)compilation issues.
  3. Don’t @eval into other modules.
  4. Put your code into functions and don’t use global scope.
1 Like

Yeah, it’s hard to figure out the right functions to use. There’s Integer Functions (GNU MP 6.2.1), which can be useful, but it’s still a bit bewildering.

If BigInt allocations become an issue, then perhaps n should be propagated into each call, and reset to 1 with set_si!(n, 1).
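
Something along these lines, say (a rough sketch using the undocumented Base.GMP.MPZ internals; encode_all and the digit-vector input are made up):

using Base.GMP: MPZ

function encode_all(numbers)          # numbers: a collection of digit vectors
    n = BigInt()                      # one allocation, reused for every number
    for ds in numbers
        MPZ.set_si!(n, 1)             # reset to 1 without allocating
        for d in ds
            MPZ.mul_ui!(n, 10)        # n *= 10, in place
            MPZ.add_ui!(n, d)         # n += d, in place
        end
        # ... use n for this number before the next iteration resets it ...
    end
end

encode_all([[5, 6, 2], [4, 8, 2]])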

2 Likes

Only if startup/overhead is dominating, surely?

1 Like

100%. They both call Base._deleteend! internally, which calls into jl_array_del_end in C. It’s just a semantic difference: pop! has a shrink size of 1 hardcoded, while resize! lets the user pick the new length.

1 Like

If the comparison is to Rust and Java, then yes. What’s the downside? Are we benchmarking build time or binary size as well? We just invite more complications and nuance by avoiding PackageCompiler.jl.

On my laptop, I’m seeing

big"1"

being a lot more efficient than

BigInt(1)

I even see

jl> function bar(v)
        s = big"0"
        for val in v
            Base.GMP.MPZ.add_ui!(s, val)
        end
        return s
    end
bar (generic function with 1 method)

jl> @benchmark bar(v) setup=(v=rand(1:1000, 100))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     260.000 ns (0.00% GC)
  median time:      266.176 ns (0.00% GC)
  mean time:        279.102 ns (0.00% GC)
  maximum time:     1.329 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     340

Zero allocations?! What’s that about? Just running the allocation itself, big"0" is 30x as fast as BigInt(0). Not a big deal in the total picture, but if there is a lot of initialization going on, it may matter a bit.

Does big"0" reuse some global variable somehow?

Unless we’re certain that we can’t optimize the original code further, what’s the point in bringing the startup time down? Focus on the algorithm first, then we can try to eliminate startup overhead that’s the same for all algorithms, no matter how bad.

1 Like

Yes, big"0" reuses the same variable: the big"..." string macro runs when the code is parsed, so the resulting BigInt is spliced into the compiled code as a single shared object. BioSequences.jl uses a similar dna"TAGCA" with similar effects - there is more information about the issue here: Constructing sequences · BioSequences.jl

@DNF pinging you here, due to your later comment.

Figured it out, this is the best way to avoid allocations when writing from an IOBuffer:

a = IOBuffer()
...
# writing to it
for _ in 1:100
    print(a, "a")
end
write(stdout, seekstart(a))

Necessary because IOBuffer only has one position pointer for both reading and writing :man_facepalming: With this and the join optimization, I think the only things left are the unnecessary creation of strings in the recursion after calling nthDigit, and the BigInt stuff.
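
On the string side, one way to avoid those allocations is to recurse on views (my sketch, not the repo’s code):

# SubString shares the parent string's memory, so peeling off the leading
# character doesn't allocate a fresh String at every recursion level.
function count_encodable(s::AbstractString)
    isempty(s) && return 0
    rest = SubString(s, nextind(s, firstindex(s)))   # a view, not a copy
    return Int(isdigit(first(s))) + count_encodable(rest)
end

count_encodable("562-482")   # 6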

It also appears that mul_ui! is several thousand times slower than add_ui!. Maybe there’s some way of speeding up multiplication by 10? I guess multiplication tends to ‘grow’ the number, while addition rarely does. A sizehint! for BigInt could be good, perhaps?

But before diving into this, did anyone do profiling, to see what the actual bottlenecks are?

Wow! Is this expected?

function foo(N)
    s = big"1"
    for i in 1:N
        Base.GMP.MPZ.mul_ui!(s, 10)
    end
    return s
end

jl> foo(1)
10

jl> foo(1)
100

jl> foo(1)
1000

jl> foo(1)
10000

This fixes the problem, and speeds mul up dramatically, but now I’m wondering what sort of side-effects are happening here:

function foo(N)
    s = Base.GMP.MPZ.set_ui!(big"1", 1)
    for i in 1:N
        Base.GMP.MPZ.mul_ui!(s, 10)
    end
    return s
end
2 Likes

I don’t think so - I for one have so far only focused on the string-creation side, because of Julia’s immutable strings. Creating new strings left and right is usually a big bottleneck in these kinds of benchmarks, so I tried to avoid that.

You can’t use it in the real algorithm though, because that one assumes an independent n for each recursive call. I’m not sure how this is solved in the Rust or Java versions; I haven’t looked at them.

Replacing the BigInt with two Int128s and allocating fewer arrays (while printing and during the nested call to printTranslations) brings down runtime by a factor of ~15 for me:

j1 -O1 src/julia/phone_encoder.jl dictionary.txt input.txt  0.95s user 0.69s system 252% cpu 0.652 total
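
Out of curiosity, here’s my own rough guess at what a two-Int128 representation could look like (hypothetical, the actual branch may well differ): two base-10^37 limbs give 74 decimal digits of headroom, comfortably more than the benchmark’s phone numbers need.

const LIMB_BASE = Int128(10)^37   # limb * 10 + 9 still fits in an Int128

struct Wide
    hi::Int128
    lo::Int128
end

# n * 10 + d, carrying overflow from the low limb into the high limb
function append_digit(n::Wide, d::Integer)
    lo = n.lo * 10 + d
    return Wide(n.hi * 10 + lo ÷ LIMB_BASE, lo % LIMB_BASE)
end

function encode(s::AbstractString)
    n = Wide(0, 0)
    for c in s
        isdigit(c) && (n = append_digit(n, c - '0'))
    end
    return n
end

encode("5624-82")   # Wide(0, 562482)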

I haven’t read through all of this thread though, so dunno how this compares to the optimizations others have come up with.

2 Likes