Broadcast function performance of Julia vs Numpy vs Nd4J

Hi everybody!

Recently I noticed that in Julia, applying a function to a matrix is a bit (but consistently) slower than in Numpy (Python) or Nd4J (Java), which surprised me a bit because I believe they all use BLAS under the hood (but maybe I'm wrong). Are there some flags to improve that (e.g. an optimization in the JIT compiler), or maybe I'm doing something wrong?

Julia

julia> a=zeros(300,400)
julia> @time for i in 1:10_000; exp.(a); end
 12.050691 seconds (40.00 k allocations: 8.941 GiB, 5.82% gc time)

Numpy

from time import time
import numpy as np

m = np.ones((300, 400))
t0 = time()
for i in range(10_000):
    np.exp(m)
print(time() - t0)

# OUTPUT: 6.567444801330566

Nd4J

import org.nd4j.linalg.api.ops.impl.transforms.Exp
import org.nd4j.linalg.factory.Nd4j
import kotlin.time.ExperimentalTime
import kotlin.time.measureTime

@OptIn(ExperimentalTime::class)
fun main() {
    val a = Nd4j.zeros(300, 400)
    val dt = measureTime {
        for (i in 1..10_000)
            Nd4j.getExecutioner().execAndReturn(Exp(a))
    }
    println(dt)
}

//OUTPUT: 5.697226100s

It depends on the installation, but Numpy often uses Intel's closed-source vectorization libraries for things like this. Julia's exp function is written mostly with scalar performance in mind, not broadcasting performance, but packages like LoopVectorization.jl can easily fix that:

using BenchmarkTools, LoopVectorization
a = zeros(300, 400);
julia> @benchmark exp.($a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  313.740 μs … 861.920 μs  ┊ GC (min … max): 0.00% … 55.36%
 Time  (median):     322.050 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   330.233 μs ±  48.779 μs  ┊ GC (mean ± σ):  2.03% ±  6.97%

   █▄▃▁▁                                                        ▁
  ██████▇▆▅▃▄▄▁▁▄▄▄▃▃▃▃▃▁▁▁▃▁▁▁▄▃▁▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▃▁▄▅▆▆▆▆▆▆▆▅▆▄ █
  314 μs        Histogram: log(frequency) by time        644 μs <

 Memory estimate: 937.55 KiB, allocs estimate: 2.
julia> @benchmark @turbo exp.($a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   91.170 μs … 520.210 μs  ┊ GC (min … max): 0.00% … 77.10%
 Time  (median):      95.495 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   101.293 μs ±  33.370 μs  ┊ GC (mean ± σ):  4.44% ±  9.41%

  █▇▆▄▂                                                         ▂
  ██████▇▅▄▄▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▁▁▄▃▄▄▄▄▄▆▆▇▇█ █
  91.2 μs       Histogram: log(frequency) by time        325 μs <

 Memory estimate: 937.55 KiB, allocs estimate: 2.

I should also mention that the fact that Numpy has specific functions (which are really just foreign function calls) that can outperform specific Julia functions does not really say anything meaningful.

The important difference is that it’s possible to write large-scale, complex, and novel code in pure julia which can have top tier performance. That’s often just not possible in Python because Python’s model of stringing together fast C code with slow glue code can introduce highly problematic performance bottlenecks.
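As a rough sketch of what I mean (the function f here is just made up for illustration): in Julia you can write an arbitrary scalar function yourself and broadcast it, and the whole expression compiles into a single fused loop with no temporaries for the intermediate steps, whereas composing the same thing out of Numpy ufuncs would typically allocate an intermediate array for each operation.

# A custom scalar function written in plain Julia (made up for this example)
f(x) = exp(x) * sin(x) + one(x)

a = zeros(300, 400)

# One fused loop over `a`; the intermediate exp(x) * sin(x) never materializes
# as a full temporary matrix.
b = f.(a)

# In-place variant: reuse an output buffer so the hot loop allocates nothing.
out = similar(a)
out .= f.(a)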


For new folks, whipping out BenchmarkTools.jl can seem like moving the goal posts a bit, although BenchmarkTools.jl is absolutely the right way to do it. That said, I just wanted to demonstrate that you can observe the acceleration with just plain @time as well.

# Package setup
begin
    using Pkg
    Pkg.activate(; temp = true)
    Pkg.add("LoopVectorization")
    using LoopVectorization
end

# Method definitions
begin
    const a = zeros(300, 400);

    g() = for i in 1:10_000; exp.(a); end;
    h() = for i in 1:10_000; @turbo exp.(a); end;

    precompile(g, ())
    precompile(h, ())
end

julia> @time g()
 10.709063 seconds (20.00 k allocations: 8.941 GiB, 6.86% gc time)

julia> @time h()
  4.851597 seconds (20.00 k allocations: 8.941 GiB, 13.41% gc time)

For comparison, here is the Python version on my machine:

In [9]: t0 = time()
   ...: for i in range(10_000):
   ...:         np.exp(m)
   ...:
   ...: print(time() - t0)
6.147451400756836

You may be wondering what LoopVectorization.@turbo does. It is a macro that lets Julia use an AVX-accelerated version of exp.

julia> @macroexpand @turbo exp.(a)
:(let var"%1" = Base.broadcasted(exp, a)
      LoopVectorization.vmaterialize(var"%1", Val{:Main}(), LoopVectorization.avx_config_val(Val{(true, 0, 0, 0, true, 1, 1, true)}(), static(0)))
  end)
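As an aside, most of the allocations and GC time in the timings above come from the fresh output matrix that each exp.(a) call creates (937.55 KiB per call, 10,000 times). Here is a minimal sketch of reusing a preallocated buffer instead (the helper name exp_turbo! is just made up for this example):

using LoopVectorization

# Write exp.(a) into a preallocated buffer so the hot loop allocates nothing.
function exp_turbo!(out, a)
    @turbo for i in eachindex(a)
        out[i] = exp(a[i])
    end
    return out
end

a = zeros(300, 400)
out = similar(a)
exp_turbo!(out, a)  # same values as exp.(a), without a new array per call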

This is no longer a problem in the modern Python ecosystem. Popular Python packages like JAX solve exactly this problem.

JAX lets you just-in-time compile your own Python functions into XLA-optimized kernels using a one-function API. Compilation and automatic differentiation can be composed arbitrarily, so you can express sophisticated algorithms and get maximal performance without leaving Python

That’s great for Jax code, but Jax is really better thought of as a separate language from Python with the same syntax and subtly different semantics.

The vast majority of the Python ecosystem is incompatible with Jax.


Definition of h is missing


And also, where is the mentioned LoopVectorization.@turbo actually used?

Good point, I forgot to include the definition of h(). I've edited the post to include it. You can also see it in Mason's example above, though.

You’ll also want to do using LoopVectorization after the Pkg.add("LoopVectorization")


BLAS is a specification with many implementations whose performance depends on the architecture. AVX and its variants are also only relevant to some architectures. So it’s important to specify your hardware, and it’s down to how well the hardware is used, not you “doing something wrong” in software at a high level.
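For what it's worth, a quick way to report that information alongside a benchmark (assuming Julia 1.7 or later for BLAS.get_config()):

using InteractiveUtils, LinearAlgebra

BLAS.get_config()            # which BLAS library and build Julia is actually calling
versioninfo(verbose = true)  # Julia version plus CPU model, cores, and threads
Sys.CPU_NAME                 # LLVM's name for the CPU microarchitecture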

I’ve heard that Intel’s recent AVX-512 work has been leveraged in Numpy for performance improvements, and the mention of @turbo using an AVX-accelerated exp reminded me of that. But I don’t know if that’s the explanation for the performance discrepancy here, nor which AVX variants Julia or LLVM use or how; I know very little about Julia’s vectorization implementation. I don’t even know why @turbo has not replaced @simd as standard-issue vectorization.

I would guess it probably has to do with the fact that it does not yet work as generically as @simd does. For example, the macro fails if one uses custom structs. As long as @turbo needs to fall back to @inbounds @simd in some cases, it will be hard for it to fully replace @simd. That being said, for me even the edge cases where LoopVectorization errored seemed to profit from rewriting the code in a form that LoopVectorization prefers. It occasionally forces me to reorganize my code to utilize the cache better, whereas with @simd I often would have just stopped as soon as the code compiled.
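For instance, something like the following (a made-up struct, just to illustrate the point) compiles with @inbounds @simd, while, per the above, @turbo errors on the custom element type:

# Made-up struct and helper, only to illustrate the @simd vs @turbo point.
struct Point
    x::Float64
    y::Float64
end

function total_x(ps::Vector{Point})
    s = 0.0
    @inbounds @simd for i in eachindex(ps)
        s += ps[i].x
    end
    return s
end

total_x([Point(rand(), rand()) for _ in 1:1_000])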

So @simd works on custom structs?? I thought it only worked for certain numerical types, hence the usage of struct-of-arrays over array-of-structs in StructArrays.jl.

To be honest, I only remember that it compiles, but I'm not sure if I ever checked the @code_llvm it generates in this case. If I am serious about performance, I always use LoopVectorization now.