Comparing exp() performance on Julia versus numpy

So I was actually preparing a demo to show off Julia to colleagues when I encountered this issue.

I simply want to compute the exponential of an array. In Julia (v1.7.2):

using BenchmarkTools

arr = rand(100_000)
@benchmark exp.($arr)

# Time (median):    576.806 μs (GC median: 0.00%)
# Memory estimate:  781.30 KiB, allocs estimate: 2.

Compared to Python:

import numpy as np
import timeit

arr = np.random.rand(100000)
timeit.timeit(lambda: np.exp(arr), number=100000) / 100000
# 0.0001027145623601973

So Python comes out faster by roughly a factor of 5–6.

Of course I’ve also tried wrapping the Julia dot call inside a function, as well as manually looping myself, e.g.:

function myfunc(arr)
    result = similar(arr)
    @inbounds @simd for i in eachindex(arr)
        result[i] = exp(arr[i])
    end
    return result
end

@benchmark myfunc($arr)
# Time  (median):     569.613 μs

…which gives essentially the same timing as the dotted version.

Any idea what’s going on here? Some obvious sanity checks: both are Float64, neither uses fastmath, both allocate a new array, and both are single threaded.
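
For reference, one quick way to check at least the Julia side (just a sketch; the numpy side has to be checked from Python):

eltype(arr)          # Float64, same element type as the numpy array
Threads.nthreads()   # 1: plain broadcasting doesn't multithread anyway, but this
                     # confirms the session itself was started single-threaded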

Edit: I see now that you said no multithreading.

But are you actually certain that numpy is not implicitly doing multithreading under the hood?

@DNF I checked that: both are single threaded.

I’m no good at reading LLVM code, so I cannot confirm, but it might be that Julia fails to SIMD the computation. If you use LoopVectorization.jl, it works:

julia> using BenchmarkTools, LoopVectorization

julia> x = rand(100_000);

julia> @btime exp.($x);
  503.300 μs (2 allocations: 781.30 KiB)

julia> @btime @turbo exp.($x);
  91.100 μs (2 allocations: 781.30 KiB)

Yeah, I think that’s it! I see near-identical timings between numpy and Julia with @turbo. @turbo also speeds up my own explicit function, even though it already included a @simd macro.

So I guess the next question is: why doesn’t the dot call automatically apply SIMD instructions? And why doesn’t my explicit function with a manual @simd annotation work either?

(I’ve checked whether this slowdown is due to slow handling of tails, as mentioned on the LoopVectorization page, but passing in arrays without tails does not make any difference in my testing.)
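
One way to poke at this (my own sketch, nothing authoritative) is to look at the generated native code and check whether the loop body uses packed vector instructions:

using InteractiveUtils   # provides @code_native outside the REPL

# Scalar calls to exp show up as plain `call` instructions; a vectorized loop would
# instead use packed registers/instructions (e.g. ymm/zmm, vaddpd, vmulpd on x86).
@code_native debuginfo=:none myfunc(rand(1_000))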


It’s a good question. I’m not able to figure it out, but I believe LoopVectorization.jl in some cases uses its own implementations of functions which are otherwise hard to SIMD. exp is probably one of them.

Julia is just using its general mechanisms to compute exp over your array, while numpy is free to ship a special array version, perhaps much like LoopVectorization does. Making SIMD happen in Julia in this case apparently takes more than a simple @simd annotation.
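
To illustrate that point with a toy (this is not what LoopVectorization actually does, just an example of why the shape of the code matters): a branch-free polynomial body vectorizes easily under @simd, while Base.exp contains branches and special-case handling that the auto-vectorizer won’t touch.

# Toy only: a short Taylor polynomial, crude on [0, 1) and nowhere near a real exp.
exp_poly(x) = evalpoly(x, (1.0, 1.0, 1/2, 1/6, 1/24, 1/120, 1/720, 1/5040))

function map_exp_poly(arr)
    out = similar(arr)
    @inbounds @simd for i in eachindex(arr)
        out[i] = exp_poly(arr[i])   # straight-line body: the compiler can keep this in vector registers
    end
    return out
end

My understanding is that LoopVectorization ships carefully written, branch-reduced versions of exp and friends, so the whole loop can stay vectorized.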

I just tried this, and I’m getting 402 μs in Julia 1.7.3 and 435 μs with numpy. I would have expected them to be completely equal, assuming they are doing the same thing. I’m surprised there’s a difference one way or the other.


Curiously, on my machine the numpy np.exp(arr) call runs in ~960 μs, the plain (no LoopVectorization.jl) Julia code in ~712 μs, and the @turbo’ed code in ~142 μs.

I got the same results with both v1.8.0-rc3 and 1.7.3.

Which CPU do you have?

That is probably not a good assumption.
I suspect numpy uses your system libm, while Julia uses implementations of the math functions written in Julia, following the same techniques libm authors use in terms of balancing speed and accuracy, but potentially ending up at a different point.
Even if exp isn’t one of the functions we have tweaked and optimized, it would still be based on the OpenLibm definition, which is probably not the same libm your system uses.
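
If you want to see how much the particular implementation matters, here’s a rough sketch (caveat: which library the :exp symbol resolves to depends on what is loaded in the process, so on many setups this will hit openlibm rather than your actual system libm):

# Compare Julia's pure-Julia exp against a C libm exp reached via ccall.
libm_exp(x::Float64) = ccall(:exp, Cdouble, (Cdouble,), x)

using BenchmarkTools
y = rand(100_000)
@btime exp.($y);
@btime libm_exp.($y);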


Just as another data point:

vanilla Julia:
600 μs

turbo Julia:
150 μs

Python numpy:
650 μs

CPU: Intel i7-8850H (6 cores) @ 2.6 GHz


Possibly related:


The speedup in Python is very dependent on how numpy has been compiled. I have tested on various systems and seen the numpy code variably match the speed of the @turbo Julia version or the plain Julia version. Evidently some builds make use of SIMD instructions and some do not.

That’s fine, but it doesn’t address the questions I asked earlier:

  1. Why isn’t exp.(arr) automatically applying SIMD optimizations?
  2. Why doesn’t the explicit @simd annotation in my manual loop apply SIMD instructions either?

We generally don’t do this.

See the linked PR, and try this on nightly to see if you observe any difference.

Basically, the reason is that Julia can do this if it is able to inline the code, but generically turning scalar code into vector code purely in the compiler is a very complicated optimization.
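
To make that concrete with a made-up example: a tiny function inlines into the broadcast loop and the compiler will happily vectorize it, while something the size of exp does not inline, so each iteration ends up as an opaque scalar call.

double(x) = 2.0 * x

arr = rand(100_000)
double.(arr)   # body inlines into the broadcast loop; LLVM can emit packed multiplies
exp.(arr)      # exp is too large to inline, so the loop makes a scalar call per element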

Maybe there is something to Python’s vectorized style then /s

and, *cough*, vmap


There is no reason to expect Julia to outperform Python on a single call like this. A significant difference in either direction is just low-hanging fruit for the other side (in this case, the Julia version failing to SIMD). This function is already optimized in Python (or rather, in C or whatever language was used to write it under the hood). Julia’s performance benefits only really factor in when you are doing more complicated sequences of operations that lack a bespoke implementation.


One thing we could do that would make this better is a better interface for letting the compiler replace scalar code with custom-provided vector code. I’m not fully sure what that interface should look like, though.

1.8.0-rc3:

julia> @btime exp.(x) setup=(x = rand(100_000));
  383.979 μs (2 allocations: 781.30 KiB)

current master:

Julia Version 1.9.0-DEV.1086
Commit 01c0778057* (2022-08-05 06:31 UTC)

julia> @btime exp.(x) setup=(x = rand(100_000));
  171.195 μs (2 allocations: 781.30 KiB)

(AMD CPU, so no AVX-512)


Perhaps I’m misunderstanding, but doesn’t the copyto! invoked in broadcasting use the @simd annotation?