Of course I’ve also tried wrapping the Julia dot call inside a function, as well as manually looping myself, e.g.:
function myfunc(arr)
    result = similar(arr)
    # @inbounds belongs inside the body; applied to the whole
    # function definition it has no effect on the loop
    @inbounds @simd for i in eachindex(arr)
        result[i] = exp(arr[i])
    end
    return result
end
@benchmark myfunc($arr)
# Time (median): 569.613 μs
…which gives timing identical to the dotted version, exp.(arr).
Any idea what’s going on here? Some obvious sanity checks: both are Float64, neither uses fastmath, both allocate a new array, and both are single-threaded.
I’m no good at reading LLVM code, so I cannot confirm, but it might be that Julia fails to SIMD the computation. If you use LoopVectorization.jl, it works:
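A minimal sketch of what that looks like (the function name myfunc_turbo is my own, and timings will of course vary by machine):

using LoopVectorization

function myfunc_turbo(arr)
    result = similar(arr)
    # @turbo takes over the whole loop and can substitute its own vectorized exp
    @turbo for i in eachindex(arr)
        result[i] = exp(arr[i])
    end
    return result
end

@benchmark myfunc_turbo($arr)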
Yeah, I think that’s it! I see near-identical timings between numpy and Julia with @turbo. @turbo also speeds up my own explicit function, even though that already included a @simd macro.
So I guess the next question is: why doesn’t the dot call automatically apply SIMD instructions? And why doesn’t my explicit function with a manual @simd work?
(I’ve checked whether this slowdown is attributable to the slow handling of tails, as mentioned on the LoopVectorization page, but passing in arrays without tails made no difference in my own testing.)
It’s a good question. I’m not able to figure it out, but I believe LoopVectorization.jl in some cases substitutes its own implementations of functions that are hard to SIMD. exp is probably one of them.
Julia is just using the generic scalar exp over your array, while numpy is free to use a special array version, perhaps much like LoopVectorization does. Making SIMD happen in Julia in this case apparently takes more than a simple @simd annotation.
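One rough way to check this without really being able to read LLVM IR (just my usual heuristic): dump the IR and look for vector types such as <4 x double>. If the loop vectorized, they will be there; for the exp loop above you will likely see only scalar calls to exp.

# Inspect the compiled loop; vector types in the output mean it SIMD’d
@code_llvm debuginfo=:none myfunc(rand(10^5))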
I just tried this, and I’m getting 402 μs in Julia 1.7.3 and 435 μs with numpy. I would have expected them to be completely equal, assuming they are doing the same thing. I’m surprised there’s a difference one way or the other.
That is probably not a good assumption.
I suspect numpy uses your system libm, while Julia uses implementations of the math functions written in Julia, following the same techniques libm authors use to balance speed and accuracy, but potentially ending up at a different point in that trade-off. Even if exp isn’t one of the functions we have tweaked and optimized, it would still be based on the OpenLibm definition, which is probably not the same libm your system ships.
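You can compare the two from within Julia. A sketch, with the caveat that a bare :exp in a ccall resolves against whatever libm happens to be loaded in the process (typically OpenLibm in a stock Julia build):

x = 1.2345
exp(x)                               # Julia’s native, pure-Julia implementation
ccall(:exp, Cdouble, (Cdouble,), x)  # the C exp loaded in the process; may differ in the last bit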
The speedup in Python is very dependent on how numpy was compiled. I have tested on various systems and seen the numpy code variably match the speed of the @turbo Julia version or of the plain Julia one. Evidently some builds make use of SIMD instructions and some do not.
That’s fine, but it doesn’t address the question I asked earlier:
Basically, the reason is that Julia can do this when it is able to inline the code, but generically turning scalar code into vector code purely in the compiler is a very complicated optimization.
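For contrast, when the loop body is plain arithmetic the compiler can inline everything, and a bare @simd is enough. A minimal sketch (mysquaresum is a made-up example, not from this thread):

function mysquaresum(arr)
    s = zero(eltype(arr))
    @inbounds @simd for i in eachindex(arr)
        s += arr[i] * arr[i]  # simple, fully inlinable arithmetic: this loop does vectorize
    end
    return s
end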
There is no reason to expect Julia to outperform Python on a single call like this. A significant difference in either direction is just low-hanging fruit for the other side (in this case, the Julia version failing to SIMD). This function is already optimized in Python, or rather in C or whatever language was used to write it under the hood. Julia’s performance benefits only really kick in when you are doing more complicated sequences of operations that lack a bespoke implementation.
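As a hypothetical example of such a sequence: a compound elementwise expression like the one below fuses into a single loop with a single output allocation in Julia, whereas numpy evaluates it as a chain of separate C kernels, allocating a temporary array for each intermediate result.

# One fused pass over arr; numpy would build temporaries for
# exp(arr), sin(arr), and their product along the way.
compound(arr) = @. exp(arr) * sin(arr) + 1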
One thing we could do that would make this better is a better interface for letting the compiler replace scalar code with custom, user-provided vector code. I’m not fully sure what that interface should look like, though.