Vector addition in Julia slower than numpy on Linux

I tried this on Julia 1.4 on Ubuntu 19.10:

import numpy as np
a = np.random.rand(500_000)
b = np.random.rand(500_000)
%timeit a + b
115 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
using BenchmarkTools
a = rand(500_000)
b = rand(500_000)
@btime a .+ b;
545.357 μs (2 allocations: 3.81 MiB)

Strangely, when I tried it on a MacBook, Julia was actually faster than numpy, by maybe 20%.

This

a = np.random.rand(500_000)

errors with

    a = np.random.rand(500_000)
                             ^
SyntaxError: invalid syntax

Maybe it’s because of this line

%timeit a + b

You need to run this in an IPython console; the %timeit magic won’t work otherwise.
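If you’re in a plain Python REPL or a script instead, the standard-library timeit module gives a comparable measurement; a minimal sketch mirroring %timeit’s default of 7 runs:

```python
import timeit

# Plain-Python equivalent of `%timeit a + b` (which only works under IPython):
# 7 runs of 1000 loops each, matching %timeit's default repeat count.
runs = timeit.repeat(
    stmt="a + b",
    setup="import numpy as np; "
          "a = np.random.rand(500_000); b = np.random.rand(500_000)",
    repeat=7,
    number=1000,
)
best = min(runs) / 1000  # best-of-7 seconds per single addition
print(f"{best * 1e6:.1f} µs per loop (best of 7 runs, 1000 loops each)")
```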

I’m seeing similar times on my machine:

>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.0004245378542691469
julia> @btime $a .+ $b
  436.089 μs (2 allocations: 781.33 KiB)

If you want to pull out the big guns:

julia> using LoopVectorization

julia> @btime vmapnt!(+, $c, $a, $b)
  80.321 μs (0 allocations: 0 bytes)

edit: machine info

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)

The following are the results on my machine. Python is slightly faster.

import numpy as np
a = np.random.rand(500_000)
b = np.random.rand(500_000)
%timeit a + b
368 µs ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
using BenchmarkTools
a = rand(500_000)
b = rand(500_000)
@btime a .+ b;
495.616 μs (4 allocations: 3.81 MiB)

The OS information,

Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

That was with Python 2 and possibly an older version of numpy. In any case, underscores in integer literals work in Python 3 for me, so the error was a red herring.

Our Julia run times are more or less close, but my numpy timing’s ~4X faster than yours. So maybe it’s something new in numpy? What’s your numpy version? Mine’s 1.17.2

Same comment as @stillyslalom: our Julia timings match, but not numpy’s. What’s your numpy version? Given @stillyslalom’s LoopVectorization numbers, I’m guessing that may be what numpy does internally?
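For anyone comparing across machines, the version and build configuration can both be checked from Python (note that np.show_config() reports the linked BLAS/LAPACK, but a plain a + b is a ufunc and doesn’t go through BLAS):

```python
import numpy as np

# Report the NumPy version and build details when comparing benchmarks
# across machines. show_config() prints the BLAS/LAPACK NumPy was built
# against; elementwise `a + b` is a ufunc and does not use BLAS.
print(np.__version__)
np.show_config()
```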

With the most recent version of Numpy, I get the same timings:

>>> np.__version__
'1.18.3'
>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.00043422764912247656

numpy version: 1.17.4
python version: 3.7.6

OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
WORD_SIZE: 64

Is numpy using multithreading?

That is an impressive speedup!

I’m not seeing a speedup of that magnitude, and not until a much larger size.
First, to get it out of the way:

>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.000551134126028046
>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.000551150968996808

I actually see slightly worse performance from vmapnt! at this size, but overall performance is pretty similar to Python’s:

julia> using LoopVectorization, BenchmarkTools

julia> a = rand(500_000); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime $a .+ $b;
  563.405 μs (2 allocations: 3.81 MiB)

julia> @btime vmapnt!(+, $c1, $a, $b);
  522.161 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  511.718 μs (0 allocations: 0 bytes)

julia> c1 == c2
true

But if I make the arrays 10x smaller, I do see an improvement:

julia> a = rand(50_000); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  27.126 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  31.831 μs (0 allocations: 0 bytes)

julia> c1 == c2
true

Of course, performance improves by more than 10x because we’re far into the realm of memory bottlenecks. But that also means the nontemporal stores weren’t helping for vectors of length 500_000.
On my computer, the L3 cache can fit over 3,200,000 doubles.

julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[3] ÷ sizeof(Float64)
3244032

julia> versioninfo()
Julia Version 1.5.0-DEV.696
Commit e09b3d6d1a* (2020-04-26 16:15 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.0 (ORCJIT, skylake)

So it can comfortably hold all three arrays.
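For reference, the footprint arithmetic behind that claim, sketched in Python (the L3 figure is this machine’s, taken from the code above; adjust for your own CPU):

```python
# Footprint arithmetic for c .= a .+ b with length-500_000 Float64 arrays.
# The L3 figure (3,244,032 doubles) is this machine's, reported above.
n = 500_000
sizeof_double = 8                       # bytes per Float64
working_set = 3 * n * sizeof_double     # a, b, and the destination c
l3_bytes = 3_244_032 * sizeof_double

print(f"working set: {working_set / 2**20:.2f} MiB")  # 11.44 MiB
print(f"L3 capacity: {l3_bytes / 2**20:.2f} MiB")     # 24.75 MiB
```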
Now let’s see what happens when two arrays just fit in the L3 cache:

julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[3] ÷ (2sizeof(Float64))
1622016

julia> a = rand(L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  1.616 ms (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  1.706 ms (0 allocations: 0 bytes)

julia> c1 == c2
true

Small difference. Make the arrays 50% longer:

julia> a = rand(3L >>> 1); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  2.668 ms (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  3.060 ms (0 allocations: 0 bytes)

julia> c1 == c2
true

Letting 1 array fill the cache:

julia> a = rand(2L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  3.411 ms (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  4.135 ms (0 allocations: 0 bytes)

julia> c1 == c2
true

Only half an array fits in the L3:

julia> a = rand(4L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  7.380 ms (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  9.140 ms (0 allocations: 0 bytes)

julia> c1 == c2
true

Dumps for the L1 and L2 caches

For the L1 cache (with L redefined analogously from CACHE_SIZE[1]):

julia> a = rand(L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  827.247 ns (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  393.866 ns (0 allocations: 0 bytes)

julia> c1 == c2
true

julia> a = rand(2L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  1.864 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  720.099 ns (0 allocations: 0 bytes)

julia> c1 == c2
true

julia> a = rand(4L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  3.767 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  1.391 μs (0 allocations: 0 bytes)

julia> c1 == c2
true

and the L2:

julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[2] ÷ (2sizeof(Float64))
65536

julia> a = rand(L >>> 1); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  15.050 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  6.783 μs (0 allocations: 0 bytes)

julia> a = rand(3L >>> 2); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  27.858 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  25.433 μs (0 allocations: 0 bytes)

julia> a = rand(L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  43.216 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  59.119 μs (0 allocations: 0 bytes)

julia> c1 == c2
true

julia> a = rand(2L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  137.887 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  131.966 μs (0 allocations: 0 bytes)

julia> c1 == c2
true

julia> a = rand(4L); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapnt!(+, $c1, $a, $b);
  277.129 μs (0 allocations: 0 bytes)

julia> @btime $c2 .= $a .+ $b;
  267.334 μs (0 allocations: 0 bytes)

julia> c1 == c2
true

On my computer, when streaming single-threaded additions, the nontemporal stores seem to lead to better performance when either:

  • a and b are much larger than the L3 cache
  • a and b just barely fit in the L2 cache.

And equal or worse performance otherwise. I never get anything close to your 5x gain.
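For comparison, the same kind of size sweep can be replicated on the NumPy side; a rough sketch (the lengths are illustrative, not tied to any particular cache size):

```python
import timeit
import numpy as np

# Rough NumPy analogue of the size sweep above: time out[:] = a + b at
# growing lengths and watch ns/element climb once the working set no
# longer fits in cache.
for n in (50_000, 500_000, 5_000_000):
    a, b = np.random.rand(n), np.random.rand(n)
    out = np.empty_like(a)
    t = min(timeit.repeat(lambda: np.add(a, b, out=out), repeat=5, number=20)) / 20
    print(f"n={n:>9}: {t / n * 1e9:.3f} ns/element")
```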

I need to use multithreading to just barely beat your time:

julia> using LoopVectorization, BenchmarkTools

julia> a = rand(500_000); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime vmapntt!(+, $c1, $a, $b);
  64.925 μs (93 allocations: 12.89 KiB)

On my (slow) machine Julia is faster:

Python 3.8.2 (default, Apr  8 2020, 14:31:25) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np 
   ...: a = np.random.rand(500_000) 
   ...: b = np.random.rand(500_000) 
   ...: %timeit a + b                                                                          
1.29 ms ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: np.__version__                                                                         
Out[2]: '1.18.3'
julia> using BenchmarkTools

julia> a = rand(500_000);

julia> b = rand(500_000);

julia> @btime a .+ b;
  1.195 ms (4 allocations: 3.81 MiB)

julia> @btime $a .+ $b;
  1.176 ms (2 allocations: 3.81 MiB)

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, broadwell)

Edit: Don’t know if it is fairer to compare with or without interpolation…

I think numpy is using multithreading. On my system:

In [3]: timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
Out[3]: 0.0003835020289999989

In [4]: timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
Out[4]: 0.0003862037549999968

In [5]: timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=100000)/100000
Out[5]: 0.0003821436223

In the last run I used a huge number of repetitions so I could clearly watch the CPU usage: Python used about 3 cores (on my quad-core MacBook Pro, though with many other things running at the same time).

With export JULIA_NUM_THREADS=4 and using Strided:

julia> a = rand(500_000); b = rand(length(a));

julia> c1 = similar(a); c2 = similar(a);

julia> @btime $a .+ $b;
  470.189 μs (2 allocations: 3.81 MiB)

julia> @btime @strided $a .+ $b;
  417.899 μs (38 allocations: 3.82 MiB)

julia> @btime $c1 .= $a .+ $b;
  451.231 μs (0 allocations: 0 bytes)

julia> @btime @strided $c2 .= $a .+ $b;
  319.303 μs (34 allocations: 3.38 KiB)

Not sure why there is such a big difference between the allocating and non-allocating versions. But the threading strategy is certainly overly complicated (and thus creates too much overhead) for such a simple problem.

julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

Possibly of relevance: sum(a) is now 30% slower than NumPy · Issue #30290 · JuliaLang/julia · GitHub

I guess that’s the case. But even so, I started Julia with 4 threads, set the number of BLAS threads to 4, and numpy still beats Julia by 4X.

I think in Julia a .+ b is inherently single threaded, even if you start Julia with 4 threads.
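For comparison on the numpy side, you can hand-roll a multithreaded addition yourself, since numpy ufuncs release the GIL while they run. A minimal sketch (threaded_add is a made-up helper, and the chunk count is arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def threaded_add(a, b, out, nthreads=4):
    """Hypothetical helper: compute out[:] = a + b in nthreads chunks.
    NumPy ufuncs release the GIL, so the chunks genuinely run in parallel."""
    bounds = np.linspace(0, len(a), nthreads + 1).astype(int)
    with ThreadPoolExecutor(nthreads) as pool:
        futures = [pool.submit(np.add, a[lo:hi], b[lo:hi], out=out[lo:hi])
                   for lo, hi in zip(bounds[:-1], bounds[1:])]
        for f in futures:
            f.result()  # propagate any worker exceptions
    return out

a, b = np.random.rand(500_000), np.random.rand(500_000)
c = threaded_add(a, b, np.empty_like(a))
assert np.array_equal(c, a + b)
```

Whether this actually beats the single-threaded version depends on memory bandwidth, since the operation is bandwidth-bound at this size.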

The extreme speedup I saw was (sadly) due to user error: I accidentally used arrays of length 100,000 rather than 500,000 for my vmapnt! run. Don’t try to benchmark past midnight, folks!

julia> @btime @. $c1 = $a + $b;
  407.343 μs (0 allocations: 0 bytes)

julia> @btime vmapnt!(+, $c1, $a, $b);
  407.012 μs (0 allocations: 0 bytes)

I see. But, strangely, Julia matched numpy on my MacBook; I only see the 4X mismatch on Ubuntu 19.10.

Maybe something odd is going on with Ubuntu. On macOS it’s pretty close:

In [11]: %timeit a + b                                                          
500 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

julia> @btime $a .+ $b;
  500.413 μs (2 allocations: 3.81 MiB)

Numpy 1.18.1
Julia 1.4.0