That is an impressive speedup! I'm not seeing that magnitude of improvement, though, and not until much larger sizes. First, to get it out of the way, here is the NumPy timing on my machine:
>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.000551134126028046
>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.000551150968996808
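As a rough sanity check on that NumPy number, here is the implied effective memory bandwidth. This is a back-of-the-envelope estimate that counts two reads and one write of 8-byte doubles and ignores write-allocate traffic:

```python
n = 500_000          # elements per array, as in the timeit setup above
secs = 0.000551      # seconds per `a + b`, from the timeit runs above
traffic = 3 * n * 8  # bytes moved: read a, read b, write the result
print(f"~{traffic / secs / 1e9:.1f} GB/s effective bandwidth")
```

Around 22 GB/s, which is consistent with these loops being bandwidth-bound rather than compute-bound.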
I actually see slightly worse performance at this size for vmapnt!, though performance is pretty similar to Python either way:
julia> using LoopVectorization, BenchmarkTools
julia> a = rand(500_000); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime $a .+ $b;
563.405 μs (2 allocations: 3.81 MiB)
julia> @btime vmapnt!(+, $c1, $a, $b);
522.161 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
511.718 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
But if I make the arrays 10x smaller, I do see an improvement:
julia> a = rand(50_000); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
27.126 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
31.831 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
Of course, the time improves by more than 10x because at 500_000 we're far into the realm of memory bottlenecks. But that also means the nontemporal stores weren't helping for vectors of length 500_000.
On my computer, the L3 cache can fit over 3,200,000 doubles.
julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[3] ÷ sizeof(Float64)
3244032
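The same capacity arithmetic, sketched in Python (the byte sizes here are inferred from the 3244032-double figure above, not measured independently):

```python
l3_doubles = 3_244_032     # CACHE_SIZE[3] ÷ sizeof(Float64), from above
l3_bytes = l3_doubles * 8  # 25_952_256 bytes, i.e. 24.75 MiB
# three arrays of n Float64s fit in the L3 when 3 * 8 * n <= l3_bytes:
n_max = l3_bytes // (3 * 8)
print(n_max, n_max > 500_000)
```

So a, b, and the output at length 500_000 all fit with room to spare.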
julia> versioninfo()
Julia Version 1.5.0-DEV.696
Commit e09b3d6d1a* (2020-04-26 16:15 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-10.0.0 (ORCJIT, skylake)
So it can comfortably hold all three arrays. Let's see what happens when two of the arrays just fit in the L3 cache:
julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[3] ÷ (2sizeof(Float64))
1622016
julia> a = rand(L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
1.616 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
1.706 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
Only a small difference. Now make the arrays 50% longer:
julia> a = rand(3L >>> 1); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
2.668 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
3.060 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
Letting one array fill the cache:
julia> a = rand(2L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
3.411 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
4.135 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
Only half an array fits in the L3:
julia> a = rand(4L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
7.380 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
9.140 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
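Collecting the four L3-regime runs above (times in ms, transcribed from the @btime output), the nontemporal advantage grows as the working set outgrows the cache:

```python
# (size, vmapnt! ms, in-place broadcast ms), transcribed from the runs above
l3_runs = [("L", 1.616, 1.706), ("1.5L", 2.668, 3.060),
           ("2L", 3.411, 4.135), ("4L", 7.380, 9.140)]
for label, nt, bc in l3_runs:
    print(f"{label}: broadcast / vmapnt! = {bc / nt:.2f}")
```

About a 6% edge at L, climbing to roughly 24% at 4L.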
Dumps for the L1 and L2 caches
For the L1 cache (with L redefined from CACHE_SIZE[1] ÷ (2sizeof(Float64)), analogous to the computation above):
julia> a = rand(L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
827.247 ns (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
393.866 ns (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(2L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
1.864 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
720.099 ns (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(4L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
3.767 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
1.391 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
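In the L1 regime the pattern flips: the nontemporal stores are a consistent 2-3x penalty, presumably because they bypass the cache that would otherwise absorb the writes (times in ns, transcribed from the runs above):

```python
# (size, vmapnt! ns, in-place broadcast ns), transcribed from the runs above
l1_runs = [("L", 827.247, 393.866), ("2L", 1864.0, 720.099),
           ("4L", 3767.0, 1391.0)]
for label, nt, bc in l1_runs:
    print(f"{label}: vmapnt! / broadcast = {nt / bc:.1f}")
```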
And for the L2 cache:
julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[2] ÷ (2sizeof(Float64))
65536
julia> a = rand(L >>> 1); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
15.050 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
6.783 μs (0 allocations: 0 bytes)
julia> a = rand(3L >>> 2); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
27.858 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
25.433 μs (0 allocations: 0 bytes)
julia> a = rand(L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
43.216 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
59.119 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(2L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
137.887 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
131.966 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(4L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
277.129 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
267.334 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
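To make "just barely fit" concrete, here is the working-set arithmetic for the L2 runs above, assuming the 1 MiB L2 implied by L = 65536:

```python
L = 65_536            # CACHE_SIZE[2] ÷ (2 * sizeof(Float64)), from above
l2_bytes = 2 * 8 * L  # implies a 1 MiB L2
for n in (L // 2, 3 * L // 4, L, 2 * L, 4 * L):
    ws = 2 * 8 * n    # bytes for a and b together (the output not counted)
    print(f"n = {n}: a+b = {ws / l2_bytes:.2f} of L2")
```

The only size in that dump where vmapnt! wins is n = L, exactly where a and b together fill the L2.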
On my computer, when streaming single-threaded additions, the nontemporal stores seem to lead to better performance when either:
- a and b are much larger than the L3 cache, or
- a and b just barely fit in the L2 cache,
and to equal or worse performance otherwise. I never see anything close to your 5x gain.
I need to use multithreading to just barely beat your time:
julia> using LoopVectorization, BenchmarkTools
julia> a = rand(500_000); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapntt!(+, $c1, $a, $b);
64.925 μs (93 allocations: 12.89 KiB)
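For scale, comparing that multithreaded vmapntt! time against the single-threaded NumPy baseline from the top of the post:

```python
numpy_s = 0.000551     # NumPy a + b at length 500_000, from above
vmapntt_s = 64.925e-6  # multithreaded vmapntt! at the same length
print(f"{numpy_s / vmapntt_s:.1f}x")
```

Roughly an 8.5x speedup, but that comes from threading, not from the nontemporal stores themselves.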