While improving the performance of some code using Tullio.jl and LoopVectorization.jl, I recognized a significant performance impact when using view
s of Array
s with at least five indices. Since LoopVectorization.jl works with pointers, I traced the issue back to
julia> using BenchmarkTools
julia> N = 5; data3big = randn(N,N,N,100); data3view = @view data3big[:,:,:,5]; data4big = randn(N,N,N,N,100); data4view = @view data4big[:,:,:,:,5];
julia> @benchmark pointer($data3view)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.104 ns (0.00% GC)
median time: 2.106 ns (0.00% GC)
mean time: 2.198 ns (0.00% GC)
maximum time: 25.017 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark pointer($data4view)
BenchmarkTools.Trial:
memory estimate: 848 bytes
allocs estimate: 21
--------------
minimum time: 1.217 μs (0.00% GC)
median time: 1.276 μs (0.00% GC)
mean time: 1.364 μs (1.07% GC)
maximum time: 149.512 μs (97.88% GC)
--------------
samples: 10000
evals/sample: 10
After some more digging into Base, it seems like the issue is
julia> @benchmark Base._memory_offset($(data3view.parent), $1, $1, $1, $5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.683 ns (0.00% GC)
median time: 1.685 ns (0.00% GC)
mean time: 1.737 ns (0.00% GC)
maximum time: 19.646 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark Base._memory_offset($(data4view.parent), $1, $1, $1, $1, $5)
BenchmarkTools.Trial:
memory estimate: 848 bytes
allocs estimate: 21
--------------
minimum time: 1.241 μs (0.00% GC)
median time: 1.278 μs (0.00% GC)
mean time: 1.374 μs (1.13% GC)
maximum time: 158.943 μs (97.87% GC)
--------------
samples: 10000
evals/sample: 10
Basically, Base._memory_offset
becomes slow with five or more indices.
- Did I do something wrong?
- Is there an open issue for that?
- How can I fix the performance?
I’m using Julia v1.5.1
.