Tullio, LoopVectorization: `pointer` becomes slow for views because of `Base._memory_offset` for `ndims(...) >= 5`

While improving the performance of some code using Tullio.jl and LoopVectorization.jl, I noticed a significant performance hit when using views of Arrays with at least five dimensions. Since LoopVectorization.jl works with pointers, I traced the issue back to the following:

julia> using BenchmarkTools

julia> N = 5; data3big = randn(N,N,N,100); data3view = @view data3big[:,:,:,5]; data4big = randn(N,N,N,N,100); data4view = @view data4big[:,:,:,:,5];

julia> @benchmark pointer($data3view)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.104 ns (0.00% GC)
  median time:      2.106 ns (0.00% GC)
  mean time:        2.198 ns (0.00% GC)
  maximum time:     25.017 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark pointer($data4view)
BenchmarkTools.Trial: 
  memory estimate:  848 bytes
  allocs estimate:  21
  --------------
  minimum time:     1.217 μs (0.00% GC)
  median time:      1.276 μs (0.00% GC)
  mean time:        1.364 μs (1.07% GC)
  maximum time:     149.512 μs (97.88% GC)
  --------------
  samples:          10000
  evals/sample:     10

After some more digging into Base, it seems like the issue is

julia> @benchmark Base._memory_offset($(data3view.parent), $1, $1, $1, $5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.683 ns (0.00% GC)
  median time:      1.685 ns (0.00% GC)
  mean time:        1.737 ns (0.00% GC)
  maximum time:     19.646 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark Base._memory_offset($(data4view.parent), $1, $1, $1, $1, $5)
BenchmarkTools.Trial: 
  memory estimate:  848 bytes
  allocs estimate:  21
  --------------
  minimum time:     1.241 μs (0.00% GC)
  median time:      1.278 μs (0.00% GC)
  mean time:        1.374 μs (1.13% GC)
  maximum time:     158.943 μs (97.87% GC)
  --------------
  samples:          10000
  evals/sample:     10

Basically, Base._memory_offset becomes slow with five or more indices.
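
For reference, my understanding is that the view's pointer is just the parent's base pointer plus the memory offset of the view's first element (index (1, 1, 1, 1, 5) of the parent here), which is why that is the call I benchmarked. A quick sanity check, using the arrays from above:

# pointer(view) should match the parent's base pointer plus the byte offset of
# the view's first parent index; Ptr + Int arithmetic is in bytes, and
# Base._memory_offset already returns bytes.
pointer(data4view) == pointer(data4big) + Base._memory_offset(data4big, 1, 1, 1, 1, 5)  # expected: true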

  • Did I do something wrong?
  • Is there an open issue for that?
  • How can I fix the performance?

I’m using Julia v1.5.1.


That’s quite the performance hit.
It’d be best if this were fixed in Base, but in the meantime we could fix it in VectorizationBase.jl.
I’ll file a PR:

julia> my_memory_offset(x::DenseArray, I::Vararg{Any,N}) where {N} = (Base._to_linear_index(x, I...) - first(LinearIndices(x)))*Base.elsize(x)
my_memory_offset (generic function with 1 method)

julia> @benchmark my_memory_offset($(data4view.parent), 1, 1, 1, 1, 5)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.023 ns (0.00% GC)
  median time:      3.032 ns (0.00% GC)
  mean time:        3.049 ns (0.00% GC)
  maximum time:     13.230 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark Base._memory_offset($(data4view.parent), 1, 1, 1, 1, 5)
BenchmarkTools.Trial:
  memory estimate:  848 bytes
  allocs estimate:  21
  --------------
  minimum time:     1.473 μs (0.00% GC)
  median time:      1.541 μs (0.00% GC)
  mean time:        1.638 μs (1.74% GC)
  maximum time:     287.330 μs (99.05% GC)
  --------------
  samples:          10000
  evals/sample:     10

This matches the current definition in Base except for the signature. Basically, a compiler heuristic says it isn’t worth specializing on I; adding the N parameter forces specialization.
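
For comparison, the two signatures side by side (the first is my paraphrase of the Base method and the names are just illustrative, so treat the exact code as approximate):

# Paraphrase of the current Base method: the vararg I is only splatted through
# to _to_linear_index, and the heuristic decides not to specialize on it, which
# is where the allocations in the benchmarks above come from.
unspecialized_memory_offset(x::DenseArray, I...) =
    (Base._to_linear_index(x, I...) - first(LinearIndices(x))) * Base.elsize(x)

# Same body, but the explicit length parameter N forces a specialization for
# each number of indices.
specialized_memory_offset(x::DenseArray, I::Vararg{Any,N}) where {N} =
    (Base._to_linear_index(x, I...) - first(LinearIndices(x))) * Base.elsize(x)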

PR:
https://github.com/JuliaLang/julia/pull/37444
