Tullio, LoopVectorization: `pointer` becomes slow for views because of `Base._memory_offset` for `ndims(...) >= 5`

While improving the performance of some code using Tullio.jl and LoopVectorization.jl, I recognized a significant performance impact when using views of Arrays with at least five indices. Since LoopVectorization.jl works with pointers, I traced the issue back to

julia> using BenchmarkTools

julia> N = 5; data3big = randn(N,N,N,100); data3view = @view data3big[:,:,:,5]; data4big = randn(N,N,N,N,100); data4view = @view data4big[:,:,:,:,5];

julia> @benchmark pointer($data3view)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.104 ns (0.00% GC)
  median time:      2.106 ns (0.00% GC)
  mean time:        2.198 ns (0.00% GC)
  maximum time:     25.017 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark pointer($data4view)
BenchmarkTools.Trial: 
  memory estimate:  848 bytes
  allocs estimate:  21
  --------------
  minimum time:     1.217 μs (0.00% GC)
  median time:      1.276 μs (0.00% GC)
  mean time:        1.364 μs (1.07% GC)
  maximum time:     149.512 μs (97.88% GC)
  --------------
  samples:          10000
  evals/sample:     10

After some more digging into Base, it seems like the issue is

julia> @benchmark Base._memory_offset($(data3view.parent), $1, $1, $1, $5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.683 ns (0.00% GC)
  median time:      1.685 ns (0.00% GC)
  mean time:        1.737 ns (0.00% GC)
  maximum time:     19.646 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark Base._memory_offset($(data4view.parent), $1, $1, $1, $1, $5)
BenchmarkTools.Trial: 
  memory estimate:  848 bytes
  allocs estimate:  21
  --------------
  minimum time:     1.241 μs (0.00% GC)
  median time:      1.278 μs (0.00% GC)
  mean time:        1.374 μs (1.13% GC)
  maximum time:     158.943 μs (97.87% GC)
  --------------
  samples:          10000
  evals/sample:     10

Basically, Base._memory_offset becomes slow with five or more indices.

  • Did I do something wrong?
  • Is there an open issue for that?
  • How can I fix the performance?

I’m using Julia v1.5.1.

That’s quite the performance hit.
It’d be best if this were fixed in Base, but in the meantime we could fix it in VectorizationBase.jl
I’ll file a PR:

julia> my_memory_offset(x::DenseArray, I::Vararg{Any,N}) where {N} = (Base._to_linear_index(x, I...) - first(LinearIndices(x)))*Base.elsize(x)
my_memory_offset (generic function with 1 method)

julia> @benchmark my_memory_offset($(data4view.parent), 1, 1, 1, 1, 5)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.023 ns (0.00% GC)
  median time:      3.032 ns (0.00% GC)
  mean time:        3.049 ns (0.00% GC)
  maximum time:     13.230 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark Base._memory_offset($(data4view.parent), 1, 1, 1, 1, 5)
BenchmarkTools.Trial:
  memory estimate:  848 bytes
  allocs estimate:  21
  --------------
  minimum time:     1.473 μs (0.00% GC)
  median time:      1.541 μs (0.00% GC)
  mean time:        1.638 μs (1.74% GC)
  maximum time:     287.330 μs (99.05% GC)
  --------------
  samples:          10000
  evals/sample:     10

Current definition. Basically, a compiler heuristic says it isn’t worth specializing on I. Adding the N parameter forces specialization.

PR:
https://github.com/JuliaLang/julia/pull/37444