A curious case of allocating behavior

performance

#1

Hello,
while trying to implement a kind of array where each element is computed on demand by a function (the goal is to avoid storing the full array – it’s a generalized version of what is discussed here), I encountered a curious case of mysterious allocation behavior, and need some advice for how to proceed. In short, my module works properly and allocation-free for some functions, but causes allocations with other functions that are only slightly more complex.

The following is my (minimal) implementation for this array type. The constructor ComputedArray takes a function and a list of coordinate grids, such that if A=ComputedArray(f,xs,ys) then A[i,j]==f(xs[i],ys[j]) etc for higher dimensions.

module TestComputedArrays

import Base: size, getindex
export ComputedArray

struct ComputedArray{F,C,T,N} <: AbstractArray{T,N}
   fn     :: F
   coords :: C
end

function ComputedArray(fn, coords...)
   N = length(coords)
   argtypes = map(eltype, coords)
   T = Base.return_types(fn, argtypes)[1]
   ComputedArray{typeof(fn), typeof(coords), T, N}(fn, coords)
end

size(A::ComputedArray) = map(length, A.coords)

@generated function getindex(A::ComputedArray{F,C,T,N}, I::Vararg{Int,N}) where {F,C,T,N}
   ex = :( A.fn() )
   for d = 1:N
      push!(ex.args, :( A.coords[$d][I[$d]] ))
   end
   return ex
end

end # module

Here is some example code showing the curious behavior:

include("TestComputedArrays.jl")
using TestComputedArrays
using BenchmarkTools

v(a,b,c,d) = hypot(a-c,b-d)
w(a,b,c,d) = 1/hypot(a-c,b-d)

xs = linspace(0,1,6)
ys = collect(xs)

vy = ComputedArray(v,ys,ys,ys,ys)
wy = ComputedArray(w,ys,ys,ys,ys)
vx = ComputedArray(v,xs,xs,xs,xs)

println("ARRAY ACCESS")
@btime $vy[1,2,3,4]
@btime $wy[1,2,3,4]
@btime $vx[1,2,3,4]

println("COLLECT")
@btime collect($vy)
@btime collect($wy)
@btime collect($vx)

which produces the following output (Julia 0.6.0-rc2):

ARRAY ACCESS
7.957 ns (0 allocations: 0 bytes)
23.622 ns (1 allocation: 16 bytes)
48.681 ns (2 allocations: 224 bytes)
COLLECT
13.458 μs (2 allocations: 10.30 KiB)
56.907 μs (1298 allocations: 30.55 KiB)
82.694 μs (2594 allocations: 293.80 KiB)

Observations:

  • [vy] No allocations for the v function when the underlying coords are Arrays.
  • [wy] One allocation for the w function, which is simply the reciprocal of v. w itself is not allocating.
  • [vx] Two allocations if the underlying coords are linspaces instead of Arrays, although the common wisdom is that linspace is as performant as Array.
  • The COLLECT part shows that the allocations are indeed happening for each array access.
  • @code_warntype wy[1,2,3,4] reveals nothing suspicious, likewise for vx.

The behavior seems to depend on the “complexity” of the underlying function. It also seems to depend on dimensionality, e.g. if I use a function with only two or three arguments (instead of four as in the example), then I find the array access to be non-allocating (but the linspaces still cause allocations).

My questions:

  • What causes these allocations? My guess is that it’s related to inlining. Correct?
  • How can I avoid these allocations? I’m okay with forcing coords to be Arrays for simplicity, but it needs to be allocation-free even for “complex” functions. Perhaps the @generated function approach for my getindex is the culprit, but AFAICT it produces very straightforward code for each dimensionality N.

#2

Bingo. You’re hitting what is frequently referred to as the splatting penalty. Splatting arguments into varargs functions like getindex can be allocation-free, but only the function inlines into the caller. Otherwise, Julia needs to allocate a tuple on the heap in order to pass the indices to getindex. That’s where you’re seeing the bulk of the allocations come from with the more complicated versions of your object — the computation becomes complex enough that the indexing call no longer inlines. So you can add an inline annotation to the function to ensure it inlines and avoid the allocation.

Adding inline annotations to @generated functions is also tricky. Doing the obvious things (@inline @generated function …) doesn’t work because you’re actually annotating the generator function (and not the generated one). Instead, you can use the un-exported helper function to do this:

    return :(Base.@_inline_meta; $ex)

That puts the inline annotation into the generated function. See https://github.com/JuliaLang/julia/issues/13359 for more details.


#3

Thanks a lot, @mbauman! With the inline annotation, the allocations are gone for both the “complex” function and the linspace case. Perfect!