Seemingly unnecessary allocation within for loop

I’m trying to create a simulation of an elastic string generalized to an arbitrary number of dimensions. To do so I track position and momentum (with an extra array to store the intermediate positions), and use the Euler Method to update. Conceptually, for the i^{th} cell of the array, an update from iteration n-1 to iteration n is:

h_i^{(n-1)} = x_i^{(n-1)} - \sum_a x_a^{(n-1)}
p_i^{(n)} = p_{i}^{(n-1)} - h_i^{(n-1)}
x_i^{(n)} = p_i^{(n)} + x_i^{(n-1)}

…where h is the “height” of a cell relative to its adjacent cells (the sum runs over the indices a of the adjacent cells; for a string in two dimensions those are i-1 and i+1), p is the momentum, and x is the position. I’ve stripped out mass, spring coefficients, damping, etc. for simplicity.

The simplified code is below. I ran it with julia --track-allocation=user and, for each line with a non-zero allocation count, appended a comment with the number of bytes shown in the resulting *.mem file.

using Profile

mutable struct ElasticOrthotope
    x::Array{Float64} # position
    y::Array{Float64} # "next" position
    p::Array{Float64} # momentum
    function ElasticOrthotope(m::Integer...)
        return new((zeros(m...) for i=1:3)...)
    end
end

function update!(eo::ElasticOrthotope, t::Integer=1)
    n = ndims(eo.x)
    adjacents = Tuple(CartesianIndex(Tuple(j==k ? a : 0 for k=1:n)) for j=1:n for a=(-1,1)) # 112
    for _=1:t
        @inbounds for i in CartesianIndices(Tuple(2:lastindex(eo.x, j)-1 for j=1:n))        # 68800
            s = sum(eo.x[i+a] for a in adjacents)                                           # 115200
            h = eo.x[i] - s / 2n                                                            # 38400
            eo.p[i] -= h                                                                    # 25600
            eo.y[i] = eo.x[i] + eo.p[i]                                                     # 7200
        end
        eo.x, eo.y = eo.y, eo.x
    end
    return eo
end

eo = ElasticOrthotope(4, 6)
eo.p[2,2] = 0.1
update!(eo) # compile first
Profile.clear_malloc_data()
update!(eo, 100)

What can I do to get rid of the allocations? The one outside of the loop makes sense to me, but I can’t wrap my head around why anything within the loop (except maybe sum) would result in anything being put on the heap.

This is secondary, but is there a way to calculate s without a comprehension so that I can use @turbo around the loop? I’m not sure how much of a performance difference it would make, but I’ve seen @turbo outperform @inbounds in some cases.

Context: I’m making a video on Julia for a class I’m teaching. I’ve already used this example (in C++) for several videos, so for cohesiveness I’d like to keep it. Broadcasting works without allocating (but it iterates over more memory, so it’s slower), and a macro or a generator gets me the same speed as the C++ code. My goal, though, is to show students that it’s possible to match C++ speed, and be more versatile with less effort, without resorting to metaprogramming.

Array{Float64} without the dimension parameter is not a concrete type, so your fields are abstractly typed. I suggest you use:

mutable struct ElasticOrthotope{A}
    x::A
    ...

That should improve things, to start.
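
A quick way to see the issue in the REPL (nothing specific to your code, just checking concreteness):

julia> isconcretetype(Array{Float64})   # dimensionality left unspecified
false

julia> isconcretetype(Matrix{Float64})  # same as Array{Float64,2}
true

With non-concrete field types, the compiler cannot pin down the type of eo.x, eo.y, or eo.p at the point of use, so operations that depend on the array's dimensionality fall back to dynamic dispatch and heap allocation.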

3 Likes

Maybe even without making the struct mutable. Immutability here will not prevent updating or growing the inner arrays.
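
For example (a minimal sketch with an illustrative name, not the full ElasticOrthotope): the struct binding is immutable, but the arrays it points to can still be modified in place.

struct Springs
    x::Vector{Float64}
    p::Vector{Float64}
end

s = Springs(zeros(5), zeros(5))
s.p[3] = 0.1     # fine: mutates the array's contents, not the struct
push!(s.x, 0.0)  # fine: grows the array
# s.x = zeros(5) # would error: reassigning a field requires a mutable struct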

1 Like

In particular, the concrete type you want is Array{Float64,1} (1-dimensional), or equivalently Vector{Float64}.

(Note that if you want to implement a simple explicit timestepping scheme for a wave equation, I would recommend a leapfrog scheme rather than Euler.)

2 Likes

In the example given above it’s Array{Float64,2} (or Matrix{Float64}), but maybe it’s supposed to work for any dimension.

This is type-unstable.
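
One quick way to confirm (assuming the definitions from the first post are loaded):

using InteractiveUtils  # @code_warntype is available in the REPL, but needs this in a script
@code_warntype update!(eo)

The non-concrete types (printed in red in a terminal) show where the compiler loses track of the array's dimensionality and of the index types derived from it.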

(Also, it seems like you are storing all of the timesteps? Normally, one would only store the current and previous timestep or similar in this kind of simulation; otherwise the memory usage is proportional to the simulation time.)

1 Like

Also, your equations look wrong to me. It seems like you are trying to model a scalar wave equation \ddot{x} = \nabla^2 x, but your equations are not quite a discretization of this equation. For example, the first term in the first equation should be multiplied by twice the number of dimensions.
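
Concretely (a hedged reading of that correction, writing d for the number of spatial dimensions and taking unit grid spacing), the first equation would become

h_i^{(n-1)} = 2d \, x_i^{(n-1)} - \sum_a x_a^{(n-1)} \approx -\left(\nabla^2 x\right)_i^{(n-1)}

which is the standard second-order central-difference approximation of -\nabla^2 x.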

In general, I would look into discretization schemes for hyperbolic equations (e.g. the leapfrog scheme I mentioned), because there are lots of ways to go wrong here. Or you could use DifferentialEquations.jl to handle the time integration (i.e. “method of lines”).
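
As a rough method-of-lines sketch (my own illustration, not code from this thread): discretize the Laplacian by hand and let the ODE solver do the time integration. Assumptions here: a 1-D string, unit grid spacing, fixed (Dirichlet) ends, and a state vector holding [u; v] with v = ∂u/∂t.

using DifferentialEquations

function wave_rhs!(dstate, state, c², t)
    n = length(state) ÷ 2
    u,  v  = view(state, 1:n),  view(state, n+1:2n)
    du, dv = view(dstate, 1:n), view(dstate, n+1:2n)
    du .= v                                      # ∂u/∂t = v
    fill!(dv, 0.0)                               # boundary accelerations stay zero
    for i in 2:n-1
        dv[i] = c² * (u[i-1] - 2u[i] + u[i+1])   # ∂v/∂t = c²∇²u, unit spacing
    end
end

n = 200
u0 = [exp(-((2i/n - 1)/0.1)^2) for i in 1:n]     # initial pulse, starting at rest
prob = ODEProblem(wave_rhs!, vcat(u0, zeros(n)), (0.0, 1.0), 1.0)
sol = solve(prob, Tsit5())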

Note that a useful technique is to use “ghost cells” to handle boundary conditions, as described here: Finite difference Laplacian with five-point stencil - #2 by stevengj

3 Likes

For fun, I wrote out an implementation of allocation-free explicit timestepping for an N-dimensional scalar wave equation, using ghost-cell boundary conditions and centered-difference approximations (equivalent to a leapfrog scheme):

"""
    u, uprev, unew = wavepropagate!(u, timesteps, cΔtΔx⁻¹=0.9/√N, uprev=copy(u), unew=copy(u))

Time-evolve `u` for `timesteps` timesteps of the equation ∂²u/∂t² = c²∇²u,
using centered-difference approximations in space and time with discretization
sizes Δx and Δt, respectively, given the ratio cΔtΔx⁻¹ (which must be < 1/√N
for CFL stability).

Dirichlet boundary conditions are used, based on the initial boundary values in `u`.
The `uprev` array should contain `u` at the previous timestep, defaulting to `copy(u)`
(corresponding to initial conditions at rest).   The `unew` array is used for storing
the subsequent timestep and should be of the same size and type as `u`.
"""
function wavepropagate!(u::AbstractArray{T, N}, timesteps::Integer,
                       cΔtΔx⁻¹::Real=0.9/√N,
                       uprev::AbstractArray{T, N}=copy(u),
                       unew::AbstractArray{T, N}=copy(u)) where {T<:Number,N}
    cΔtΔx⁻¹ < 1/√N || throw(ArgumentError("cΔtΔx⁻¹ = $cΔtΔx⁻¹ violates CFL stability condition"))
    c²Δt²Δx⁻² = (cΔtΔx⁻¹)^2
    
    # construct tuple of unit vectors along N directions, for constructing nearest neighbors
    # ... using ntuple with Val(N) makes this type-stable with hard-coded loops
    unitvecs = ntuple(i -> CartesianIndex(ntuple(==(i), Val(N))), Val(N))
    
    I = CartesianIndices(u)
    Ifirst, Ilast = first(I), last(I)
    I1 = oneunit(Ifirst)
    
    for t = 1:timesteps
    
        # use the "ghost cell" technique for boundary conditions:
        # only timestep in the interior, and update boundaries separately (if at all)
        @inbounds for i in Ifirst+I1:Ilast-I1
            # compute a discrete (center-difference) Laplacian of u
            ∇²u = -2N * u[i]
            for uvec in unitvecs
                ∇²u += u[i+uvec]+u[i-uvec]
            end
            
            # update u via center-difference approximation of ∂²u/∂t² = c²∇²u
            unew[i] = 2u[i] - uprev[i] + c²Δt²Δx⁻² * ∇²u
        end
        
        # here, you would update the boundary "pixels" of unew based on whatever
        # boundary condition you want.  Not updating them, as we do here,
        # corresponds to Dirichlet boundary conditions
        
        u, uprev, unew = unew, u, uprev # cycle the arrays
    end
    
    return u, uprev, unew
end

For example, here is the output in 1d, where you can see an initial “pulse” propagating outwards:

using PyPlot
x = range(-1, 1, length=200)
u = @. exp(-(x/0.1)^2)
plot(x, u)
uprev, unew = copy(u), copy(u)
for i = 1:4
    u, uprev, unew = wavepropagate!(u, 40, sqrt(0.2), uprev, unew)
    plot(x, u)
end
xlabel(L"x")
ylabel(L"u(x,t)")
legend([L"t = %$n \Delta t" for n = (0:4)*40])

20 Likes

Wow. Thanks for sharing!

For reference, pages 128-129 on “overlap regions” in chapter 5 of Principles of Parallel Programming by Lin and Snyder, which you refer to in a related thread (posted as JPGs because Discourse does not allow PDF attachments).


Would it be possible to highlight some differences/benefits of the ghost-cell technique compared to the perfectly matched layer technique?
Thank you.

Ghost cells are a simple software-engineering trick to impose boundary conditions in mesh-discretization methods — you add an extra layer of pixels around the boundary to implement the boundary conditions separate from the main computational code. This way, your main computational code (e.g. the computation of the discrete Laplacian above) doesn’t need to do anything special at the boundaries, because it is only looping over the interior. They are similarly used for distributed sparse matrix–vector operations (e.g. you’ll see them in PartitionedArrays.jl and PETSc) to separate the communication parts (ghost-cell updates) from the matrix–vector computation (which can then use serial matrix–vector multiplication routines).
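
For illustration (a minimal 1-D sketch with made-up names, not tied to the code above): the "physical" cells are u[2:end-1], while u[1] and u[end] are ghost cells whose only job is to encode the boundary condition.

# Fill the ghost cells according to the desired boundary condition.
function apply_boundary!(u::AbstractVector, bc::Symbol)
    if bc === :neumann        # zero-gradient: mirror the neighboring interior cell
        u[1], u[end] = u[2], u[end-1]
    elseif bc === :periodic   # wrap around to the opposite side
        u[1], u[end] = u[end-1], u[2]
    end                       # leaving the ghosts untouched gives fixed (Dirichlet) values
    return u
end

# The interior update can then use u[i-1] and u[i+1] unconditionally,
# with no special cases at the edges.
function smooth!(unew, u, bc::Symbol=:neumann)
    apply_boundary!(u, bc)
    for i in 2:length(u)-1
        unew[i] = (u[i-1] + u[i+1]) / 2
    end
    return unew
end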

A PML is not a boundary condition at all, so it is completely unrelated to ghost cells. A PML, instead, is a mathematical transformation to a wave equation, usually applied in a layer adjacent to the boundaries, that transforms propagating waves into decaying waves without creating reflections. This is usually employed to implement outgoing “radiation” boundary conditions, but as I said it is actually a boundary layer (transformed equations) and not a boundary condition per se — in fact, a PML makes the boundary condition of the adjacent boundary nearly irrelevant, because the waves have mostly decayed away by the time they reach the boundary. (But you still need a boundary condition! So often you use both a PML and ghost cells!). See my PML notes.

5 Likes

For some reason, if I replace the ∇²u accumulation above with ∇²u = -2N * u[i] + sum(uvec -> u[i+uvec]+u[i-uvec], unitvecs), I get a large number of allocations (and complaints from @code_warntype). This surprises me, since unitvecs should be a type-stable tuple, and hence the sum call should be type-stable and unrolled. Any ideas?

2 Likes

Looks like an inference bug triggered by referencing u inside the closure. It seems like wrapping the code using a let block works around the issue:

∇²u = let u = u
    -2N * u[i] + sum(uvec -> u[i+uvec]+u[i-uvec], unitvecs)
end

EDIT: apparently, the issue disappears if one comments out the line:

u, uprev, unew = unew, u, uprev # cycle the arrays

Even though all of the arrays have the same type, the compiler seems to get confused by this array relabelling.

2 Likes

Thanks, filed inference failure on array relabeling · Issue #43352 · JuliaLang/julia · GitHub

Turns out it’s just performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub again. There will be much rejoicing when we finally fix that one.
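
For what it's worth, another workaround (just a sketch; neighborsum is an illustrative helper, not something from the code above) is to move the closure into a small function, so that it captures the helper's own argument instead of the outer u that gets rebound by the array-cycling line:

# `v` is a plain argument that is never reassigned, so the closure passed
# to `sum` stays type-stable and unrolls over the tuple as expected.
neighborsum(v, i, unitvecs) = sum(uvec -> v[i+uvec] + v[i-uvec], unitvecs)

# ...then, inside the timestepping loop:
# ∇²u = -2N * u[i] + neighborsum(u, i, unitvecs)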

7 Likes

@lmiq: I parameterized ElasticOrthotope so that its fields have concrete types:

mutable struct ElasticOrthotope{N}
    x::Array{Float64,N}
    y::Array{Float64,N}
    p::Array{Float64,N}
    function ElasticOrthotope(m::Integer...)
        return new{length(m)}((zeros(m...) for i=1:3)...)
    end
end

This seems to have made a huge difference: @code_warntype went crazy before, but settles down with the change.

@stevengj: stable indices (by using ntuple to create CartesianIndices) made a world of difference, and with a couple tweaks this is as fast as the C++ code with no metaprogramming:

using OffsetArrays

function update!(eo::ElasticOrthotope{N}) where N
    x, y, p = (OffsetArray(a, ntuple(i->-1, N)...) for a in (eo.x, eo.y, eo.p))
    @inbounds @fastmath @simd for i in CartesianIndices(ntuple(s->1:size(eo.x)[s]-2, N))
        s = 0.0
        for j=1:N
            u = CartesianIndex(ntuple(==(j), N))
            s += x[i-u] + x[i+u]
        end
        p[i] -= x[i] - s / 2N
        y[i] = x[i] + p[i]
    end
    eo.x, eo.y = eo.y, eo.x
    return eo
end

I also made a broadcast version that’s pretty fast:

using LoopVectorization # for @turbo

function updatebcast!(eo::ElasticOrthotope{N}) where N
    interior = CartesianIndices(ntuple(i->2:size(eo.x)[i]-1, N))
    adjacent = ntuple(i->CartesianIndex(ntuple(==(i), N)), N)
    x = view(eo.x, interior)
    p = view(eo.p, interior)
    xa = (view(eo.x, interior.+a) for a in (adjacent..., map(-, adjacent)...))
    @turbo @. p -= x - +(xa...) / 2N
    @turbo @. x += p
    return eo
end

This one is a tad (~10%) slower, presumably since each update requires four total iterations over m×n×... matrices rather than just three.

Like you said, I’m pretty sure the discretization here is wrong (and obviously using a real integration method, or just using DifferentialEquations, would be best). The problem is that a lot of the students in the class haven’t been exposed to math beyond high school (e.g. some of the bioinformatics majors), so the hope was to keep the algorithm as simple as possible while still giving a reasonable-to-the-naked-eye approximation, which I think this does:

using Plots, LinearAlgebra   # for surface/@gif and norm

# Make an elastic membrane and stick some momentum in the middle
eo = ElasticOrthotope(100, 100)
a = 1.0                      # pulse amplitude (assumed value)
for i in CartesianIndices(ntuple(j->2:size(eo.x)[j]-1, ndims(eo.x)))
    d = norm(Tuple(i) .- (50.5, 50.5))
    eo.p[i] += a * exp(-d^2 / (2 * 6.25^2))
end
# Plot the first few hundred iterations
@gif for i=1:800
    update!(eo)
    surface(eo.x, clims=(-8,8), zlims=(-8,8), camera=(45,45),
            grid=false, colorbar=false, axis=false, ticks=1:0)
end every 8

(animated GIF: the wave propagating outward across the membrane)

As you may notice, what I’m doing with the OffsetArrays in update! is really weird: why not just iterate over 2:end-1,... rather than 1:end-2,...? The problem is that when I do that, performance suffers pretty dramatically:

function updatenormal!(eo::ElasticOrthotope{N}) where N
    @inbounds @fastmath @simd for i in CartesianIndices(ntuple(s->2:size(eo.x)[s]-1, N))
        s = 0.0
        for j=1:N
            u = CartesianIndex(ntuple(==(j), N))
            s += eo.x[i-u] + eo.x[i+u]
        end
        eo.p[i] -= eo.x[i] - s / 2N
        eo.y[i] = eo.x[i] + eo.p[i]
    end
    eo.x, eo.y = eo.y, eo.x
    return eo
end
julia> @benchmark update!(eo)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.542 μs … 157.609 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.972 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.736 μs ±   2.642 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▆▆▅▄▄▃▃▂▂▂▂▂▂▁▂▂ ▂▁                                       ▂
  █████████████████████████▇█▇▇█▇██▇▇▆▇▇▇▇▇▇▇▇▇▇█▇▇▆▆▆▆▆▆▅▅▅▄ █
  5.54 μs      Histogram: log(frequency) by time      14.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark updatenormal!(eo)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  15.456 μs … 103.643 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.434 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.993 μs ±   9.579 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▅▅▄▃▃▂▁▁▁      ▁                                ▁▁ ▁       ▁
  ███████████████████▇█▇▇▆▆▇▇▆▆▇▆▆▆▅▄▅▄▄▄▄▄▅▄▄▄▃▂▂▃▄██▇█▇█▆▄▄▆ █
  15.5 μs       Histogram: log(frequency) by time      58.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is really bizarre, since the indices being iterated over are exactly the same; if anything, I would expect the OffsetArrays to add a bit of overhead. Is there something special about iterating from 1, such that starting at 2 leads to missed optimizations? I’m not sure whether this is worth opening a performance issue for, especially since it may be user error on my part.

5 Likes