I’m trying to create a simulation of an elastic string generalized to an arbitrary number of dimensions. To do so I track position and momentum (with an extra array to store the intermediate positions), and use the Euler Method to update. Conceptually, for the i^{th} cell of the array, an update from iteration n-1 to iteration n is:
h_i^{(n-1)} = x_i^{(n-1)} - \frac{1}{2d}\sum_a x_a^{(n-1)}
p_i^{(n)} = p_i^{(n-1)} - h_i^{(n-1)}
x_i^{(n)} = p_i^{(n)} + x_i^{(n-1)}
…where h is the "height" of a cell relative to the average of its adjacent cells, a ranges over the 2d neighbors in d dimensions (in one dimension, that's just i-1 and i+1), p is the momentum, and x is the position. I've stripped mass, spring coefficients, damping, etc. for simplicity.
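For concreteness, here is a minimal one-dimensional illustration of the update rule above (my own sketch, not the actual simulation code):

```julia
# Minimal 1-D illustration of the update rule (not the real code):
# each interior cell is pulled toward the average of its two neighbors.
function step_string!(x::Vector{Float64}, p::Vector{Float64})
    y = copy(x)                            # endpoints stay fixed
    for i in 2:length(x)-1
        h = x[i] - (x[i-1] + x[i+1]) / 2   # height relative to neighbors
        p[i] -= h                          # momentum update
        y[i] = x[i] + p[i]                 # position update
    end
    copyto!(x, y)
    return x
end

x = [0.0, 1.0, 0.0]   # a small bump in the middle
p = zeros(3)
step_string!(x, p)     # x becomes [0.0, 0.0, 0.0]; p is now [0.0, -1.0, 0.0]
```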
The simplified code is below. I ran it with `julia --track-allocation=user` and, for each line with a non-zero allocation count, appended a comment showing the number of bytes reported in the resulting `*.mem` file.
```julia
using Profile

mutable struct ElasticOrthotope
    x::Array{Float64} # position
    y::Array{Float64} # "next" position
    p::Array{Float64} # momentum
    function ElasticOrthotope(m::Integer...)
        return new((zeros(m...) for i=1:3)...)
    end
end

function update!(eo::ElasticOrthotope, t::Integer=1)
    n = ndims(eo.x)
    adjacents = Tuple(CartesianIndex(Tuple(j==k ? a : 0 for k=1:n)) for j=1:n for a=(-1,1)) # 112
    for _=1:t
        @inbounds for i in CartesianIndices(Tuple(2:lastindex(eo.x, j)-1 for j=1:n)) # 68800
            s = sum(eo.x[i+a] for a in adjacents) # 115200
            h = eo.x[i] - s / 2n # 38400
            eo.p[i] -= h # 25600
            eo.y[i] = eo.x[i] + eo.p[i] # 7200
        end
        eo.x, eo.y = eo.y, eo.x
    end
    return eo
end

eo = ElasticOrthotope(4, 6)
eo.p[2,2] = 0.1
update!(eo) # compile first
Profile.clear_malloc_data()
update!(eo, 100)
```
What can I do to get rid of the allocations? The one outside of the loop makes sense to me, but I can’t wrap my head around why anything within the loop (except maybe sum) would result in anything being put on the heap.
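In case it helps, here's a minimal pattern I've been staring at (my own guess, not a confirmed diagnosis): the struct fields are declared `Array{Float64}` with no dimension parameter, which is an abstract type, and I wonder whether Cartesian indexing through such a field is what's allocating. A self-contained sketch with hypothetical `Loose`/`Tight` types:

```julia
mutable struct Loose
    x::Array{Float64}      # dimension parameter omitted: abstract field type
end
mutable struct Tight{N}
    x::Array{Float64,N}    # concrete once N is known
end

# Sum via Cartesian indices, mirroring the inner loop of update!
cartsum(s) = (t = 0.0; for i in CartesianIndices(s.x); t += s.x[i]; end; t)

a = Loose(zeros(4, 6))
b = Tight(zeros(4, 6))
cartsum(a); cartsum(b)     # compile first
@allocated(cartsum(a))     # nonzero: CartesianIndex{N} with N unknown gets boxed
@allocated(cartsum(b))     # zero after warmup
```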
This is secondary, but is there a way to calculate `s` without a generator expression so that I can use `@turbo` around the loop? I'm not sure how much of a performance difference it would make, but I've seen `@turbo` outperform `@inbounds` in some cases.
Context: I’m making a video on Julia for a class I’m teaching. I’ve already used this example (but in C++) for several videos, so for cohesiveness I’d like to keep it. I was able to use broadcasting without allocating (but I have to iterate over more memory so it’s slower), and using a macro or a generator has yielded the same speed as the C++ code, but my goal is to show students that it’s possible to get as fast as C++, and more versatile with less effort, without resorting to metaprogramming.
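For reference, the broadcast variant I mentioned looks roughly like this (a 2-D-only sketch; the names are made up and it may differ in detail from what I actually benchmarked):

```julia
# Hedged 2-D-only sketch of the broadcast variant (illustrative names):
# updates interior cells via fused broadcasts over views of the interior.
function update_bcast!(x::Matrix{Float64}, y::Matrix{Float64}, p::Matrix{Float64})
    c  = @view x[2:end-1, 2:end-1]   # center
    n_ = @view x[1:end-2, 2:end-1]   # north neighbor
    s_ = @view x[3:end,   2:end-1]   # south neighbor
    w_ = @view x[2:end-1, 1:end-2]   # west neighbor
    e_ = @view x[2:end-1, 3:end]     # east neighbor
    pc = @view p[2:end-1, 2:end-1]
    yc = @view y[2:end-1, 2:end-1]
    @. pc -= c - (n_ + s_ + w_ + e_) / 4   # h folded into the momentum update
    @. yc = c + pc
    return y
end

x, y, p = zeros(4, 6), zeros(4, 6), zeros(4, 6)
p[2, 2] = 0.1
update_bcast!(x, y, p)   # y[2,2] == 0.1
```

This touches each neighbor plane as a separate view, which is why it sweeps more memory than the single fused loop.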