Reduce allocations for matrix differential equations function


#1

Here’s a sample function for use with a differential equation solver to solve a system of matrices equations, d M_i / dt =f(M_{1...n}). The matrix M_i is represented by m[:,:,i], and its time derivative is stored in dm. [random number just used as example, not my actual equation but just similar mathematical form (involving matrix-matrix production and matrix-scalar production); and the size can be much bigger than 21 2-by-2 matrices]

m = rand(2,2,21)
dm = zeros(2,2,21)
b = rand(2,2)
kf(t) = sin(t)*b

comu(a,b) = a*b-b*a

function du(dm,m_in,t)
    c1 = 1
    c2 = 2
    m = [view(m_in,:,:,i) for i in 1:21]
    for k = 1:20
        dm[:,:,k] = c1*m[k]+c2*comu(m[k],m[k+1])
        for i = 1:k
            dm[:,:,k] -= comu(kf(t)*m[i],m[k-i+1])
        end
    end
end

and @time du(dm,m,1.0) returns:

0.000174 seconds (2.25 k allocations: 185.516 KiB)

for the 2nd run (after the jit compile is done).
Is there a way to cut down the number of allocations? I already used view() for the array slice and dm is already pre-allocated. Since this gets called on the order of 1e6 times for the (needed to call du for each time-step, and to get error estimation, etc), reducing the allocation here may be helpful/practical.

Changing to broadcast with

function du(dm,m_in,t)
    c1 = 1
    c2 = 2
    m = [view(m_in,:,:,i) for i in 1:21]
    for k = 1:20
        dm[:,:,k] = c1*m[k] .+ c2*comu(m[k],m[k+1])
        for i = 1:k
            dm[:,:,k] .-= comu(kf(t)*m[i],m[k-i+1])
        end
    end
end

gives similar results,

0.000264 seconds (2.25 k allocations: 178.953 KiB)

[slower, lower memory but about the same amount of allocations]

I can cut down to within 1k allocation with:

m = [rand(2,2) for i in 1:21]
dm = [zeros(2,2) for i in 1:21]

function comu(a,b,x_,y_)
    mul!(x_,a,b)
    mul!(y_,b,a)
    x_ - y_
end

function du(dm,m,t)
    c1 = 1
    c2 = 2
    x = similar(m[1])
    y = similar(m[1])
    for k = 1:20
        dm[k] = c1*m[k] .+ c2*comu(m[k],m[k+1],x,y)
        for i = 1:k
            dm[k] .-= comu(kf(t)*m[i],m[k-i+1],x,y)
        end
    end
end

result:

0.000106 seconds (926 allocations: 81.313 KiB)

Is there anything further that can be done that still preserve readability of the code? Thanks!


#2

This allocates a lot of temporaries. Write the loop or use dot fusion (https://julialang.org/blog/2017/01/moredots).


#3

Or alternatively this seems like a very natural use for StaticArrays, as you’re working with many 2x2 matrices. This actually makes du more readable:

using StaticArrays

kf(t,b) = sin(t)*b

comu(a,b) = a*b-b*a

function du(dm, m, b, t)
   c1 = 1
   c2 = 2
   for k = 1:20
       dm[k] = c1*m[k] + c2*comu(m[k],m[k+1])
       for i = 1:k
           dm[k] -= comu(kf(t,b)*m[i],m[k-i+1])
       end
   end
end

MT = typeof(zero(SMatrix{2,2,Float64}))
m = rand(MT,21)
dm = rand(MT,21)
b = rand(MT)

Then you get

julia> @time du(dm,m,b,1.0)
  0.000020 seconds (4 allocations: 160 bytes)

Note that I had to define kf(t,b) rather than kf(t) here and pass the b through explicitly because your definition of kf looks b up in global scope which completely defeats type inference. (The compiler must assume that the value of b can be any type as a non-const binding can be changed at any time.)


#4

Thanks a lot for your suggestion! Is there:

  • an easy way to turn on/off static array for a whole module (rather than specifying it for all matrices used) depending on the system size? Here 2x2 matrices are used as an example, depending on the problem it might easily go over the 100 elements “rule of thumb” listed here. [just noticed you are actually one of the authors of StaticArrays, thanks for the nice package!]
  • if one were to copy a static array to a conventional array, shall I copy via a loop by the linear indexing?
  • for problems where all matrices are guaranteed to not change size and type, what other general approach can one use to reduce the heap allocations?

btw, in my actual code b is a const known matrix so kf(t) is type-stable as reported by @code_warntype


#5

Well it’s easy to switch to non-static arrays by just passing m and dm as a Vector{Matrix{...}} to du, but this will have all the allocation problems you originally were trying to avoid. Generally we don’t have a great solution to this yet.

  • When passing SArray around you generally work in a functional style and don’t worry about copies or allocation as the compiler will remove all that and generate quite good code (at least for small sizes). This style is very mathematically natural (eg, you can express comu exactly how you’d like and not pay any cost).
  • All this changes when working with large arrays where you’ve got to write careful in-place algorithms for memory efficiency and the style ends up much more imperative and stateful. I think most people just end up writing two versions of everything…

No, just use slice assignment:

m = rand(2,2,21)
sm = @SMatrix [1 2; 3 4]
m[:,:,1] = sm

#6

For your third version, why not make comu save the output in one of x or y to avoid a (possible) temporary? Also, why not pass a parameter vector into du that stores preallocated arrays, i.e. x=p[1], y=p[2]? Finally, make sure you are dotting all the operations on the dm[:,:,k] lines. You’re missing a .= in the first… Those changes get rid of all allocations for me.


#7

Thanks! using something like

function comuxy(a,b,x_,y_)
    mul!(x_,a,b)
    mul!(y_,b,a)
    x_  .-=  y_
end

and later when I got the x, y from p[1]/p[2] passed to du, I can define a new function

comu(a,b) = comuxy(a,b,x,y)

which retains the readability/cleanness and take advantage of the preallocation. The alloc now has gone down to 4, which is really impressive. Thanks a lot!


#8

Happy that helped. Not sure why you are still seeing some allocations – I thought the changes I suggested got it down to zero on my computer (at least as reported by btime).


#9

Thanks! yes, @btime reports 0, I was doing @time in global/jupyter notebook so that might be the case. I guess for now, the solution is to use static arrays by Chris_Foster and enjoy the cleaner syntax for smaller systems, then write another version that one manually allocates arrays for larger systems.


#10

This notebook by @ChrisRackauckas has a nice tutorial on optimizing for diff eq solvers:

The basic advice is essentially what you say; static arrays for small systems and pre-allocating arrays for large systems.


#11

Well, I feel like I should point out that StaticArrays is the work of a lot of people: https://github.com/JuliaArrays/StaticArrays.jl/graphs/contributors. Andy Ferris is the main author and maintainer; I help him out now and then as do several other people.