Matrix-Free Differentiation

Hello all,

I am an Applied Mathematics PhD student at CU Boulder, and I want to implement fast matrix-free finite-difference methods in Julia. This appeared at first glance to be quite a simple task, but when I compare my method to simply forming a sparse matrix, the matrix-free method is slower.

I believe that I have followed the advice laid out in the Multidimensional algorithms and iteration post that @tim.holy put together. I am a bit stumped at this point and think that I am missing something obvious.

I am running Julia 1.8 on an Intel MacBook, if that is helpful.

using SparseArrays, LinearAlgebra, BenchmarkTools
function D_matFree(u::AbstractArray{T,N}, h::Real, order::Real, dim::Int) where {T<:Real,N}
    du = similar(u)
    if order == 0
        du .= u
        return du
    elseif order == 1 
        fwd =  1/ h * [-11/6, 3, -3/2, 1/3]	# 3rd order
        # fwd = 1/ h * [3/2,-2, 1/2] # 2nd order
        cnt = 1 / h * [-1/2, 0, 1/2]
    elseif order == 2 
        fwd = 1/h^2 * [35/12, -26/3, 19/2, -14/3, 11/12] # 3rd order
        # fwd = 1/h^2 * [2,-5,4,-1] # 2nd order
        cnt = 1/h^2 * [1, -2, 1]
    elseif order == 3 
        fwd = 1/h^3 *[-17/4,71/4,-59/2,49/2,-41/4,7/4] # 3rd order
        # fwd = 1/h^3 * [-5/2,9,-12,7,-3/2] # 2nd order  
        cnt = 1/h^3 * [-1/2, 1, 0, -1, 1/2]
    else 
        @error "Not implemented for order $order"
    end
    bwd = mod(order,2) == 0 ? fwd : -fwd

    sz = size(u)
    @assert all(size(du) .== sz)

    J = length(fwd)
    bnd = length(cnt) ÷ 2
    left = CartesianIndices(sz[1:dim-1])
    right = CartesianIndices(sz[dim+1:N])
    @inbounds for i in 1:sz[dim], l in left, r in right
        if i <= bnd
             du[l,i,r] = dot((@view u[l,i:i+J-1,r]), fwd)
        elseif sz[dim]-bnd <= i 
             du[l,i,r] = dot((@view u[l,i:-1:i-J+1,r]), bwd)
        else 
             du[l,i,r] = dot((@view u[l,i-bnd:i+bnd,r]), cnt)
        end
    end
    du
end

function D_sparse(u::AbstractArray{T,N}, h::Real, order::Real, dim::Int) where {T<:Real,N}
    du = similar(u)
    if order == 0
        du .= u
        return du
    elseif order == 1 
        fwd =  1/ h * [-11/6, 3, -3/2, 1/3]	# 3rd order
        # fwd = 1/ h * [3/2,-2, 1/2] # 2nd order
        cnt = 1 / h * [-1/2, 0, 1/2]
    elseif order == 2 
        fwd = 1/h^2 * [35/12, -26/3, 19/2, -14/3, 11/12] # 3rd order
        # fwd = 1/h^2 * [2,-5,4,-1] # 2nd order
        cnt = 1/h^2 * [1, -2, 1]
    elseif order == 3 
        fwd = 1/h^3 *[-17/4,71/4,-59/2,49/2,-41/4,7/4] # 3rd order
        # fwd = 1/h^3 * [-5/2,9,-12,7,-3/2] # 2nd order  
        cnt = 1/h^3 * [-1/2, 1, 0, -1, 1/2]
    else 
        @error "Not implemented for order $order"
    end
    bwd = mod(order,2) == 0 ? fwd : -fwd
    sz = size(u)
    n = sz[dim]
    I = Int[]
    J = Int[]
    val = Float64[]
    bnd = length(cnt) ÷ 2
    for i = 1:n
        if i <= bnd
            k = length(fwd)
            I = vcat(I, fill(i, k))   # fill keeps the eltype Int (i*ones(k) is Float64)
            J = vcat(J, i:i+k-1)
            val = vcat(val, fwd)
        elseif n-bnd <= i
            k = length(bwd)
            I = vcat(I, fill(i, k))
            J = vcat(J, i:-1:i-k+1)
            val = vcat(val, bwd)
        else
            k = length(cnt)
            k2 = k ÷ 2
            I = vcat(I, fill(i, k))
            J = vcat(J, i-k2:i+k2)
            val = vcat(val, cnt)
        end
        ix = findall(1 .> J .|| n .< J)
        @assert isempty(ix) "i = $i, ix = $(J[ix])"
    end
    DD = sparse(I,J,val,n,n)
    
    left = CartesianIndices(sz[1:dim-1])
    right = CartesianIndices(sz[dim+1:N])
    for l in left, r in right 
        du[l,:,r] = DD * @view u[l,:,r]   # view avoids copying the slice
    end 
    return du
end
##
u = rand(1000, 5000)
h = 0.1
order = 1 
dim = 1
@info "Matrix Free"
@btime D_matFree($u, $h, $order, $dim)
@info "Sparse Implementation"
@btime D_sparse($u, $h, $order, $dim);
du1 = D_matFree(u, h, order, dim)   # note: assignments inside @btime do not escape its scope
du2 = D_sparse(u, h, order, dim)
relErr = norm(du1-du2)/norm(du1)
@info "Relative Error $relErr"
nothing


A couple of things are going on here, but above all you need to carefully go line by line and eliminate all allocations in your core differentiation functions. Take a look at allocation tracking to figure out your worst culprits. Concretely (see the sketch below):

  • You should know how much space you’ll need for each of your temporaries and output arrays, so allocate those ahead of time and pass them in as arguments.
  • Your computational kernel should specialize on the dimensionality and order of your operators, and that specialization will be made easier if you use static arrays for your operators and pass those in as arguments as well.
  • After that’s out of the way, you’ll probably want to restructure your @inbounds loop into three separate loops matching each case of the if statement, to avoid having a branch in your inner loop.
  • For the sparse implementation, if you can’t preallocate, use append! or push! instead of repeatedly vcat-ing.
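
For example, here is a minimal sketch of that pattern in 1D (hypothetical helper, not your code): the output is preallocated by the caller, the stencils are static arrays passed as arguments, and the boundary/interior cases are split into separate loops so the hot loop carries no branch.

using StaticArrays

function diff1!(du::AbstractVector, u::AbstractVector,
                fwd::SVector{3}, cnt::SVector{3})
    n = length(u)
    du[1] = fwd[1]*u[1] + fwd[2]*u[2] + fwd[3]*u[3]         # forward boundary
    @inbounds @simd for i in 2:n-1                          # branch-free interior
        du[i] = cnt[1]*u[i-1] + cnt[2]*u[i] + cnt[3]*u[i+1]
    end
    du[n] = -(fwd[1]*u[n] + fwd[2]*u[n-1] + fwd[3]*u[n-2])  # backward boundary
    return du
end

h = 0.1
fwd = SA[-3/2, 2.0, -1/2] / h   # 2nd-order one-sided first-derivative stencil
cnt = SA[-1/2, 0.0, 1/2] / h    # 2nd-order central stencil
u = rand(1000); du = similar(u)
diff1!(du, u, fwd, cnt)         # du is reused across calls; no per-call allocation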


Alternatively, see this post for an example of a fast dimension-agnostic matrix-free finite-difference stencil: Seemingly unnecessary allocation within for loop - #8 by stevengj

Note that, as mentioned in this post, you are typically better off using ghost cells / padding to implement your boundary conditions than putting if statements inside critical inner loops. And for something performance-critical like the difference operation, you really benefit from hard-coding/unrolling it.
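
For illustration, here is a rough sketch of the ghost-cell idea in 1D, assuming periodic boundary conditions (the helper name is made up): pad by the stencil half-width so one branch-free loop covers every point.

function diff_padded(u::AbstractVector, h::Real)
    n = length(u)
    up = Vector{Float64}(undef, n + 2)   # padded copy with one ghost cell per side
    up[2:n+1] .= u
    up[1] = u[n]                         # left ghost cell (periodic wrap)
    up[n+2] = u[1]                       # right ghost cell
    du = similar(u, Float64)
    @inbounds @simd for i in 1:n         # no boundary branch in the hot loop
        du[i] = (up[i+2] - up[i]) / (2h)
    end
    return du
end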

See also some related discussions:


This pattern and the subsequent indexing will be type-unstable because the 1:dim-1 cannot be resolved at compile time. One option would be to change the dim::Int input to ::Val{dim} to lift it into compile time, but that would still require good constant propagation (maybe won’t happen) or for you to rewrite sz[1:dim-1] as something like ntuple(i->sz[i], Val{dim-1}()) so that the length can be inferred at compile time.
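
A small sketch of that idea (hypothetical helper; assumes dim is a compile-time constant at the call site):

# Lift dim into the type domain so the tuple length is inferable.
function leading_sizes(u::AbstractArray, ::Val{dim}) where {dim}
    sz = size(u)
    return ntuple(i -> sz[i], Val(dim - 1))   # length fixed at compile time
end

leading_sizes(rand(3, 4, 5), Val(2))   # (3,), with a concretely inferred type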

One useful tip, if you make the suggested change to StaticArrays, is to use SVectors or SizedVectors of indices. For example, u[l, SizedVector{J}(i:i+J-1), r] should return a SVector equivalent to u[l, i:i+J-1, r] and will be type-stable if J is a compile-time constant (like if it’s the length of a known SVector). You might want to lift the fwd and cnt definitions into a higher function, or make order a Val argument, so that within the function containing this loop J can be known by the compiler.
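
For example (a sketch; the exact return type may depend on your StaticArrays version):

using StaticArrays

u = rand(10)
i, J = 3, 4
ix = SizedVector{J}(i:i+J-1)   # fixed-length wrapper around the index range
w = u[ix]                      # static-size gather; inferable when J is a compile-time constant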

What do you mean by that?

See e.g. Arrays with periodic boundaries - #4 by stevengj


Follow Up

Thanks for all the help, everyone. I made some updates to my code that made it work much better!

My main takeaways are:

  • Break functions into an outer general function and an inner specialized function. This allows the compiler to optimize the inner function based on the size/type of my inputs.
  • Use StaticArrays when the size of the array is known
  • Building Tuples with ntuple(::Function, ::Val) lets us keep minimizing allocations when creating lists of indices
using SparseArrays, LinearAlgebra, Coverage, StaticArrays, BenchmarkTools
## 

function getIx(h::Real, order::Int)
    if order == 1 
        fwd = 1/h * SA[-11/6, 3, -3/2, 1/3]
        cnt = 1/h * SA[-1/2, 0, 1/2]
    elseif order == 2 
        fwd = 1/h^2 * SA[35/12, -26/3, 19/2, -14/3, 11/12]
        cnt = 1/h^2 * SA[1, -2, 1]
    elseif order == 3 
        fwd = 1/h^3 * SA[-17/4,71/4,-59/2,49/2,-41/4,7/4]
        cnt = 1/h^3 * SA[-1/2, 1, 0, -1, 1/2]
    else 
        @error "Not implemented for order $order"
    end
    bwd = mod(order,2) == 0 ? fwd : -fwd 
    return (fwd, cnt, bwd)
end
abstract type DiffMethod end
struct SparseImpl <: DiffMethod end 
struct MatFreeImpl <: DiffMethod end 
function D(u::AbstractArray{T,N}, h::Real, order::Int, dim::Int, meth::DiffMethod=MatFreeImpl()) where {T<:Real,N}
    du = similar(u)
    if order == 0
        du .= u
        return du
    end
    fwd, cnt, bwd = getIx(h,order)
    sz = size(u)
    D = sz[dim]
    leftIx = CartesianIndices(ntuple(i->sz[i], Val(dim-1)))
    rightIx = CartesianIndices(ntuple(i->sz[dim+i], Val(N-dim)))
    if isa(meth, MatFreeImpl)
        _D_matFree!(
            du, u, 
            D, cnt, fwd, bwd,
            leftIx, rightIx
        )
    elseif isa(meth, SparseImpl)
        _D_sparse!(
            du, u, 
            D, cnt, fwd, bwd,
            leftIx, rightIx
        )
    else 
        @error "Not Implemented"
    end
    return du
end
##

function _D_matFree!(
    du::AbstractArray{T,N}, u::AbstractArray{T,N}, D::Int,
    cnt::SVector{K,T}, fwd::SVector{J,T}, bwd::SVector{J,T},
    leftIx::CartesianIndices, rightIx::CartesianIndices
) where {T<:Real,N,J,K}
    bnd = K ÷ 2
    @assert all(size(du) .== size(u)) "Input and output must be of the same size"

    @inbounds for l in leftIx,r in rightIx
        for i in 1:bnd 
            ix = SVector(ntuple(x->i-1+x, Val(J)))
            du[l,i,r] = dot(fwd, @view u[l,ix,r])
        end
        for i = bnd+1:D-bnd-1 
            ix = SVector(ntuple(x->i-bnd+x-1, Val(K)))
            du[l,i,r] = dot(cnt, @view u[l,ix,r])
        end
        for i = D-bnd:D   # start at D-bnd (not D-bnd+1) so that row is not skipped, matching the sparse version
            ix = SVector(ntuple(x->i-x+1, Val(J)))
            du[l,i,r] = dot(bwd, @view u[l,ix,r])
        end
    end
    nothing
end

function _D_sparse!(
    du::AbstractArray{T,N}, u::AbstractArray{T,N}, D::Int,
    cnt::SVector{K,T}, fwd::SVector{J,T}, bwd::SVector{J,T},
    leftIx::CartesianIndices, rightIx::CartesianIndices
) where {T<:Real,N,J,K}
    # preallocating these index buffers would help with speed
    rowIx = Int[]
    colIx = Int[]
    nzVal = Float64[]
    bnd = K ÷ 2   # length(cnt) == K
    for i = 1:D
        if i <= bnd
            k = length(fwd)
            append!(rowIx, fill(i, k))   # fill keeps the eltype Int
            append!(colIx, i:i+k-1)
            append!(nzVal, fwd)
        elseif D-bnd <= i
            k = length(bwd)
            append!(rowIx, fill(i, k))
            append!(colIx, i:-1:i-k+1)
            append!(nzVal, bwd)
        else
            k = length(cnt)
            k2 = k ÷ 2
            append!(rowIx, fill(i, k))
            append!(colIx, i-k2:i+k2)
            append!(nzVal, cnt)
        end
    end
    end
    ## build the sparse matrix
    DD = sparse(rowIx,colIx,nzVal,D,D)
    ## Apply it 
    for l in leftIx, r in rightIx
        du[l,:,r] = DD * @view u[l,:,r]
    end 
    nothing
end
## Define a helper function to test timing
function testImpl(u, h, order, dim)
    @info "==================================="
    @info "Testing dim = $dim, order = $order"
    @info "\tMatrix Free"
    @btime D(u, h, order, dim, MatFreeImpl())
    @info "\tSparse Implementation"
    @btime D(u, h, order, dim, SparseImpl());
    du1 = D(u, h, order, dim, MatFreeImpl())
    du2 = D(u, h, order, dim, SparseImpl());
    relErr = norm(du1-du2)/norm(du1)
    @info "\tRelative Error $relErr"
    nothing 
end
## Define a test problem 
u = rand(1000, 100, 2000)
h = 0.1
for order = 1:3
    for dim = 1:length(size(u))
        testImpl(u, h, order, dim)
    end 
end
nothing

Results for the first-order finite differences:

Caveat: I know that I did not fully optimize the sparse implementation, because I did not preallocate the rowIx, colIx, nzVal vectors, but I expect that the matrix-free method would still have an edge in that case.
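
For the record, that preallocation could be as simple as reserving an upper bound of capacity up front, so append! never reallocates (a sketch; the sizes here are placeholders for the D and J inside _D_sparse!):

n, J = 1000, 4                     # placeholders: number of rows, widest stencil
rowIx = sizehint!(Int[], n * J)
colIx = sizehint!(Int[], n * J)
nzVal = sizehint!(Float64[], n * J)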

Explanation of the lack of boundary conditions: I am looking at using these operators to approximate derivatives when building a linear system in the (W)SINDy algorithm (extending SINDy and WSINDy), so I may not know anything about the boundary conditions a priori. Otherwise the discussion of ghost cells and the like would be more pertinent. I worry about stability with this implementation, but I will cross that bridge later.

Questions Remaining

  • @simd, @fastmath: these macros let me make some speed improvements but sometimes caused the answer to change. I know that @fastmath handles subnormals differently, so I have ruled out its use for now. @simd seems promising, but it made such a small difference that I am leaving it off as well, since including it seemed to change the answer. I would like to understand them better; any discussion of when and why I should use these would be helpful.
  • Is there an optimal order for the for loops? I would assume yes, because Julia is column-major. I thought that having the leftIx loop be the innermost would be best, but my initial testing showed otherwise. Any intuition on why this might be?

Note that this Val(dim-1) is type-unstable because dim is a runtime value. To make it compile-time, change the signature to

function D(..., ::Val{dim}, ...) where {T<:Real,N,dim}

and call it with Val(dim) instead of dim, assuming dim is a compile-time constant. That said, this isn’t actually important here, because you introduced a function barrier by passing leftIx, rightIx into _D_matFree!, so the type instability doesn’t propagate. Given that barrier, I expect there’s little to gain from the change. More simply, maybe just change ntuple(i->sz[i], Val(dim-1)) to Tuple(sz[1:dim-1]) since it’s type-unstable anyway (yes, this is basically what you had at the start – sorry). Again, the function barrier is what makes this performant despite the instability.

I was recently reminded in another thread (and edited my suggestion above) that SizedArrays are good for getting static slices too, and are a bit cleaner. You might find SizedVector{J}(i:i+J-1) to be simpler than, and equally performant to, the SVector(ntuple(x->i-1+x, Val(J))) I originally suggested.


Those aside, let’s get to your actual questions:

@simd effectively applies @fastmath up through v1.9, but will be much narrower as of v1.10. It is mostly useful when writing a loop where you are accumulating a sum or product of floats and don’t care exactly what order they’re accumulated in (which is pretty common). Because float math is non-associative, the compiler is not allowed to reorder the loop (even if reordering would be faster or “more accurate”) without @simd or @fastmath (and @simd may have a little extra secret sauce to help it happen). In particular, this can sometimes enable SIMD instructions, which can make loops over relatively simple computations several times faster. For example, SIMD instructions are responsible for roughly 4x of the performance difference in this comparison (it would be more with Float32, and the rest is probably loop unrolling):

julia> using BenchmarkTools

julia> x = rand(2^10);

julia> @btime foldl(+,$x) # no SIMD or unrolling
  883.019 ns (0 allocations: 0 bytes)
526.3199632799957

julia> @btime sum($x) # uses SIMD and probably unrolling
  70.256 ns (0 allocations: 0 bytes)
526.3199632799952

@fastmath is a spicier beast, and I would recommend using it sparingly and on the smallest code blocks possible. Most of the optimizations it provides beyond what v1.10’s @simd does (which only enables the reassoc and contract flags) are either “easy” to apply manually (like contract, via Julia’s muladd, which @simd also applies, and arcp) or dangerous to mess up (nnan or ninf, like in this example). In general, @fastmath is too dangerous for my taste, and I can usually get most of the benefit (except @simd’s) by understanding the rules it breaks and restructuring my code so that it doesn’t need to break them. On occasion, I may apply it to a few critical lines where I’m confident it can’t mess anything up. I’ll direct you to some further reading.
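
For instance, the contract optimization can be requested explicitly with muladd, without opting in to the rest of @fastmath:

julia> 2.0 * 3.0 + 1.0         # two separately-rounded operations
7.0

julia> muladd(2.0, 3.0, 1.0)   # allowed to compile to one fused multiply-add
7.0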

You’re right to question loop order. Julia is column-major so you want to iterate leading indices first because it helps cache performance. I sometimes forget how same-line multiple-for loops work, so I occasionally have to test it like so:

julia> for j in 1:2, i in 1:3; @show i,j; end
(i, j) = (1, 1)
(i, j) = (2, 1)
(i, j) = (3, 1) # first value is changing quickest - this is best for indexing
(i, j) = (1, 2)
(i, j) = (2, 2)
(i, j) = (3, 2)

So you want the innermost loop (the rightmost in a same-line multiple-for) to run over your first index. You may therefore see some performance benefit from reversing your loop to for r in rightIx, l in leftIx, although the fact that you’re still inner-looping over middle indices (e.g. in for i in 1:bnd) may lose you some of this. You’d probably do somewhat better to move your for l loop inside of for i, like

for r in rightIx
    for i in 1:bnd 
        for l in leftIx
            ix = SizedVector{J}(i:i+J-1)
            du[l,i,r] = dot(fwd, u[l,ix,r]) # note: I removed the `@view` here because it probably makes the `SVector` worse
        end
    end
    for i in bnd+1:D-bnd-1
        # ...
    end
    # ...
end

If you were fighting for every last drop of performance, removing those dot calls and writing those calculations manually in loop-optimal order might improve performance a bit yet, but that’s getting to be a bit of a hassle. I vaguely recall that Tullio.jl might be good at expressing this whole calculation cleanly and optimizing it well, but I haven’t used it so can’t say for sure.
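
For completeness, a sketch of what that manual version might look like for the interior points (a hypothetical 3-d kernel following _D_matFree!’s index convention; not tuned):

function interior!(du::AbstractArray{T,3}, u::AbstractArray{T,3},
                   cnt::NTuple{K,T}, bnd::Int) where {T,K}
    D = size(u, 2)
    for r in axes(u, 3), i in bnd+1:D-bnd-1
        @inbounds for l in axes(u, 1)   # first index innermost
            s = zero(T)
            for k in 1:K                # K is a compile-time constant, so this can unroll
                s = muladd(cnt[k], u[l, i + k - bnd - 1, r], s)
            end
            du[l, i, r] = s
        end
    end
    return du
end

u = rand(8, 32, 8); du = similar(u)
interior!(du, u, (-1/2, 0.0, 1/2) ./ 0.1, 1)   # central first-derivative stencil, h = 0.1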
