# Matrix Free Differentiation

Hello all,

I am an Applied Mathematics PhD student at CU Boulder, and I want to implement fast matrix free finite difference methods using Julia. This appeared at first glance to be quite simple task, but when I compare the method to simply forming a sparse matrix, my method is slower.

I believe that I have followed the advice laid out in the Multidimensional algorithms and iteration post that @tim.holy put together. I am a bit stumped at this point and think that I am missing something obvious.

I am running Julia 1.8 on an Intel MacBook, if that is helpful.

``````using SparseArrays, LinearAlgebra, BenchmarkTools
function D_matFree(u::AbstractArray{T,N}, h::Real, order::Real, dim::Int) where {T<:Real,N}
du = similar(u)
if order == 0
du[:] .= u[:]
return
elseif order == 1
fwd =  1/ h * [-11/6, 3, -3/2, 1/3]	# 3rd order
# fwd = 1/ h * [3/2,-2, 1/2] # 2nd order
cnt = 1 / h * [-1/2, 0, 1/2]
elseif order == 2
fwd = 1/h^2 * [35/12, -26/3, 19/2, -14/3, 11/12] # 3rd order
# fwd = 1/h^2 * [2,-5,4,-1] # 2nd order
cnt = 1/h^2 * [1, -2, 1]
elseif order == 3
fwd = 1/h^3 *[-17/4,71/4,-59/2,49/2,-41/4,7/4] # 3rd order
# fwd = 1/h^3 * [-5/2,9,-12,7,-3/2] # 2nd order
cnt = 1/h^3 * [-1/2, 1, 0, -1, 1/2]
else
@error "Not implemented for order \$order"
end
bwd = mod(order,2) == 0 ? fwd : -fwd

sz = size(u)
@assert all(size(du) .== sz)

J = length(fwd)
bnd = Int(floor(length(cnt)/2))
left = CartesianIndices(sz[1:dim-1])
right = CartesianIndices(sz[dim+1:N])
@inbounds for i in 1:sz[dim], l in left, r in right
if i <= bnd
du[l,i,r] = dot((@view u[l,i:i+J-1,r]), fwd)
elseif sz[dim]-bnd <= i
du[l,i,r] = dot((@view u[l,i:-1:i-J+1,r]), bwd)
else
du[l,i,r] = dot((@view u[l,i-bnd:i+bnd,r]), cnt)
end
end
du
end

function D_sparse(u::AbstractArray{T,N}, h::Real, order::Real, dim::Int) where {T<:Real,N}
du = similar(u)
if order == 0
du[:] .= u[:]
return
elseif order == 1
fwd =  1/ h * [-11/6, 3, -3/2, 1/3]	# 3rd order
# fwd = 1/ h * [3/2,-2, 1/2] # 2nd order
cnt = 1 / h * [-1/2, 0, 1/2]
elseif order == 2
fwd = 1/h^2 * [35/12, -26/3, 19/2, -14/3, 11/12] # 3rd order
# fwd = 1/h^2 * [2,-5,4,-1] # 2nd order
cnt = 1/h^2 * [1, -2, 1]
elseif order == 3
fwd = 1/h^3 *[-17/4,71/4,-59/2,49/2,-41/4,7/4] # 3rd order
# fwd = 1/h^3 * [-5/2,9,-12,7,-3/2] # 2nd order
cnt = 1/h^3 * [-1/2, 1, 0, -1, 1/2]
else
@error "Not implemented for order \$order"
end
bwd = mod(order,2) == 0 ? fwd : -fwd
sz = size(u)
n = sz[dim]
I = Int[]
J = Int[]
val = Float64[]
bnd = Int(floor(length(cnt)/2))
for i = 1:n
if i <= bnd
k = length(fwd)
I = vcat(I, i*ones(k))
J = vcat(J, i:i+k-1)
val = vcat(val, fwd)
elseif n-bnd <= i
k = length(bwd)
I = vcat(I, i*ones(k))
J = vcat(J, i:-1:i-k+1)
val = vcat(val, bwd)
else
k = length(cnt)
k2 = Int(floor(k/2))
I = vcat(I, i*ones(k))
J = vcat(J, i-k2:i+k2)
val = vcat(val, cnt)
end
ix = findall(1 .> J .|| n  .< J )
@assert isempty(ix) "\$i = i, ix = \$(J[ix])"
end
# return J
DD = sparse(I,J,val,n,n)

left = CartesianIndices(sz[1:dim-1])
right = CartesianIndices(sz[dim+1:N])
for l in left, r in right
du[l,:,r] = DD * u[l,:,r]
end
return du
end
##
u = rand(1000, 5000)
h = 0.1
order = 1
dim = 1
@info "Matrix Free"
@btime du1 = D_matFree(u, h, order, dim)
@info "Sparse Implementation"
@btime du2 = D_sparse(u, h, order, dim);
relErr = norm(du1-du2)/norm(du1)
@info "Relative Error \$relErr"
nothing
``````

1 Like

A couple of things going on here, but you need to carefully go line by line and eliminate all allocations in your core differentiation functions. Take a look at allocation tracking to figure out your worst culprits. You should know how much space youâ€™ll need for each of your temporaries and output arrays, so allocate those ahead of time and pass them in as arguments. Your computational kernel should specialize on the dimensionality and order of your operators, and that specialization will be made easier if you use static arrays for your operators, and pass those in as arguments as well. After thatâ€™s out of the way, youâ€™ll probably want to restructure your `@inbounds` loop into three separate loops matching each case of the `if` statement to avoid having a branch in your inner loop. For the sparse implementation, if you canâ€™t preallocate, use `append!` or `push!` instead of repeatedly `vcat`-ing.

3 Likes

Alternatively, see this post for an example of a fast dimension-agnostic matrix-free finite-difference stencil: Seemingly unnecessary allocation within for loop - #8 by stevengj

Note that, as mentioned in this post, you are typically better off using ghost cells / padding to implement your boundary conditions than putting `if` statements inside critical inner loops. And for something performance-critical like the difference operation, you really benefit from hard-coding/unrolling it.

4 Likes

This pattern and the subsequent indexing will be type-unstable because the `1:dim-1` cannot be resolved at compile time. One option would be to change the `dim::Int` input to `::Val{dim}` to lift it into compile time, but that would still require good constant propagation (maybe wonâ€™t happen) or for you to rewrite `sz[1:dim-1]` as something like `ntuple(i->sz[i], Val{dim-1}())` so that the length can be inferred at compile time.

One useful tip, if you make the suggested change to `StaticArrays`, is to use `SVector`s or `SizedVector`s of indices. For example, `u[l, SizedVector{J}(i:i+J-1), r]` should return a `SVector` equivalent to `u[l, i:i+J-1, r]` and will be type-stable if `J` is a compile-time constant (like if itâ€™s the length of a known `SVector`). You might want to lift the `fwd` and `cnt` definitions into a higher function, or make `order` a `Val` argument, so that within the function containing this loop `J` can be known by the compiler.

What do you mean by that?

1 Like

Thanks for all the help everyone. I made some update to my code that made it work much better!

My main takeaways are:

• Break functions into and outer general function and an inner specialized function. This allows the compiler to optimize the inner function based on the size/type of my inputs.
• Use StaticArrays when the size of the array is known
• Building `Tuple`, with `ntupel(::Function, ::Val)` allows for us to continue to minimize allocations with the creation of lists of indices
``````using SparseArrays, LinearAlgebra, Coverage, StaticArrays, BenchmarkTools
##

function getIx(h::Real, order::Int )
if order == 1
fwd = 1/h * SA[-11/6, 3, -3/2, 1/3]
cnt = 1/h * SA[-1/2, 0, 1/2]
elseif order == 2
fwd = 1/h^2 * SA[35/12, -26/3, 19/2, -14/3, 11/12]
cnt = 1/h^2 * SA[1, -2, 1]
elseif order == 3
fwd = 1/h^3 * SA[-17/4,71/4,-59/2,49/2,-41/4,7/4]
cnt = 1/h^3 * SA[-1/2, 1, 0, -1, 1/2]
else
@error "Not implemented for order \$order"
end
bwd = mod(order,2) == 0 ? fwd : -fwd
return (fwd, cnt, bwd)
end
abstract type DiffMethod end
struct SparseImpl <: DiffMethod end
struct MatFreeImpl <: DiffMethod end
function D(u::AbstractArray{T,N}, h::Real, order::Int, dim::Int, meth::DiffMethod=MatFreeImpl()) where {T<:Real,N}
du = similar(u)
if order == 0
du[:] .= u[:]
return
end
fwd, cnt, bwd = getIx(h,order)
sz = size(u)
D = sz[dim]
leftIx = CartesianIndices(ntuple(i->sz[i], Val(dim-1)))
rightIx = CartesianIndices(ntuple(i->sz[dim+i], Val(N-dim)))
if isa(meth, MatFreeImpl)
_D_matFree!(
du, u,
D, cnt, fwd, bwd,
leftIx, rightIx
)
elseif isa(meth, SparseImpl)
_D_sparse!(
du, u,
D, cnt, fwd, bwd,
leftIx, rightIx
)
else
@error "Not Implemented"
end
return du
end
##

function _D_matFree!(
du::AbstractArray{T,N}, u::AbstractArray{T,N}, D::Int,
cnt::SVector{K,T}, fwd::SVector{J,T}, bwd::SVector{J,T},
leftIx::CartesianIndices, rightIx::CartesianIndices
) where {T<:Real,N,J,K}
bnd = Int(floor(K/2))
@assert all(size(du) .== size(u)) "Input and output must be of the same size"

@inbounds for l in leftIx,r in rightIx
for i in 1:bnd
ix = SVector(ntuple(x->i-1+x, Val(J)))
du[l,i,r] = dot(fwd, @view u[l,ix,r])
end
for i = bnd+1:D-bnd-1
ix = SVector(ntuple(x->i-bnd+x-1, Val(K)))
du[l,i,r] = dot(cnt, @view u[l,ix,r])
end
for i = D-bnd+1:D
ix = SVector(ntuple(x->i-x+1, Val(J)))
du[l,i,r] =dot(bwd, @view u[l,ix,r])
end
end
nothing
end

function _D_sparse!(
du::AbstractArray{T,N}, u::AbstractArray{T,N}, D::Int,
cnt::SVector{K,T}, fwd::SVector{J,T}, bwd::SVector{J,T},
leftIx::CartesianIndices, rightIx::CartesianIndices
) where {T<:Real,N,J,K}
# pre allocating here would help with speed
rowIx = Int[]
colIx = Int[]
nzVal = Float64[]
bnd = Int(floor(length(cnt)/2))
for i = 1:D
if i <= bnd
k = length(fwd)
append!(rowIx, i*ones(k))
append!(colIx, i:i+k-1)
append!(nzVal, fwd)
elseif D-bnd <= i
k = length(bwd)
append!(rowIx, i*ones(k))
append!(colIx, i:-1:i-k+1)
append!(nzVal, bwd)
else
k = length(cnt)
k2 = Int(floor(k/2))
append!(rowIx, i*ones(k))
append!(colIx, i-k2:i+k2)
append!(nzVal, cnt)
end
end
## build the spars matrix
DD = sparse(rowIx,colIx,nzVal,D,D)
## Apply it
for l in leftIx, r in rightIx
du[l,:,r] = DD * @view u[l,:,r]
end
nothing
end
## Define Helper Function to test timeing
function testImpl(u, h, order, dim)
@info "==================================="
@info "Testing dim = \$dim, order = \$order"
@info "\tMatrix Free"
@btime D(u, h, order, dim, MatFreeImpl())
@info "\tSparse Implementation"
@btime D(u, h, order, dim, SparseImpl());
du1 = D(u, h, order, dim, MatFreeImpl())
du2 = D(u, h, order, dim, SparseImpl());
relErr = norm(du1-du2)/norm(du1)
@info "\tRelative Error \$relErr"
nothing
end
## Define a test problem
u = rand(1000, 100, 2000)
h = 0.1
for order = 1:3
for dim = 1:length(size(u))
testImpl(u, h, order, dim)
end
end
nothing
``````

Results for the first order finite differences:

Caveat: I know that I did not fully optimize the sparse implementation because I did not preallocate the `rowIx, colIx, nzVal` vectors, but I hope that the matrix free method would still have an edge in this case.

Explanation of the lack of BC. I am looking at using these operators to be approximating derivatives to build a Linear System in the (W)SINDy Algorithm (extending SINDy and WSINDy). So I may not know anything about the BC a priori. Otherwise the discussion of ghost cells and the like would be more pertinent. I worry about stability with this implementation. But I will cross that bridge later.

Questions Remaining

• `@simd, @fastmath` these macro allowed for me to make some improvements on speed but sometimes caused the answer to change. I know that `@fastmath` handles subnormalâ€™s differently, so for now I have rule the use of it out for now. `@simd` seems promising, but make such a small difference, that I am going to leave it off as well for now because including it seemed to change the answer. I would like to understand them better. Any discussion of when and why I should use these would be helpful.
• Is there an optimal order for the for loops? I would assume yes because Julia is column major. I though that having the `leftIx` loop be the inner most would be best, but my initial testing showed otherwise. Any intuition on why this might be?
2 Likes

Note that this `Val(dim-1)` is type-unstable because `dim` is a runtime value. To make to make it compile-time, change the signature to

``````function D(..., ::Val{dim}, ...) where {T<:Real,N,dim}
``````

and call it with `Val(dim)` instead of `dim`, assuming `dim` is a compile-time constant. That said, this isnâ€™t actually important because you introduced a function barrier by passing `leftIx`,`rightIx` into `_D_matFree!` so the type-instability doesnâ€™t propagate. Given this, I expect thereâ€™s little to gain from this. More simply, maybe just change `ntuple(i->sz[i], Val(dim-1))` to `Tuple(sz[1:dim-1])` since itâ€™s type-unstable anyway (yes, this is basically what you had at the start â€“ sorry). Again, the function barrier is what makes this performant despite the instability.

I was recently reminded in another thread (and edited my suggestion above) that `SizedArray` are good for getting static slices too, and are a bit cleaner. You might find `SizedArray{J}(i:J-1)` to be simpler and equally performant to the `SVector(ntuple(x->i-1+x, Val(J)))` I originally suggested.

Those aside, letâ€™s get to your actual questions:

`@simd` effectively applies `@fastmath` up through `v1.9`, but will be much narrower after `v1.10`. It is mostly useful when writing a loop where you are accumulating a sum or product of floats and donâ€™t care exactly what order theyâ€™re accumulated in (which is pretty common). Because float math is non-associative, the compiler is not allowed to re-order the loop (even if it would be faster or â€śmore accurateâ€ť) without `@simd` or `@fastmath` (and `@simd` might have a little extra secret sauce to help it happen). In particular, this can sometimes enable SIMD instructions which can make loops over relatively simple computations several times faster. For example, SIMD instructions is responsible for roughly 4x of the performance difference in this comparison (it would be more with `Float32`, and the rest is probably loop unrolling):

``````julia> using BenchmarkTools

julia> x = rand(2^10);

julia> @btime foldl(+,\$x) # no SIMD or unrolling
883.019 ns (0 allocations: 0 bytes)
526.3199632799957

julia> @btime sum(\$x) # uses SIMD and probably unrolling
70.256 ns (0 allocations: 0 bytes)
526.3199632799952
``````

`@fastmath` is a spicier beast and I would recommend it be used sparingly and on the smallest code blocks possible. Most of the optimizations that it provides beyond what `v1.10`â€™s `@simd` does (which only enables the `reassoc` and `contract` flags) are often â€śeasyâ€ť to make manually (like `contract` (via Juliaâ€™s `muladd`, but also applied via `@simd`) and `arcp`) and/or dangerous to mess up (`nnan` or `ninf`, like in this example). In general, `@fastmath` is too dangerous for my taste and I can usually get most of the benefit (except `@simd`) by understanding the rules it breaks and restructuring my code so that it doesnâ€™t need to. On occasion, I may apply it to a few critical lines where Iâ€™m confident it canâ€™t mess stuff up. Iâ€™ll direct you to some further reading.

Youâ€™re right to question loop order. Julia is column-major so you want to iterate leading indices first because it helps cache performance. I sometimes forget how same-line multiple-`for` loops work, so I occasionally have to test it like so:

``````julia> for j in 1:2, i in 1:3; @show i,j; end
(i, j) = (1, 1)
(i, j) = (2, 1)
(i, j) = (3, 1) # first value is changing quickest - this is best for indexing
(i, j) = (1, 2)
(i, j) = (2, 2)
(i, j) = (3, 2)
``````

So you want the innermost loop (rightmost for same-line multiple-`for`s) to be your first index. So you may see some performance benefit to reversing your for loop to `for r in rightIx, l in leftIx`, although the fact that youâ€™re still inner-looping over middle indices (e.g. in `for i in 1:bnd`) may lose you some of this. Youâ€™d probably do somewhat better to move your `for l` loop inside of `for i` like

``````for r in rightIx
for i in 1:bnd
for l in leftIx
ix = SizedVector{J}(i:i+J-1)
du[l,i,r] = dot(fwd, u[l,ix,r]) # note: I removed the `@view` here because it probably makes the `SVector` worse
end
end
for i in bnd+1:D-bnd-1
# ...
end
# ...
end
``````

If you were fighting for every last drop of performance, removing those `dot` calls and writing those calculations manually in loop-optimal order might improve performance a bit yet, but thatâ€™s getting to be a bit of a hassle. I vaguely recall that `Tullio.jl` might be good at expressing this whole calculation cleanly and optimizing it well, but I havenâ€™t used it so canâ€™t say for sure.

1 Like