Efficient finite difference operators

Hey there,

I have a model that spends most of its time in functions like this one:

function Gux!(dudx::Matrix{Numtype},u::Matrix{Numtype})

    dudx[1:end-1,:] = u[2:end,:]-u[1:end-1,:]   # forward difference in x
    dudx[end,:] = u[1,:]-u[end,:]               # periodic wrap-around for the last row

end

that will be evaluated over and over again with changing input matrices u. Preallocating the result dudx and reusing it already gives a speed advantage. I am currently testing a couple of other ways to write essentially the same operation, but I was wondering whether there are any performance tips that are obvious to someone with more insight into Julia:

  1. Does the order of the lines inside the function matter (regarding reading from and writing to memory)?
  2. Is Matrix{Numtype}, where Numtype could be Float32, Float64, BigFloat etc., the right way to give the compiler enough information about what input arguments to expect?
  3. The size of u will not change throughout a computation; would it be advantageous to also pass on that information?
  4. Is the matrix slicing I use for writing, which I find more convenient than loops, a bottleneck? I am aware of column-major order, for instance, but I am not sure whether writing with matrix slices always respects it.

Thanks for any hints!


You could try the following:

julia> function Gux2!(dudx::Matrix{T},u::Matrix{T}) where T
           @views dudx[1:end-1,:] .= u[2:end,:] .- u[1:end-1,:]
           @views dudx[end,:] .= u[1,:] .- u[end,:]
           dudx
       end
julia> u = rand(100,100);

julia> du = zero(u);

julia> @btime Gux!($du, $u);
  26.539 μs (18 allocations: 235.17 KiB)

julia> @btime Gux2!($du, $u);
  3.274 μs (2 allocations: 112 bytes)
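
For context on why Gux2! is so much faster: a plain slice like u[2:end,:] allocates a copy, @views turns the slices into lazy views of the same memory, and .= writes into the preallocated dudx in place, so the whole line fuses into a single allocation-free loop. A minimal sketch of the difference, reusing u from above:

slice = u[2:end,:]       # plain slicing allocates a new 99×100 Matrix
v = @view u[2:end,:]     # a view references the same memory, no copy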

Don't think of the types you specify in the method signature as "helping the compiler" in terms of performance - which AFAIK it doesn't. Think of it as "restricting this specific method to just the specified subset of types".

For example, the solution I posted above is probably overly restrictive in its signature. As it is written, the method will only actually be called when

  1. both dudx and u are actually a Matrix (so no SubArray etc.), and
  2. both dudx and u have the same eltype (which in this case is probably expected, but not actually necessary for the code to work);

everything else will simply throw a MethodError.

So what would one typically do? Since the code itself assumes that both parameters are matrices, I'd probably just write the signature as Gux!(dudx::AbstractMatrix, u::AbstractMatrix).
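
To illustrate what the relaxed signature buys you, here is a sketch reusing the Gux2! body from above (the view arguments are just an example):

function Gux!(dudx::AbstractMatrix, u::AbstractMatrix)
    @views dudx[1:end-1,:] .= u[2:end,:] .- u[1:end-1,:]
    @views dudx[end,:] .= u[1,:] .- u[end,:]
    dudx
end

# Works on views as well; with ::Matrix in the signature this would
# throw a MethodError, since a SubArray is not a Matrix:
Gux!(view(du, :, 1:50), view(u, :, 1:50))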


Don't think of the types you specify in the method signature as "helping the compiler" in terms of performance - which AFAIK it doesn't.

That's correct.
If you call Gux! with Numtype = Float32, again with Numtype = Float64, and then again with Numtype = BigFloat, the compiler will have created three separate versions of the function – one for when Numtype == Float32, another for when Numtype == Float64, another…

Whenever you call a function with a new combination of input types, Julia creates, under the hood, a version specialized on that combination of input types.
Applying type restrictions just forbids the function from accepting certain things.

Of course, if the algorithm or implementation changes based on input types, then annotations are useful.
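
To make the specialization behavior concrete, a minimal illustration (double and restricted are hypothetical functions, just for demonstration):

double(x) = 2x        # no type annotation at all
double(1.0)           # compiles a specialized Float64 method
double(1.0f0)         # compiles a separate Float32 method
double(big"1.0")      # and another one for BigFloat

restricted(x::Float64) = 2x
restricted(1.0f0)     # MethodError: the annotation only forbids, it doesn't speed anything up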

EDIT:
Also, unfortunately this is still a case where for loops are fastest:

julia> @btime Gux2!($du, $u);
  3.014 μs (6 allocations: 336 bytes)

julia> function Gux3!(dudx::AbstractMatrix,u::AbstractMatrix)
           m, n = size(dudx)
           @boundscheck (m,n) == size(u) || throw(BoundsError())

           @inbounds for i ∈ 1:n
               for j ∈ 1:m-1
                   dudx[j,i] = u[j+1,i]-u[j,i]   # interior: forward difference
               end
               dudx[m,i] = u[1,i] - u[m,i]       # periodic wrap-around
           end
           dudx
       end
Gux3! (generic function with 1 method)

julia> @btime Gux3!($du, $u);
  2.177 μs (0 allocations: 0 bytes)

The difference was smaller on Julia 0.6:

julia> @btime Gux2!($du, $u);
  2.957 μs (2 allocations: 112 bytes)

julia> @btime Gux3!($du, $u);
  2.472 μs (0 allocations: 0 bytes)

(I think the broadcasting regression we see here will be fixed before 0.7 is released.)


Great, thanks for the explanation! Is there any general advice on whether writing the same operation as a loop is beneficial? I know the manual says somewhere that you should only use matrix operations when they feel natural … whatever that means.

Thanks so much guys! That saves me a lot of computing time :wink:

I just want to point out that DiffEqOperators.jl has matrix-free operators that make * or mul! write this loop for you.

https://github.com/JuliaDiffEq/DiffEqOperators.jl/blob/master/docs/HeatEquation.md

I have found the full loop to be better, as seen in this tutorial notebook:

http://nbviewer.jupyter.org/github/JuliaDiffEq/DiffEqTutorials.jl/blob/master/Introduction/OptimizingDiffEqCode.ipynb#Optimizing-Large-Systems

so DiffEqOperators basically tries to make those.
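
Roughly, the idea is this (a hand-rolled sketch of the concept, not the actual DiffEqOperators.jl API): a matrix-free operator is a type whose mul! method runs the stencil loop directly instead of materializing a matrix.

using LinearAlgebra

struct PeriodicDiff end   # hypothetical operator type

function LinearAlgebra.mul!(dudx::AbstractMatrix, ::PeriodicDiff, u::AbstractMatrix)
    m, n = size(u)
    @inbounds for i ∈ 1:n
        for j ∈ 1:m-1
            dudx[j,i] = u[j+1,i] - u[j,i]
        end
        dudx[m,i] = u[1,i] - u[m,i]
    end
    dudx
end

mul!(du, PeriodicDiff(), u)   # same result as Gux3!(du, u)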

Yeah, I've heard of this package; however, I will need a couple of other unusual stencils with unusual data types and without promotion from one type to the other. I.e. something like 0.5*u[i,j] needs to be written as one_half*u[i,j], where the type of one_half corresponds to the element type of u, etc. Thanks for mentioning it though!
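
One way to get such a promotion-free constant generically (a sketch; one_half is the name from the post above):

# Derive the coefficient from the element type of u, so Float32 input
# stays Float32, BigFloat stays BigFloat, and so on.
one_half = one(eltype(u)) / 2   # 0.5f0 for Matrix{Float32}, BigFloat 0.5 for Matrix{BigFloat}
# then use one_half*u[i,j] in the stencil without any promotion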

Personally I think the best advice here is to get into the habit of benchmarking your code to build up your own intuition first hand. Using BenchmarkTools.jl (https://github.com/JuliaCI/BenchmarkTools.jl) has helped me countless times and continues to do so even after years of coding in Julia.
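
For example (using the arrays defined earlier; the $ interpolation keeps global-variable overhead out of the measurement):

using BenchmarkTools

@btime Gux3!($du, $u);       # best-case time plus allocation count
@benchmark Gux3!($du, $u)    # full timing distribution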


Feel free to open an issue on this. We are still actively developing it so it would be great to know the use cases.

The explicit loop is much more readable IMO.

As a follow-up on that topic: once there is a multiplication with a constant involved, the matrix version is 3x faster. Or am I missing some obvious optimization here?

function Ix!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] .= 0.5*(u[j+1,i] .+ u[j,i])
        end
    end
end

function Ix2!(ux::AbstractMatrix,u::AbstractMatrix)
    
    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds @views ux[:,:] .= 0.5*(u[2:end,:] .+ u[1:end-1,:])
end
julia> u = rand(500,500);

julia> ux = zeros(499,500);

julia> @btime Ix!(ux,u);
  2.064 ms (249500 allocations: 11.42 MiB)

julia> @btime Ix2!(ux,u);
  594.146 μs (5 allocations: 3.81 MiB)

You don't need the dot here:

ux[j,i] .=

I then get

julia> @btime Ix!(ux,u);
  80.177 μs (0 allocations: 0 bytes)

Should also use 0.5 .* ...

If the .= and .+ create such an overhead, that is really strange.
Could you show the timing results of all 3 (the two versions written above by @milankl and yours without the dots)?

Could you show the timing for with and without the dots?

I meant on the second version to have it fuse.
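
That is, something like this sketch (hypothetical name; the extra dot on the constant lets the whole right-hand side fuse into one loop without a temporary):

function Ix_fused!(ux::AbstractMatrix, u::AbstractMatrix)
    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())
    @inbounds @views ux .= 0.5 .* (u[2:end,:] .+ u[1:end-1,:])
end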

A loop version, a matrix version, and a loop version with .=

function Ix_loop!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] = 0.5*(u[j+1,i] + u[j,i])
        end
    end
end

function Ix_mat!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds @views ux[:,:] .= 0.5*(u[2:end,:] .+ u[1:end-1,:])
end

function Ix_loop_dot!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] .= 0.5*(u[j+1,i] + u[j,i])
        end
    end
end

and the timings are

julia> @btime Ix_loop!($ux,$u);
  85.346 μs (0 allocations: 0 bytes)

julia> @btime Ix_mat!($ux,$u);
  565.237 μs (5 allocations: 3.81 MiB)

julia> @btime Ix_loop_dot!($ux,$u);
  2.108 ms (249500 allocations: 11.42 MiB)

u and ux are as above.

And probably the worst is if you put dots everywhere:

function Ix_loop_dotdotdot!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] .= 0.5.*(u[j+1,i] .+ u[j,i])
        end
    end
end
julia> @btime Ix_loop_dotdotdot!($ux,$u);
  55.061 ms (1247500 allocations: 26.65 MiB)

Could anyone say why the dots add such an overhead?