Efficient finite difference operators

Hey there,

I have a model that spends most of its time in functions like this one:

function Gux!(dudx::Matrix{Numtype},u::Matrix{Numtype})

    dudx[1:end-1,:] = u[2:end,:]-u[1:end-1,:]   # forward difference in x
    dudx[end,:] = u[1,:]-u[end,:]               # periodic wrap-around for the last row

end

that will be evaluated over and over again with changing input matrices u. Preallocating the result dudx and reusing it already gives a speed advantage. I am currently testing a couple of other ways to write essentially the same operation, but I was wondering whether there are any performance tips that are obvious to someone with more insight into Julia:

  1. Does the order of the lines inside the function matter (regarding reading from and writing to memory)?
  2. Is Matrix{Numtype}, where Numtype could be Float32, Float64, BigFloat etc., the right way to give the compiler enough information about what input arguments to expect?
  3. The size of u will not change throughout a computation; would it be advantageous to also pass on that information?
  4. Is the matrix slicing I use for writing, which I find more convenient than loops, a bottleneck? I am aware of column-major order, for instance, but I am not sure whether writing with matrix slices always respects it.

Thanks for any hints!


You could try the following:

julia> function Gux2!(dudx::Matrix{T},u::Matrix{T}) where T
           @views dudx[1:end-1,:] .= u[2:end,:] .- u[1:end-1,:]
           @views dudx[end,:] .= u[1,:] .- u[end,:]
           dudx
       end
julia> u = rand(100,100);

julia> du = zero(u);

julia> @btime Gux!($du, $u);
  26.539 μs (18 allocations: 235.17 KiB)

julia> @btime Gux2!($du, $u);
  3.274 μs (2 allocations: 112 bytes)
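
For context on why Gux2! is so much faster: a plain slice like u[2:end,:] allocates a copy, @views turns the slices into lazy views of the same memory, and .= writes into the preallocated dudx in place, so the whole line fuses into a single allocation-free loop. A minimal sketch of the difference, reusing u from above:

slice = u[2:end,:]       # plain slicing allocates a new 99×100 Matrix
v = @view u[2:end,:]     # a view references the same memory, no copy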

Don't think of the types you specify in the method signature as "helping the compiler" in terms of performance - which AFAIK it doesn't. Think of it as "restricting this specific method to just the specified subset of types".

For example, the solution I posted above is probably overly restrictive in its signature. As it is written, the method will only actually be called when

  1. both dudx and u are actually a Matrix (so no SubArray etc.), and
  2. both dudx and u have the same eltype (which in this case is probably expected, but not actually necessary for the code to work);

everything else will simply throw a MethodError.

So what would one typically do? Since the code itself assumes that both parameters are matrices, I'd probably just write the signature as Gux!(dudx::AbstractMatrix, u::AbstractMatrix).
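
To illustrate what the relaxed signature buys you, here is a sketch reusing the Gux2! body from above (the view arguments are just an example):

function Gux!(dudx::AbstractMatrix, u::AbstractMatrix)
    @views dudx[1:end-1,:] .= u[2:end,:] .- u[1:end-1,:]
    @views dudx[end,:] .= u[1,:] .- u[end,:]
    dudx
end

# Works on views as well; with ::Matrix in the signature this would
# throw a MethodError, since a SubArray is not a Matrix:
Gux!(view(du, :, 1:50), view(u, :, 1:50))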


Don't think of the types you specify in the method signature as "helping the compiler" in terms of performance - which AFAIK it doesn't.

That's correct.
If you call Gux! with Numtype = Float32, again with Numtype = Float64, and then again with Numtype = BigFloat, the compiler will have created three separate versions of the function – one for when Numtype == Float32, another for when Numtype == Float64, another…

Whenever you call a function with a new combination of input types, Julia creates, under the hood, a version specialized on that combination of input types.
Applying type restrictions just forbids the function from accepting certain things.

Of course, if the algorithm or implementation changes based on input types, then annotations are useful.
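
To make the specialization behavior concrete, a minimal illustration (double and restricted are hypothetical functions, just for demonstration):

double(x) = 2x        # no type annotation at all
double(1.0)           # compiles a specialized Float64 method
double(1.0f0)         # compiles a separate Float32 method
double(big"1.0")      # and another one for BigFloat

restricted(x::Float64) = 2x
restricted(1.0f0)     # MethodError: the annotation only forbids, it doesn't speed anything up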

EDIT:
Also, unfortunately this is still a case where for loops are fastest:

julia> @btime Gux2!($du, $u);
  3.014 μs (6 allocations: 336 bytes)

julia> function Gux3!(dudx::AbstractMatrix,u::AbstractMatrix)
           m, n = size(dudx)
           @boundscheck (m,n) == size(u) || throw(BoundsError())

           @inbounds for i ∈ 1:n
               for j ∈ 1:m-1
                   dudx[j,i] = u[j+1,i]-u[j,i]   # interior: forward difference
               end
               dudx[m,i] = u[1,i] - u[m,i]       # periodic wrap-around
           end
           dudx
       end
Gux3! (generic function with 1 method)

julia> @btime Gux3!($du, $u);
  2.177 μs (0 allocations: 0 bytes)

The difference was smaller on Julia 0.6:

julia> @btime Gux2!($du, $u);
  2.957 μs (2 allocations: 112 bytes)

julia> @btime Gux3!($du, $u);
  2.472 μs (0 allocations: 0 bytes)

(I think the broadcasting regression we see here will be fixed before 0.7 is released.)


Great, thanks for the explanation! Is there any general advice on whether writing the same operation as a loop is beneficial? I know the manual says somewhere that you should only use matrix operations when they feel natural … whatever that means.

Thanks so much guys! That saves me a lot of computing time :wink:

I just want to point out that DiffEqOperators.jl has matrix-free operators that make * or mul! write this loop for you.

https://github.com/JuliaDiffEq/DiffEqOperators.jl/blob/master/docs/HeatEquation.md

I have found the full loop to be better, as seen in this tutorial notebook:

http://nbviewer.jupyter.org/github/JuliaDiffEq/DiffEqTutorials.jl/blob/master/Introduction/OptimizingDiffEqCode.ipynb#Optimizing-Large-Systems

so DiffEqOperators basically tries to make those.
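
Roughly, the idea is this (a hand-rolled sketch of the concept, not the actual DiffEqOperators.jl API): a matrix-free operator is a type whose mul! method runs the stencil loop directly instead of materializing a matrix.

using LinearAlgebra

struct PeriodicDiff end   # hypothetical operator type

function LinearAlgebra.mul!(dudx::AbstractMatrix, ::PeriodicDiff, u::AbstractMatrix)
    m, n = size(u)
    @inbounds for i ∈ 1:n
        for j ∈ 1:m-1
            dudx[j,i] = u[j+1,i] - u[j,i]
        end
        dudx[m,i] = u[1,i] - u[m,i]
    end
    dudx
end

mul!(du, PeriodicDiff(), u)   # same result as Gux3!(du, u)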

Yeah, I've heard of this package; however, I will need a couple of other unusual stencils with unusual data types and without promotion from one type to the other. I.e. something like 0.5*u[i,j] needs to be written as one_half*u[i,j], where the type of one_half corresponds to the element type of u, etc. Thanks for mentioning it though!
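
One way to get such a promotion-free constant generically (a sketch; one_half is the name from the post above):

# Derive the coefficient from the element type of u, so Float32 input
# stays Float32, BigFloat stays BigFloat, and so on.
one_half = one(eltype(u)) / 2   # 0.5f0 for Matrix{Float32}, BigFloat 0.5 for Matrix{BigFloat}
# then use one_half*u[i,j] in the stencil without any promotion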

Personally I think the best advice here is to get into the habit of benchmarking your code to build up your own intuition first hand. Using BenchmarkTools.jl (https://github.com/JuliaCI/BenchmarkTools.jl) has helped me countless times and continues to do so even after years of coding in Julia.
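
For example (using the arrays defined earlier; the $ interpolation keeps global-variable overhead out of the measurement):

using BenchmarkTools

@btime Gux3!($du, $u);       # best-case time plus allocation count
@benchmark Gux3!($du, $u)    # full timing distribution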


Feel free to open an issue on this. We are still actively developing it so it would be great to know the use cases.

The explicit loop is much more readable IMO.

As a follow-up on that topic: once there is a multiplication with a constant involved, the matrix version is 3x faster. Or am I missing some obvious optimization here?

function Ix!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] .= 0.5*(u[j+1,i] .+ u[j,i])
        end
    end
end

function Ix2!(ux::AbstractMatrix,u::AbstractMatrix)
    
    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds @views ux[:,:] .= 0.5*(u[2:end,:] .+ u[1:end-1,:])
end
julia> u = rand(500,500);

julia> ux = zeros(499,500);

julia> @btime Ix!(ux,u);
  2.064 ms (249500 allocations: 11.42 MiB)

julia> @btime Ix2!(ux,u);
  594.146 μs (5 allocations: 3.81 MiB)

You don't need the dot here:

ux[j,i] .=

I then get

julia> @btime Ix!(ux,u);
  80.177 μs (0 allocations: 0 bytes)

Should also use 0.5 .* ...

If the .= and .+ create such an overhead, that is really strange.
Could you show the timing results of all 3 (the two versions written above by @milankl and yours without the dots)?

Could you show the timing for with and without the dots?

I meant on the second version to have it fuse.
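
That is, something like this sketch (hypothetical name; the extra dot on the constant lets the whole right-hand side fuse into one loop without a temporary):

function Ix_fused!(ux::AbstractMatrix, u::AbstractMatrix)
    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())
    @inbounds @views ux .= 0.5 .* (u[2:end,:] .+ u[1:end-1,:])
end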

A loop version, a matrix version, and a loop version with .=

function Ix_loop!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] = 0.5*(u[j+1,i] + u[j,i])
        end
    end
end

function Ix_mat!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds @views ux[:,:] .= 0.5*(u[2:end,:] .+ u[1:end-1,:])
end

function Ix_loop_dot!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] .= 0.5*(u[j+1,i] + u[j,i])
        end
    end
end

and the timings are

julia> @btime Ix_loop!($ux,$u);
  85.346 μs (0 allocations: 0 bytes)

julia> @btime Ix_mat!($ux,$u);
  565.237 μs (5 allocations: 3.81 MiB)

julia> @btime Ix_loop_dot!($ux,$u);
  2.108 ms (249500 allocations: 11.42 MiB)

u and ux are as above.

And probably the worst is if you put dots everywhere:

function Ix_loop_dotdotdot!(ux::AbstractMatrix,u::AbstractMatrix)

    m, n = size(ux)
    @boundscheck (m+1,n) == size(u) || throw(BoundsError())

    @inbounds for i ∈ 1:n
        for j ∈ 1:m
            ux[j,i] .= 0.5.*(u[j+1,i] .+ u[j,i])
        end
    end
end
julia> @btime Ix_loop_dotdotdot!($ux,$u);
  55.061 ms (1247500 allocations: 26.65 MiB)

Could anyone say why the dots add such an overhead?