Understanding Julia performance in simple finite difference code


#1

I gave myself an exercise to help learn several aspects of Julia, including coding style and performance issues. The task was to translate a C code for a very simple 1D, FD implementation for acoustic waves. At the end of this post, I copy in the entire key section of the code for reference.
To run it, I used @time and @timed as:

@timed include("myprogname.jl")

My first effort was to implement the differencing using, e.g.,

 (v-circshift(v,1))

and I compared to a brute force loop approach that copies the original C code:

    for iv in 1:(length(dv)-1)
        dv[iv] = p[iv+1]-p[iv]
    end

The questions are:

  1. The second form, with the for loops, takes about 40 sec, but the form with (v-circshift(v,1)) is about 2.5 sec. I expected the first version to be faster, but I am surprised by the factor of ~15 - can someone explain why?

  2. @timed indicates that the first form has allocations of about 5GB and the second 20GB (which might be related to speed). I interpret this as memory allocations, but I don’t understand the large amounts. The total array sizes should be a few hundred KB at most.

  3. I tried it with Julia 0.7 (code version 1 speeds up to ~2 sec, yay!). I noticed that 0.7 really really wants me to put the global qualifier in front of relevant vectors in the for loops. I did that, but I’m not quite sure why it is important (minor quibble: it distracts from code readability, at least for this learner).

Any comments would be appreciated.

c = fill(2500.0,nNodes)
ρ = fill(1000.0,nNodes)
dp = zeros(nNodes)
p = zeros(nNodes)
v = zeros(nNodes)

pStepWeight = (ρ.*c.*c)*dt/dz

for iTime in 1:nTimeSteps
    global dp = (v-circshift(v,1)).*pStepWeight
    global p += dp
    p[1] = 0.0

    p[iSource] += source(iTime*dt, dt, dz, t0, α)

    global dv = (circshift(p,(-1))-p)./ρMean*dt/dz
    global v += dv
    v[nNodes] = (ρMean[nNodes]*dz - dt*ρ[nNodes]*c[nNodes])*v[nNodes]-2*dt*p[nNodes]
    v[nNodes] /= (ρMean[nNodes]*dz+dt*ρ[nNodes]*c[nNodes])
end


#2

It is MUCH better not to use globals. Try putting the logic into a function. Look up @code_warntype in this forum and in the Julia documentation. If you apply it to your code and see an Any type, your code is spending time in judging the suitability of operations and variables each time that line is hit: a big performance problem.


#3

The correct expectation for Julia code is that the loop should be faster if performance traps are avoided. In this case the problem is the use of global variables, recommended reading is
https://docs.julialang.org/en/stable/manual/performance-tips/

Those are repeated temporary allocations. The cause here is also the use of global variables and will solve itself when you put the code in a function.

This problem will also go away when you put the code into a function. And you really, really want to remove those global declarations when you do.


#4

If you’re interested in exploring performance in Julia, the first thing to do is to read through the performance tips from the manual.

Here are a few suggestions based on those tips:

Provide a reproducible example

People here actually really like performance-engineering of Julia code for fun, but it’s harder to help if we can’t participate. Ideally, we should be able to just copy and paste the code from your post and run it, but in your case we’re missing at least the definitions of nNodes and source(). If you provide something that can be copy-pasted and run, you’ll get better help.

What to time

Timing include() is not very representative. When you do that, you’re timing:

  • reading in a text file
  • parsing it into Julia code
  • compiling that code into functions
  • actually running your algorithm

and you have no way of knowing how much of the time is spent in each part. Instead, put the algorithm you want to test into a function, and time calling that function. Better, use https://github.com/JuliaCI/BenchmarkTools.jl to time that function multiple times and extract a statistically meaningful estimate of its performance.

Globals

The very first performance tip from the manual is “Avoid global variables”, and we mean it. The type of a global variable can change at any point, so every piece of code that uses it must be able to handle its type changing at any time. That means that Julia can’t generate fast, specialized native code because the types of all of your variables can’t be cleanly inferred.

Instead, just put your code in a function. Anything that function needs from outside should be an argument to the function. This will make a huge performance difference in Julia, and it also has the benefit of being pretty universally regarded as better than having lots of non-constant global variables.

Something like:

function do_the_thing(nNode, c, rho, nTimeSteps)
  dp = ...
  p = ...
  v = ...
  for i in 1:nTimeSteps  
    update_p_and_v!(p, v, dp, dv, c, rho, ....)
  end
end

and then when you benchmark, do:

using BenchmarkTools
@benchmark do_the_thing($nNode, $c, $rho, $nTimeSteps)

Avoiding unnecessary memory allocation

The second performance tip mentions paying attention to memory allocation. Your code is allocating a lot of unnecessary memory because every time you do something like p += dp, you are really doing:

  • compute p + dp, creating a brand new array
  • store that new array in the variable p

That’s also why you’re getting warnings in v0.7. Every time you do p += dp, you’re trying to overwrite the global variable p. That’s doubly slow, because you’ve created a new array and you’re accessing a global variable.

Fortunately, the fix basically couldn’t be simpler:

p .= p .+ dp

By putting dots everywhere, you get the magic of Julia’s broadcast loop fusion which guarantees that, rather than creating a copy, it will simply update the values of p in-place.


#5

Oh, but to your actual point, the for-loop version will probably be faster because it allows you to operate on the elements of dv in-place, rather than creating a brand new array.

This is much easier to show with a simpler benchmark:

julia> x = collect(1:1000);

julia> dv = similar(x);

julia> function f1!(dv, x)
         dv = x - circshift(x, 1)
       end
f1! (generic function with 1 method)

julia> function f2!(dv, x)
         for i in 1:(length(x) - 1)
           dv[i] = x[i + 1] - x[i]
         end
         dv[end] = x[1] - x[end]
       end
f2! (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark f1!($dv, $x)
BenchmarkTools.Trial: 
  memory estimate:  15.88 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.120 μs (0.00% GC)
  median time:      1.495 μs (0.00% GC)
  mean time:        1.743 μs (11.54% GC)
  maximum time:     59.995 μs (87.90% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark f2!($dv, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     780.615 ns (0.00% GC)
  median time:      781.019 ns (0.00% GC)
  mean time:        826.681 ns (0.00% GC)
  maximum time:     2.470 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     104

Edit: actually, my above version of f1!() was sub-optimal, because I could have done the same dot-fusion trick I suggested above. Here’s a better implementation:


julia> function f1!(dv, x)
         dv .= x .- circshift(x, 1)
       end
f1! (generic function with 1 method)

julia> @benchmark f1!($dv, $x)
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     850.571 ns (0.00% GC)
  median time:      964.524 ns (0.00% GC)
  mean time:        1.102 μs (9.27% GC)
  maximum time:     15.104 μs (83.12% GC)
  --------------
  samples:          10000
  evals/sample:     42

which is almost as fast as the for loop.


#6
The correct expectation for Julia code is that the loop should be faster if performance traps are avoided. In this case the problem is the use of global variables, recommended reading is
https://docs.julialang.org/en/stable/manual/performance-tips/

Thanks. Fortunately I put the First Steps tag on the post :grin: - so far, I’m finding there are several conceptual shifts in learning effective use of Julia. For a variety of reasons, I would never ever want to use globals, but until I got the warnings from Julia 0.7 I didn’t know I was using them. From this and other replies I’m now better understanding how this works in Julia.

Part of my error was thinking that since everything was in a single file there was no real issue of global vs local. My bad. :frowning_face:


#7

OK - next time I will! I was trying to be thoughtful in keep it short. I have to say, I’m impressed by community support of new users here. A plus for Julia.


#8

Got it - I had figured out the broadcast on the RHS, with + and a number of other functions and found it very cool. I had not yet realized that one can do the broadcast on =.

So, when I get a little time later in the day, I will update the code to:

  • eliminate globals given my corrected understanding
  • apply the broadcast loop fusion as appropriate
  • measure with BenchmarkTools
  • reread the Performance section of the docs.

I’ve been working to iterate on reading and rereading sections of the documentation as I build background, but it takes a few passes to grasp some key points. :upside_down_face:

Thanks to all for the suggestions. I’ll plan to report back; if there are outstanding issues I’ll paste in the whole code too.


#9

Everything that is not encapsulated in a function is a global variable.
You can also explicitly declare a global variable inside a function by using the global keyword:

a = 1 # this is a global variable

function test()
    b = 1 # this is a local variable
    global c = 1 #this is a global variable
    return b
end

If you copy and paste the code above in the terminal, then you get:

julia> a
1

julia> b
ERROR: UndefVarError: b not defined

julia> c
ERROR: UndefVarError: c not defined

julia> test()
1

julia> b
ERROR: UndefVarError: b not defined

julia> c
1

#10

Join us on the Gitter chat room, we’re very friendly, we don’t bite!
It’s great for getting short questions answered fairly quickly.


#11

The official forum for Julia chat is Slack, you can join here. There is a #helpdesk channel in there where you can ask whatever questions you like. Core devs regularly hang out there, so you are likely to get correct answers quite quickly.


#12

Thanks, I did already sign up for that workspace - it seemed easier to post a longer code chunk to this discourse forum. But for shorter questions, slack seems very good.

Almost time to try the fixes to my little exercise…


#13

Dumb (or ignorant) question: in the online doc section on punctuation I can see that the ! here indicates that function f1 will change its arguments.

However that same section says that $ implies “string and expression interpolation”. From the section on Metaprogramming, I can learn a little about what interpolation does, but it is not clear to me why you would want this in what to me looks like a simple function call.

Is there an easy explanation?


#14

When @benchmark runs the code, the arguments to the function would normally be global variables.
This would mean you get dynamic dispatches and allocations. For functions that are fast enough, this could make the bulk of the fun time.

And normally we don’t care about that – if a type stable function is calling them, those specific dynamic dispatches and allocations won’t happen. So we normally prefer not to benchmark them.
One solution is to make the arguments global constants. But then the compiler is occasionally smart enough to actually elide some of the computations, so it’s most reliable and easiest to just interpolate in the arguments.

For example, on Julia 0.7:

julia> using StaticArrays, BenchmarkTools

julia> x = @SVector randn(8); y = @SVector randn(8);

julia> const a = x;
WARNING: redefining constant a

julia> const b = y;
WARNING: redefining constant b

julia> @benchmark x' * y
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     82.800 ns (0.00% GC)
  median time:      85.568 ns (0.00% GC)
  mean time:        106.039 ns (15.29% GC)
  maximum time:     66.686 μs (99.78% GC)
  --------------
  samples:          10000
  evals/sample:     967

julia> @benchmark a' * b
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.988 ns (0.00% GC)
  median time:      2.995 ns (0.00% GC)
  mean time:        3.008 ns (0.00% GC)
  maximum time:     29.273 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark $x' * $y
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.360 ns (0.00% GC)
  median time:      5.382 ns (0.00% GC)
  mean time:        5.394 ns (0.00% GC)
  maximum time:     15.310 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

#15

In case it’s not entirely clear, in the above a' * b can be completely statically evaluated:

julia> f() = a' * b
f (generic function with 2 methods)

julia> @code_llvm f()

; Function f
; Location: REPL[36]:1
define double @julia_f_37222() {
top:
  ret double 0x3FF58FA9EBEA3FAC
}

julia> reinterpret(Float64, 0x3FF58FA9EBEA3FAC)
1.3475741591864603

julia> f()
1.3475741591864603

This is because a and b are both constant and immutable so the result of a' * b is guaranteed to always be the same (1.3475741591864603 for my random values of a and b).


#16

OK, following my efforts to fix the issues cited by several folks earlier today, I’ve decreased run time from about 2.5 seconds to ~0.8 seconds. The old C code runs at about 0.3 seconds, however (time estimating from averaging about 60 runs). Also @benchmark also still suggests the code is using 1GB of memory which seems way too high. At your suggestion, I include the complete code below.

To run it from the REPL prompt, I did the following with Julia 0.7:

julia> c = fill(Float64(2500),10000);
julia> rho = fill(Float64(1000),10000);
julia> using BenchmarkTools
julia> @benchmark fd1d($c,$rho,5000,30.0,1.0,1.0,20,10000)
BenchmarkTools.Trial: 
  memory estimate:  1.06 GiB
  allocs estimate:  169722
  --------------
  minimum time:     778.029 ms (11.61% GC)
  median time:      789.278 ms (12.40% GC)
  mean time:        803.786 ms (13.11% GC)
  maximum time:     885.017 ms (18.07% GC)
  --------------
  samples:          7
  evals/sample:     1

Any thoughts how to reduce memory consumption and reduce speed would be appreciated!

Here’s the complete code:

function source(t, dt, dz, t0, α)
    x = -α*(t-t0)*(t-t0)
    if x > -40
        exp(x)
    else
        0.0
    end
end


function fd1d(c::Array{Float64,1},ρ::Array{Float64,1},iSource::Int64,freq::Float64,dz::Float64,tMax::Float64,iPrint::Int64,nNodes::Int64)
    # FD operation uses density avarage of adjacent nodes.
    # ρMean = (ρ+circshift(ρ,(-1)))/2.0
    # ρMean[nNodes] = ρ[nNodes]
    # since we have a homogeneous model at this, just copy ρ
    ρMean = ρ

    dt = dz/√2/maximum(c)/2
    nTimeSteps = Int(round(tMax/dt))
    t0 = (1/freq)*2.0
    α=2*9.8696*freq*freq

    p = zeros(Float64,nNodes)
    v = similar(p)
    dp = similar(p)
    dv = similar(p)

    pStepWeight = (ρ.*c.*c)*dt/dz
    outFile = open("fd1doutput.bin","w")

    for iTime in 1:nTimeSteps
        update_p_and_v!(p,v,dp,dv,pStepWeight,ρMean,iTime,dt,t0,α,iSource,dz,nNodes)
        if mod(iTime,iPrint)==0
            write(outFile,p)
        end
    end

    close(outFile)

    return(0)
end

function update_p_and_v!(p::Array{Float64},v::Array{Float64},dp::Array{Float64},
    dv::Array{Float64},pStepWeight::Array{Float64},ρMean::Array{Float64},iTime::Int64,
    dt::Float64,t0::Float64,α::Float64,iSource::Int64,dz::Float64,nNodes::Int64)
    ratio = dt/dz
    for ip in 2:length(dp)
        @inbounds p[ip] += (v[ip]-v[ip-1])*pStepWeight[ip]
    end
    p[1] = 0.0

    p[iSource] += source(iTime*dt, dt, dz, t0, α)

    for iv in 1:(length(dv)-1)
        @inbounds dv[iv] = p[iv+1]-p[iv]
    end
    @inbounds dv .= dv./ρMean*ratio
    @inbounds v .+= dv
    v[nNodes] = (ρMean[nNodes]*dz - dt*ρMean[nNodes]*c[nNodes])*v[nNodes]-2*dt*p[nNodes]
    v[nNodes] /= (ρMean[nNodes]*dz+dt*ρMean[nNodes]*c[nNodes])
    return(0)
end


#17

This line isn’t fusing all of the way. dv .= dv./ρMean.*ratio, or just do @. dv = dv/ρMean*ratio. @inbounds doesn’t do anything there.

How are you running this? Do you have an example?


#18

I had assumed that since ratio is a scalar Float 64, not a vector I could not broadcast. This is why there is no dot.

Here’s how I ran it (I did forget to note before that I also used include(“filename.jl”) to load the code):


#19

Think of it as a function *(::Array,::Float64), how would you write that function? If it always mutated the first array then a lot of “interesting” behavior would happen in the language, so instead it outputs a new array. If you want broadcast to fuse through a whole statement, then everything should be broadcasted.

function source(t, dt, dz, t0, α)
    x = -α*(t-t0)*(t-t0)
    if x > -40
        exp(x)
    else
        0.0
    end
end


function fd1d(c,ρ,iSource,freq,dz,tMax,iPrint,nNodes)
    # FD operation uses density avarage of adjacent nodes.
    # ρMean = (ρ+circshift(ρ,(-1)))/2.0
    # ρMean[nNodes] = ρ[nNodes]
    # since we have a homogeneous model at this, just copy ρ
    ρMean = ρ

    dt = dz/√2/maximum(c)/2
    nTimeSteps = Int(round(tMax/dt))
    t0 = (1/freq)*2.0
    α=2*9.8696*freq*freq

    p = zeros(Float64,nNodes)
    v = similar(p)
    dp = similar(p)
    dv = similar(p)

    pStepWeight = @. (ρ*c*c)*dt/dz
    outFile = open("fd1doutput.bin","w")

    for iTime in 1:nTimeSteps
        update_p_and_v!(p,v,dp,dv,pStepWeight,ρMean,iTime,dt,t0,α,iSource,dz,nNodes)
        if mod(iTime,iPrint)==0
            write(outFile,p)
        end
    end
    close(outFile)
    nothing
end

function update_p_and_v!(p,v,dp,dv,pStepWeight,ρMean,iTime,dt,t0,
                         α,iSource,dz,nNodes)
    ratio = dt/dz
    @inbounds for ip in 2:length(dp)
        p[ip] += (v[ip]-v[ip-1])*pStepWeight[ip]
    end
    p[1] = 0.0

    p[iSource] += source(iTime*dt, dt, dz, t0, α)

    @inbounds for iv in 1:(length(dv)-1)
        dv[iv] = p[iv+1]-p[iv]
    end
    @. dv = dv/ρMean*ratio
    v .+= dv
    v[nNodes] = (ρMean[nNodes]*dz - dt*ρMean[nNodes]*c[nNodes])*v[nNodes]-2*dt*p[nNodes]
    v[nNodes] /= (ρMean[nNodes]*dz+dt*ρMean[nNodes]*c[nNodes])
    nothing
end

c = fill(Float64(2500),10000);
ρ = fill(Float64(1000),10000);
using BenchmarkTools
@benchmark fd1d($c,$ρ,5000,30.0,1.0,1.0,20,10000)

Gives me:

BenchmarkTools.Trial: 
  memory estimate:  2.54 MiB
  allocs estimate:  141434
  --------------
  minimum time:     201.051 ms (0.00% GC)
  median time:      246.361 ms (0.00% GC)
  mean time:        245.380 ms (0.04% GC)
  maximum time:     291.300 ms (0.00% GC)
  --------------
  samples:          21
  evals/sample:     1

whereas without the full broadcasting it gives:

BenchmarkTools.Trial: 
  memory estimate:  1.06 GiB
  allocs estimate:  169718
  --------------
  minimum time:     570.185 ms (13.10% GC)
  median time:      624.017 ms (15.07% GC)
  mean time:        675.527 ms (13.51% GC)
  maximum time:     1.122 s (8.46% GC)
  --------------
  samples:          8
  evals/sample:     1

so let’s see where that puts you.


#20

OK, I put the last dot in (before *ratio) and the time goes down to about 0.4 sec, pretty close to C. I can see I still need to work to better understand broadcast.

Oh, and memory is down to about 2MB which is much more reasonable. Thanks!