Poor performance due to memory allocations?

Hello!

I have recently started using Julia and decided to do a simple example to test my understanding of the language and its performance pitfalls. The example is a simple Jacobi relaxation for solving Poisson’s equation. The code is fairly simple, just a double for-loop over a 2D array; see https://github.com/mirzadeh/benchmarks/blob/master/Jacobi/Jacobi.jl?ts=4

However, I am getting very poor performance. In a typical run (array of size 200x200), a similar C code (also in the repo) takes about 2 seconds to finish, whereas my Julia implementation (broadcast) takes about 60 seconds. Interestingly, the loop version is faster, at roughly 17 seconds. I see the same behavior when I call the C kernel directly from Julia.

It seems to me that the problem is related to lots and lots of memory allocation and deallocation. I am sure this is due to my lack of Julia knowledge, and I would really appreciate it if someone could help me understand how to fix the problem! :slightly_smiling_face:

2 Likes

Hi!

Using broadcast for the subtraction and changing the order in which the loops are computed, I reduced the execution time from 13 s to 8 s:

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)

	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
		end
	end
end

function solve(u::Array{Float64, 2}, tol = 1e-6)
	err = 1.0
	it = 1
	unew = copy(u)
	while err > tol
		relax_loop(u, unew)
		err = maximum(@. abs(u - unew))
		u, unew = unew, u
		it += 1
	end

	println("took ", it, " iterations with err = ", err)
end

function run()
	u = zeros(200, 200)
	u[:, end] .= 1.0;
	solve(u)
end

using BenchmarkTools
@btime run()

If you want to reduce the allocations to almost zero, you will need to create a matrix to store the result of abs, as in this code:

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)

	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
		end
	end
end

function solve(u::Array{Float64, 2}, tol = 1e-6)
	err = 1.0
	it = 1
	unew  = copy(u)
	err_m = similar(u)

	while err > tol
		relax_loop(u, unew)
		@. err_m = abs(u - unew)
		err = maximum(err_m)
		u, unew = unew, u
		it += 1
	end

	println("took ", it, " iterations with err = ", err)
end

function run()
	u = zeros(200, 200)
	u[:, end] .= 1.0;
	solve(u)
end

using BenchmarkTools
@btime run()

Now I get:

julia> include("fast.jl")
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
  4.452 s (32 allocations: 938.33 KiB)

I found this problem by using the TimerOutputs.jl package. What you need to do is add these annotations at the locations you want to measure:

		@timeit "1" relax_loop(u, unew)
		@timeit "2" err .= maximum(@. abs(u - unew))
		@timeit "3" u, unew = unew, u
		@timeit "4" it += 1

Then, before executing the script, call reset_timer!(), execute the script once, and then call print_timer(). You will see:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      9.58s / 88.6%           9.52GiB / 98.6%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 2          31.5k    7.56s  89.1%   240μs   9.38GiB  100%    313KiB
 1          31.5k    922ms  10.9%  29.3μs         -  0.00%        -
 3          31.5k   2.83ms  0.03%  89.8ns         -  0.00%        -
 4          31.5k   1.68ms  0.02%  53.5ns         -  0.00%        -
 ──────────────────────────────────────────────────────────────────

Then I realized where the allocations were happening.
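
For reference, the whole measuring workflow is just the following (a minimal sketch, assuming the @timeit annotations above are already in place inside solve):

using TimerOutputs

reset_timer!()   # clear the default global timer
run()            # execute the instrumented code once
print_timer()    # print the table of timings and allocations shown above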

EDIT: Ah, I forgot to mention that if you really want performance, you should use a Julia system image precompiled for your architecture (see PackageCompiler.jl).

EDIT 2: If you want to improve the performance of the broadcast version, add the following annotations (it is still slower here, though, taking 5.7 s to execute):

function relax(u::Array{Float64, 2}, unew::Array{Float64, 2})
	@inbounds @views @. unew[2:end-1, 2:end-1] = 0.25*(u[1:end-2, 2:end-1] + u[3:end, 2:end-1] +
                                                       u[2:end-1, 1:end-2] + u[2:end-1, 3:end])
end
1 Like

After the following changes, the looped version gives me 3.6 seconds on my machine.

Swap the order of the loops in your relax_loop, since Julia stores arrays in column-major order.

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)

	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
		end
	end
end

Also replace your err calculation in solve with err = mapreduce(x -> abs(x[1] - x[2]), max, zip(u, unew)).

The mapreduce version does not need a new array to store the intermediate results. Apparently, it still allocates when used with zip, though; see https://github.com/JuliaLang/julia/issues/28752

Unfortunately, I could not stop the non-loop version from allocating (the @views solution still allocates when creating the views, and this slows things down).
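
As a side note, one way to sidestep the zip allocation is to reduce over a shared index range instead (just a sketch; eachindex(u, unew) also checks that both arrays have matching axes):

err = mapreduce(i -> abs(u[i] - unew[i]), max, eachindex(u, unew))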

1 Like

Here’s a still fairly fast version that avoids the explicit loops, in 3 simple steps:

  1. Use @views to avoid allocating a new array for each slice
  2. Use dot broadcasting with @. to fuse the computation into 1 loop
  3. Use a generator to avoid allocating a new array when computing the max abs diff.
@views function relax(u::Array{Float64, 2}, unew::Array{Float64, 2})
	@. unew[2:end-1, 2:end-1] = 0.25*(u[1:end-2, 2:end-1] + u[3:end, 2:end-1] +
								   u[2:end-1, 1:end-2] + u[2:end-1, 3:end])
end
function solve(u::Array{Float64, 2}, tol = 1e-6)
	err = 1.0
	it = 1
	unew = copy(u)
	while err > tol
		relax(u, unew)
		err = maximum(abs(x - y) for (x, y) in zip(u, unew))
		u, unew = unew, u
		it += 1
	end

	println("took ", it, " iterations with err = ", err)
end
function run()
	u = zeros(200, 200)
	u[:, end] .= 1.0;
	solve(u)
end
run()
julia> @time run()
took 31475 iterations with err = 9.99900789139252e-7
  7.309059 seconds (220.35 k allocations: 11.657 MiB)

1 Like

Thank you so much for all the analysis and also for introducing me to TimerOutputs.jl. I’m sure it’ll come in handy in the future. Are Julia arrays column-major by default? Is it possible to have C ordering (row-major) instead?

1 Like

Yes, Julia arrays are column-major, like MATLAB and Fortran.

I’m not sure whether you can change the internal layout, or whether someone has a package that provides alternative layouts, but it probably makes sense to use Julia’s ordering in Julia code, since that is what Julia packages will expect.
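
A quick way to convince yourself of the column-major layout (a small sketch):

A = [1 2; 3 4]
vec(A)    # == [1, 3, 2, 4]: the memory order walks down the columns first
A[2]      # == 3, since linear indexing follows the memory order
# For a row-major-style view without copying, Base provides
# PermutedDimsArray(A, (2, 1)); transpose(A) is another option.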

A few more ideas…

Starting from your optimized code based on the above recommendations, you have this result from @time:

  4.033748 seconds (31.51 k allocations: 1.572 MiB)

Apparently something’s allocating, since you only have two 200 x 200 matrices, which should use:

julia> "$(2 * Base.summarysize(zeros(200, 200))>>10) KB"
"624 KB"

You can easily pinpoint the allocations to the mapreduce call that calculates the error. So replace it with a loop instead (your existing absdiff function), and you’ll get:

  5.703612 seconds (34 allocations: 625.906 KiB)

So now we’ve fixed the memory, but why is it suddenly slower? Looking at absdiff (or using code_warntype) you can see this:

du = 0
...
du = max(du, abs(u[i,j] - unew[i,j]))

This initializes du to an integer, but then changes it to a float, causing type instability. It’s easily fixed by:

du = zero(eltype(u))
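
For reference, a type-stable absdiff could then look like this (a sketch here, since the original lives in the linked repo):

# Sketch of a type-stable absdiff: du starts as zero(eltype(u)),
# so its type never changes inside the loop.
function absdiff(u::Array{Float64, 2}, unew::Array{Float64, 2})
	du = zero(eltype(u))
	@inbounds for j in 1:size(u, 2)
		for i in 1:size(u, 1)
			du = max(du, abs(u[i, j] - unew[i, j]))
		end
	end
	du
end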

Similarly, let’s change err = 1 to err = tol*2 (although this doesn’t affect the run time). Now:

  3.822652 seconds (34 allocations: 625.906 KiB)

Now let’s take a look at absdiff, which is the slowest part of your program. Let’s change this max call:

du = max(du, abs(u[i,j] - unew[i,j]))

to an explicit compare and assign:

dif = abs(u[i,j] - unew[i,j])
dif > du && (du = dif)

Big difference:

  2.228196 seconds (36 allocations: 626.906 KiB)

Finally, notice that relax_loop and absdiff both loop over u and unew in the same way. So why not combine them into a single method?

function relax_loop_diff(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)
	du = zero(eltype(u))
	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
			dif = abs(u[i,j] - unew[i,j])
			dif > du && (du = dif)
		end
	end
	du
end

Result:

  1.775929 seconds (34 allocations: 625.906 KiB)

That’s all for now :slight_smile:

25 Likes

This is awesome! I couldn’t resist comparing against Fortran, and guess what? Your optimized version is 15% faster than the equivalent Fortran version. The times of the two codes on my machine were as follows:

Time: 1.563 s (30 allocations: 625.75 KiB) # Julia
Time: 1.79700005 s                         # Fortran

The GNU Fortran 8.2.0 compiler was used with the -Ofast option to compile the Fortran version.
Here are the two versions for reference:

Julia Version

function relax_loop_diff(u::Array{Float64, 2}, unew::Array{Float64, 2})
    nx, ny = size(u)
    du = zero(eltype(u))
    @inbounds for j in 2:ny-1
        @simd for i in 2:nx-1
            unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
            dif = abs(u[i,j] - unew[i,j])
            dif > du && (du = dif)
        end
    end
    du
end

function solve(u::Array{Float64, 2}, tol = 1e-6)
    err = 1
    it = 1
    unew = copy(u)
    while err > tol
        err = relax_loop_diff(u, unew)
        u, unew = unew, u
        it += 1
    end
    println("took ", it, " iterations with err = ", err)
end

function run()
    u = zeros(200, 200)
    u[:, end] .= 1;
    solve(u)
end

using BenchmarkTools
@btime run()

Fortran version:

program run
implicit none
integer, parameter :: n = 200
real(8) :: u(n,n), unew(n,n), tol = 1e-6, err
integer :: it, t0, t1, count_rate, count_max

call system_clock(t0, count_rate, count_max)
   u = 0
   u(:,n) = 1
   err = 1
   it  = 1
   unew = u
   do while (err > tol)
      unew(2:n-1,2:n-1) = 0.25*(u(:n-2,2:n-1) + u(3:,2:n-1) + u(2:n-1,:n-2) + u(2:n-1,3:))
      err = maxval(abs(u - unew))
      call swap(u, unew)
      it = it + 1
   end do
   print *, "took", it, "iterations with err =", err
call system_clock(t1)
print *, 'Time:', real(t1-t0)/count_rate

contains
elemental subroutine swap(a,b)
    real(8), intent(inout) :: a, b
    real(8) :: temp
    temp=a; a=b; b=temp
end

end program run
2 Likes

What happens if you devectorize the Fortran version?

1 Like

On an unrelated note: when you have a function that mutates one of its arguments, there is a (pretty strict) convention to give it a name ending with !. There is also a (somewhat looser) convention of putting the mutated argument first in the argument list.

So, instead of

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})

you might use

function relax_loop!(unew::Array{Float64, 2}, u::Array{Float64, 2})
4 Likes

Thanks! I really learned a lot today. That’s what I love about Julia: you can code very quickly, just like in MATLAB, which is excellent for verifying and debugging your code. Then, with small modifications, you can turn it into something comparable to C/Fortran.

4 Likes

I wrote the devectorized Fortran version exactly like the Julia one, and it timed the same as Julia. Still impressive!

2 Likes

WOW! I have learned a lot, so thank you :slight_smile:

This was a test for me to both learn Julia better and get a feeling for its performance. I am thoroughly impressed by Julia. Right now, the fastest Julia implementation is very close to the C code (only ~5% slower). The Python + Numba version comes second, at ~20% slower, and MATLAB via MEX is ~85% slower (although I suspect that one suffers a lot from memory reallocations as well).

So all of this leaves me with one question: what is the best practice for writing Julia? Start with built-in functionality (broadcasts, built-in functions, etc.) and then profile and optimize incrementally, or default to loops and C-like programming from the start and let the JIT compiler do its magic?

Thanks!

For Julia, I like to think:

Get it working, get it fast

Write it the simplest way possible and get it well tested. Once it’s working, optimize away unnecessary operations and temporary arrays. It’s much easier to write fast code when you’re only changing one thing at a time, and when you have a test that tells you when a change broke something :slight_smile:.

4 Likes

Use built-in functions if they do exactly what you need, but don’t force it the way other scripting languages encourage you to.

I also learned a lot from this post, thanks! It would be cool if some Julia masters wrote books like Effective C++ and More Effective C++. I really enjoy those books.

2 Likes

Two notes:

First, think long and hard about whether you really need Float64. If your @simd was successful, then Float32 should be twice as fast. Problems such as yours often come from discretizing a continuous (non-finitary) problem, and the discretization error is often larger than both the floating-point errors and the error incurred by using only a finite number of iterations. So extract the discretization error from whatever PDE theorem you were using, don’t set a tighter tolerance than warranted, and switch to Float32 whenever possible.

Second, use Float32 to test whether your code actually uses SIMD (you should also read the @code_native output, but that takes some time to get used to). You want an almost 2x speed-up; otherwise something is rotten.

julia> using BenchmarkTools

julia> function relax_loop_diff(u::Array{T, 2}, unew::Array{T, 2}) where T
               nx, ny = size(u)
               du = zero(T)
               @inbounds for j in 2:ny-1
                       @simd for i in 2:nx-1
                               unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                               dif = abs(u[i,j] - unew[i,j])
                               dif > du && (du = dif)
                       end
               end
               du
       end;

julia> u64=rand(200,200); u64n = copy(u64); u32 = Float32.(u64); u32n=copy(u32);

julia> @btime relax_loop_diff(u64, u64n)
  90.744 μs (1 allocation: 16 bytes)
0.911639555429068

julia> @btime relax_loop_diff(u32, u32n)
  86.308 μs (1 allocation: 16 bytes)
0.9116396f0

julia> 90*2000/(200*200)
4.5

I’m running this on a 2 GHz machine, so the loop takes about 4.5 cycles per iteration. This feels absurdly bad. Also, the Float32 code is only insignificantly faster than the Float64 code. This tells us that something went wrong.

So let’s do better. It is known that LLVM has problems figuring out that max is associative, and therefore all max-reductions vectorize poorly. Further, there are sometimes issues with aliasing analysis: the compiler needs to take into account the possibility that u and unew alias, which may prevent the use of some vectorized constructions.

Let’s first check the maximum problem:

julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
               nx, ny = size(u)
               du = zero(T)
               
               @inbounds for j in 2:ny-1
                       @simd for i in 2:nx-1
                               unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                               #dif = abs(u[i,j] - unew[i,j])
                               #dif > du && (du = dif)
                       end
               end
               du
       end
julia> @btime relax_loop_diff_ne(u64, u64n)
  35.499 μs (1 allocation: 16 bytes)
0.0
julia> @btime relax_loop_diff_ne(u32, u32n)
  17.127 μs (1 allocation: 16 bytes)
0.0f0
julia> @btime broadcast!(identity, u32n, u32);
  9.433 μs (0 allocations: 0 bytes)
julia> @btime broadcast!(identity, u64n, u64);
  23.927 μs (0 allocations: 0 bytes)

First, this is way faster. Second, we see that SIMD is used. Third, we see that we are within 2x of memcopy speed.

We can next check for the second problem (aliasing) by benchmarking

julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
               nx, ny = size(u)
               du = zero(T)
               
               @inbounds for j in 2:ny-1
                       @simd ivdep for i in 2:nx-1
                               unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                               #dif = abs(u[i,j] - unew[i,j])
                               #dif > du && (du = dif)
                       end
               end
               du
       end

This gives no significant additional advantage, so no reason to make a fuss about this.

So, what should you do about it? For example, check the error only every couple of iterations, as sketched below. In addition, you could try to make the max faster, possibly using SIMD.jl. In principle, Distances.jl should have a reasonably fast implementation of the max-distance (and if not, open an issue there).
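
Here is a sketch of the "check the error only every couple of iterations" idea (check_every is an arbitrary name; relax_loop_diff and relax_loop_diff_ne are the kernels from above):

# Sketch: run the fast, max-free kernel most of the time and compute the
# error only every check_every iterations (so we may overshoot by a few).
function solve(u::Array{T, 2}, tol = 1e-6; check_every = 50) where T
	err = T(2 * tol)
	it = 1
	unew = copy(u)
	while err > tol
		if it % check_every == 0
			err = relax_loop_diff(u, unew)   # also tracks the max difference
		else
			relax_loop_diff_ne(u, unew)      # SIMD-friendly, no max-reduction
		end
		u, unew = unew, u
		it += 1
	end
	println("took ", it, " iterations with err = ", err)
end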

4 Likes