Poor performance due to memory allocations?

Hello!

I have recently started using Julia and decided to do a simple example to test my understanding of the language and its performance pitfalls. The example is a simple Jacobi relaxation for solving Poisson’s equation. The code is fairly simple, just a double for-loop over a 2D array; see https://github.com/mirzadeh/benchmarks/blob/master/Jacobi/Jacobi.jl?ts=4

However, I am getting very poor performance. In a typical run (array of size 200x200), a similar C code (also in the repo) takes about 2 seconds to finish, whereas my Julia implementation (broadcast) takes about 60 seconds. Interestingly, the loop version is faster, at roughly 17 seconds. I see the same behavior when I call the C kernel directly from Julia.

It seems to me that the problem is related to lots and lots of memory allocation and deallocation. I am sure this is due to my lack of Julia knowledge, and I would really appreciate it if someone could help me understand how to fix the problem! :slightly_smiling_face:

2 Likes

Hi!

Using broadcast for the subtraction and changing the order in which the loops are computed, I reduced the execution time from 13 s to 8 s:

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)

	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
		end
	end
end

function solve(u::Array{Float64, 2}, tol = 1e-6)
	err = 1.0
	it = 1
	unew = copy(u)
	while err > tol
		relax_loop(u, unew)
		err = maximum(@. abs(u - unew))
		u, unew = unew, u
		it += 1
	end

	println("took ", it, " iterations with err = ", err)
end

function run()
	u = zeros(200, 200)
	u[:, end] .= 1.0;
	solve(u)
end

using BenchmarkTools
@btime run()

If you want to reduce the allocations to almost zero, you will need to create a matrix to store the result of abs, as in this code:

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)

	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
		end
	end
end

function solve(u::Array{Float64, 2}, tol = 1e-6)
	err = 1.0
	it = 1
	unew  = copy(u)
	err_m = similar(u)

	while err > tol
		relax_loop(u, unew)
		@. err_m = abs(u - unew)
		err = maximum(err_m)
		u, unew = unew, u
		it += 1
	end

	println("took ", it, " iterations with err = ", err)
end

function run()
	u = zeros(200, 200)
	u[:, end] .= 1.0;
	solve(u)
end

using BenchmarkTools
@btime run()

Now I get:

julia> include("fast.jl")
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
took 31475 iterations with err = 9.99900789139252e-7
  4.452 s (32 allocations: 938.33 KiB)

I found this problem by using the TimerOutputs.jl package. What you need to do is add these annotations at the locations you want to measure:

		@timeit "1" relax_loop(u, unew)
		@timeit "2" err .= maximum(@. abs(u - unew))
		@timeit "3" u, unew = unew, u
		@timeit "4" it += 1

Then, before executing the script, call reset_timer!(), execute the script once, and then call print_timer(). You will see:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      9.58s / 88.6%           9.52GiB / 98.6%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 2          31.5k    7.56s  89.1%   240μs   9.38GiB  100%    313KiB
 1          31.5k    922ms  10.9%  29.3μs         -  0.00%        -
 3          31.5k   2.83ms  0.03%  89.8ns         -  0.00%        -
 4          31.5k   1.68ms  0.02%  53.5ns         -  0.00%        -
 ──────────────────────────────────────────────────────────────────

Then I realized where the allocations were happening.
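
For reference, the whole measuring workflow is just the following (a minimal sketch, assuming the @timeit annotations above are already in place inside solve):

using TimerOutputs

reset_timer!()   # clear the default global timer
run()            # execute the instrumented code once
print_timer()    # print the table of timings and allocations shown above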

EDIT: Ah, I forgot to mention that if you really want performance, you should use a Julia system image precompiled for your architecture (see PackageCompiler.jl).

EDIT 2: If you want to improve the performance of the broadcast version, add the following annotations (it is still slower here, though, taking 5.7 s to execute):

function relax(u::Array{Float64, 2}, unew::Array{Float64, 2})
	@inbounds @views @. unew[2:end-1, 2:end-1] = 0.25*(u[1:end-2, 2:end-1] + u[3:end, 2:end-1] +
                                                       u[2:end-1, 1:end-2] + u[2:end-1, 3:end])
end
1 Like

After the following changes, the looped version gives me 3.6 seconds on my machine.

Swap the order of the loops in your relax_loop, since Julia stores arrays in column-major order.

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)

	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
		end
	end
end

Also replace your err calculation in solve with err = mapreduce(x -> abs(x[1] - x[2]), max, zip(u, unew)).

The mapreduce version does not need a new array to store the intermediate results. Apparently, it still allocates when used with zip, though; see https://github.com/JuliaLang/julia/issues/28752

Unfortunately, I could not stop the non-loop version from allocating (the @views solution still allocates when creating the views, and this slows things down).
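
As a side note, one way to sidestep the zip allocation is to reduce over a shared index range instead (just a sketch; eachindex(u, unew) also checks that both arrays have matching axes):

err = mapreduce(i -> abs(u[i] - unew[i]), max, eachindex(u, unew))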

1 Like

Here’s a still fairly fast version that avoids the explicit loops, in 3 simple steps:

  1. Use @views to avoid allocating a new array for each slice
  2. Use dot broadcasting with @. to fuse the computation into 1 loop
  3. Use a generator to avoid allocating a new array when computing the max abs diff.
@views function relax(u::Array{Float64, 2}, unew::Array{Float64, 2})
	@. unew[2:end-1, 2:end-1] = 0.25*(u[1:end-2, 2:end-1] + u[3:end, 2:end-1] +
								   u[2:end-1, 1:end-2] + u[2:end-1, 3:end])
end
function solve(u::Array{Float64, 2}, tol = 1e-6)
	err = 1.0
	it = 1
	unew = copy(u)
	while err > tol
		relax(u, unew)
		err = maximum(abs(x - y) for (x, y) in zip(u, unew))
		u, unew = unew, u
		it += 1
	end

	println("took ", it, " iterations with err = ", err)
end
function run()
	u = zeros(200, 200)
	u[:, end] .= 1.0;
	solve(u)
end
run()
julia> @time run()
took 31475 iterations with err = 9.99900789139252e-7
  7.309059 seconds (220.35 k allocations: 11.657 MiB)

1 Like

Thank you so much for all the analysis and also for introducing me to TimerOutputs.jl. I’m sure it’ll come in handy in the future. Are Julia arrays column-major by default? Is it possible to have C ordering (row-major) instead?

1 Like

Yes, Julia arrays are column-major, like MATLAB and Fortran.

I’m not sure whether you can change the internal layout, or whether someone has a package that provides alternative layouts, but it probably makes sense to use Julia’s ordering in Julia code, since that is what Julia packages will expect.
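
A quick way to convince yourself of the column-major layout (a small sketch):

A = [1 2; 3 4]
vec(A)    # == [1, 3, 2, 4]: the memory order walks down the columns first
A[2]      # == 3, since linear indexing follows the memory order
# For a row-major-style view without copying, Base provides
# PermutedDimsArray(A, (2, 1)); transpose(A) is another option.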

A few more ideas…

Starting from your optimized code based on the above recommendations, you have this result from @time:

  4.033748 seconds (31.51 k allocations: 1.572 MiB)

Apparently something’s allocating, since you only have two 200 x 200 matrices, which should use:

julia> "$(2 * Base.summarysize(zeros(200, 200))>>10) KB"
"624 KB"

You can easily pinpoint the allocations to the mapreduce call that calculates the error. So replace it with a loop instead (your existing absdiff function), and you’ll get:

  5.703612 seconds (34 allocations: 625.906 KiB)

So now we’ve fixed the memory, but why is it suddenly slower? Looking at absdiff (or using code_warntype) you can see this:

du = 0
...
du = max(du, abs(u[i,j] - unew[i,j]))

This initializes du to an integer, but then changes it to a float, causing type instability. It’s easily fixed by:

du = zero(eltype(u))
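
For reference, a type-stable absdiff could then look like this (a sketch here, since the original lives in the linked repo):

# Sketch of a type-stable absdiff: du starts as zero(eltype(u)),
# so its type never changes inside the loop.
function absdiff(u::Array{Float64, 2}, unew::Array{Float64, 2})
	du = zero(eltype(u))
	@inbounds for j in 1:size(u, 2)
		for i in 1:size(u, 1)
			du = max(du, abs(u[i, j] - unew[i, j]))
		end
	end
	du
end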

Similarly, let’s change err = 1 to err = tol*2 (although this doesn’t affect the run time). Now:

  3.822652 seconds (34 allocations: 625.906 KiB)

Now let’s take a look at absdiff, which is the slowest part of your program. Let’s change this max call:

du = max(du, abs(u[i,j] - unew[i,j]))

to an explicit compare and assign:

dif = abs(u[i,j] - unew[i,j])
dif > du && (du = dif)

Big difference:

  2.228196 seconds (36 allocations: 626.906 KiB)

Finally, notice that relax_loop and absdiff both loop over u and unew in the same way. So why not combine them into a single method?

function relax_loop_diff(u::Array{Float64, 2}, unew::Array{Float64, 2})
	nx, ny = size(u)
	du = zero(eltype(u))
	@inbounds for j in 2:ny-1
		@simd for i in 2:nx-1
			unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
			dif = abs(u[i,j] - unew[i,j])
			dif > du && (du = dif)
		end
	end
	du
end

Result:

  1.775929 seconds (34 allocations: 625.906 KiB)

That’s all for now :slight_smile:

25 Likes

This is awesome! I couldn’t resist comparing against Fortran, and guess what? Your optimized version is 15% faster than the equivalent Fortran version. The times of the two codes on my machine were as follows:

Time: 1.563 s (30 allocations: 625.75 KiB) # Julia
Time: 1.79700005 s                         # Fortran

The GNU Fortran 8.2.0 compiler was used with the -Ofast option to compile the Fortran version.
Here are the two versions for reference:

Julia Version

function relax_loop_diff(u::Array{Float64, 2}, unew::Array{Float64, 2})
    nx, ny = size(u)
    du = zero(eltype(u))
    @inbounds for j in 2:ny-1
        @simd for i in 2:nx-1
            unew[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
            dif = abs(u[i,j] - unew[i,j])
            dif > du && (du = dif)
        end
    end
    du
end

function solve(u::Array{Float64, 2}, tol = 1e-6)
    err = 1
    it = 1
    unew = copy(u)
    while err > tol
        err = relax_loop_diff(u, unew)
        u, unew = unew, u
        it += 1
    end
    println("took ", it, " iterations with err = ", err)
end

function run()
    u = zeros(200, 200)
    u[:, end] .= 1;
    solve(u)
end

using BenchmarkTools
@btime run()

Fortran version:

program run
implicit none
integer, parameter :: n = 200
real(8) :: u(n,n), unew(n,n), tol = 1e-6, err
integer :: it, t0, t1, count_rate, count_max

call system_clock(t0, count_rate, count_max)
   u = 0
   u(:,n) = 1
   err = 1
   it  = 1
   unew = u
   do while (err > tol)
      unew(2:n-1,2:n-1) = 0.25*(u(:n-2,2:n-1) + u(3:,2:n-1) + u(2:n-1,:n-2) + u(2:n-1,3:))
      err = maxval(abs(u - unew))
      call swap(u, unew)
      it = it + 1
   end do
   print *, "took", it, "iterations with err =", err
call system_clock(t1)
print *, 'Time:', real(t1-t0)/count_rate

contains
elemental subroutine swap(a,b)
    real(8), intent(inout) :: a, b
    real(8) :: temp
    temp=a; a=b; b=temp
end

end program run
2 Likes

What happens if you devectorize the Fortran version?

1 Like

On an unrelated note: when you have a function that mutates one of its arguments, there is a (pretty strict) convention to give it a name ending with !. There is also a (somewhat looser) convention of putting the mutated argument first in the argument list.

So, instead of

function relax_loop(u::Array{Float64, 2}, unew::Array{Float64, 2})

you might use

function relax_loop!(unew::Array{Float64, 2}, u::Array{Float64, 2})
4 Likes

Thanks! I really learned a lot today. That’s what I love about Julia: you can code very quickly, just like in MATLAB, which is excellent for verifying and debugging your code. Then, with small modifications, you can turn it into something comparable to C/Fortran.

4 Likes

I wrote the devectorized Fortran version exactly like the Julia one, and it timed the same as Julia. Still impressive!

2 Likes

WOW! I have learned a lot, so thank you :slight_smile:

This was a test for me to both learn Julia better and get a feeling for its performance. I am thoroughly impressed by Julia. Right now, the fastest Julia implementation is very close to the C code (only ~5% slower). The Python + Numba version comes second, at ~20% slower, and MATLAB via MEX is ~85% slower (although I suspect that one suffers a lot from memory reallocations as well).

So all of this leaves me with one question: what is the best practice for writing Julia? Start with built-in functionality (broadcasts, built-in functions, etc.) and then profile and optimize incrementally, or default to loops and C-like programming from the start and let the JIT compiler do its magic?

Thanks!

For Julia, I like to think:

Get it working, get it fast

Write it the simplest way possible and get it well tested. Once it’s working, optimize away unnecessary operations and temporary arrays. It’s much easier to write fast code when you’re only changing one thing at a time, and when you have a test that tells you when a change broke something :slight_smile:.

4 Likes

Use built-in functions if they do exactly what you need, but don’t force it the way other scripting languages encourage you to.

I also learned a lot from this post, thanks! It would be cool if some Julia masters wrote books like Effective C++ and More Effective C++. I really enjoy those books.

2 Likes

Two notes:

First, think long and hard about whether you really need Float64. If your @simd was successful, then Float32 should be twice as fast. Problems such as yours often come from discretizing a continuous (non-finitary) problem, and the discretization error is often larger than both the floating-point errors and the error incurred by using only a finite number of iterations. So extract the discretization error from whatever PDE theorem you were using, don’t set a tighter tolerance than warranted, and switch to Float32 whenever possible.

Second, use Float32 to test whether your code actually uses SIMD (you should also read the @code_native output, but that takes some time to get used to). You want an almost 2x speed-up; otherwise something is rotten.

julia> using BenchmarkTools

julia> function relax_loop_diff(u::Array{T, 2}, unew::Array{T, 2}) where T
               nx, ny = size(u)
               du = zero(T)
               @inbounds for j in 2:ny-1
                       @simd for i in 2:nx-1
                               unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                               dif = abs(u[i,j] - unew[i,j])
                               dif > du && (du = dif)
                       end
               end
               du
       end;

julia> u64=rand(200,200); u64n = copy(u64); u32 = Float32.(u64); u32n=copy(u32);

julia> @btime relax_loop_diff(u64, u64n)
  90.744 μs (1 allocation: 16 bytes)
0.911639555429068

julia> @btime relax_loop_diff(u32, u32n)
  86.308 μs (1 allocation: 16 bytes)
0.9116396f0

julia> 90*2000/(200*200)
4.5

I’m running this on a 2 GHz machine, so the loop takes about 4.5 cycles per iteration. This feels absurdly bad. Also, the Float32 code is only insignificantly faster than the Float64 code. This tells us that something went wrong.

So let’s do better. It is known that LLVM has problems figuring out that max is associative, and therefore all max-reductions vectorize poorly. Further, there are sometimes issues with aliasing analysis: the compiler needs to take into account the possibility that u and unew alias, which may prevent the use of some vectorized constructions.

Let’s first check the maximum problem:

julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
               nx, ny = size(u)
               du = zero(T)
               
               @inbounds for j in 2:ny-1
                       @simd for i in 2:nx-1
                               unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                               #dif = abs(u[i,j] - unew[i,j])
                               #dif > du && (du = dif)
                       end
               end
               du
       end
julia> @btime relax_loop_diff_ne(u64, u64n)
  35.499 μs (1 allocation: 16 bytes)
0.0
julia> @btime relax_loop_diff_ne(u32, u32n)
  17.127 μs (1 allocation: 16 bytes)
0.0f0
julia> @btime broadcast!(identity, u32n, u32);
  9.433 μs (0 allocations: 0 bytes)
julia> @btime broadcast!(identity, u64n, u64);
  23.927 μs (0 allocations: 0 bytes)

First, this is way faster. Second, we see that SIMD is used. Third, we see that we are within 2x of memcopy speed.

We can next check for the second problem (aliasing) by benchmarking

julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
               nx, ny = size(u)
               du = zero(T)
               
               @inbounds for j in 2:ny-1
                       @simd ivdep for i in 2:nx-1
                               unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                               #dif = abs(u[i,j] - unew[i,j])
                               #dif > du && (du = dif)
                       end
               end
               du
       end

This gives no significant additional advantage, so no reason to make a fuss about this.

So, what should you do about it? For example, check the error only every couple of iterations, as sketched below. In addition, you could try to make the max faster, possibly using SIMD.jl. In principle, Distances.jl should have a reasonably fast implementation of the max-distance (and if not, open an issue there).
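
Here is a sketch of the "check the error only every couple of iterations" idea (check_every is an arbitrary name; relax_loop_diff and relax_loop_diff_ne are the kernels from above):

# Sketch: run the fast, max-free kernel most of the time and compute the
# error only every check_every iterations (so we may overshoot by a few).
function solve(u::Array{T, 2}, tol = 1e-6; check_every = 50) where T
	err = T(2 * tol)
	it = 1
	unew = copy(u)
	while err > tol
		if it % check_every == 0
			err = relax_loop_diff(u, unew)   # also tracks the max difference
		else
			relax_loop_diff_ne(u, unew)      # SIMD-friendly, no max-reduction
		end
		u, unew = unew, u
		it += 1
	end
	println("took ", it, " iterations with err = ", err)
end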

4 Likes