Two notes:
First, think long and hard about whether you really need `Float64`. If your `@simd` was successful, then `Float32` should be twice as fast. Problems such as yours often come from discretizing a continuous (non-finitary) problem, and your discretization error is often larger than both floating-point errors and errors incurred by using only a finite number of iterations. So extract the discretization error from whatever PDE theorem you were using, don't set a tighter tolerance than warranted, and switch to `Float32` whenever possible.
Second, use `Float32` to test whether your code actually uses SIMD (you should also read the `@code_native` output, but that takes some time to get used to). You want an almost 2x speed-up; otherwise something is rotten.
julia> using BenchmarkTools
julia> function relax_loop_diff(u::Array{T, 2}, unew::Array{T, 2}) where T
           nx, ny = size(u)
           du = zero(T)
           @inbounds for j in 2:ny-1
               @simd for i in 2:nx-1
                   unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                   dif = abs(u[i,j] - unew[i,j])
                   dif > du && (du = dif)
               end
           end
           du
       end;
julia> u64=rand(200,200); u64n = copy(u64); u32 = Float32.(u64); u32n=copy(u32);
julia> @btime relax_loop_diff(u64, u64n)
90.744 μs (1 allocation: 16 bytes)
0.911639555429068
julia> @btime relax_loop_diff(u32, u32n)
86.308 μs (1 allocation: 16 bytes)
0.9116396f0
julia> 90*2000/(200*200)
4.5
I'm running this on a 2 GHz machine, so the loop takes about 4.5 cycles per iteration (90 μs × 2000 cycles/μs spread over roughly 200·200 = 40,000 inner iterations). This feels absurdly bad. The `Float32` code is insignificantly faster than the `Float64` code. This tells us that something went wrong.
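One way to confirm the suspicion (a sketch; the assembly output is not reproduced here) is to look at the generated code for the `Float32` version and check whether the inner loop uses packed vector instructions on `ymm`/`zmm` registers or only scalar, `ss`-suffixed instructions on `xmm` registers:
julia> @code_native relax_loop_diff(u32, u32n)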
So let's do better. It is known that LLVM has problems figuring out that `max` is associative, and therefore all `max`-reductions suck. Further, there are sometimes issues with aliasing analysis: the compiler needs to take into account the possibility that `u` and `unew` alias, which may prevent the use of some vectorized constructions.
Let's first check the maximum problem:
julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
           nx, ny = size(u)
           du = zero(T)
           @inbounds for j in 2:ny-1
               @simd for i in 2:nx-1
                   unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                   #dif = abs(u[i,j] - unew[i,j])
                   #dif > du && (du = dif)
               end
           end
           du
       end
julia> @btime relax_loop_diff_ne(u64, u64n)
35.499 μs (1 allocation: 16 bytes)
0.0
julia> @btime relax_loop_diff_ne(u32, u32n)
17.127 μs (1 allocation: 16 bytes)
0.0f0
julia> @btime broadcast!(identity, u32n, u32);
9.433 μs (0 allocations: 0 bytes)
julia> @btime broadcast!(identity, u64n, u64);
23.927 μs (0 allocations: 0 bytes)
First, this is way faster. Second, we see that SIMD is used: the `Float32` version is now roughly twice as fast as the `Float64` one. Third, we see that we are within 2x of memcopy speed.
We can next check for the second problem (aliasing) by benchmarking the same kernel with `ivdep`:
julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
           nx, ny = size(u)
           du = zero(T)
           @inbounds for j in 2:ny-1
               @simd ivdep for i in 2:nx-1
                   unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                   #dif = abs(u[i,j] - unew[i,j])
                   #dif > du && (du = dif)
               end
           end
           du
       end
This gives no significant additional advantage, so no reason to make a fuss about this.
So, what should you do about it? For example, check the error only every couple of iterations. In addition, you could try to make the max faster, possibly using SIMD.jl. In principle, Distances.jl should have a reasonably fast implementation of the max-distance (and if not, then open an issue there).
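To make the first suggestion concrete, here is a minimal sketch (untested, and the names `maxdiff` and `relax!` are my own): keep the hot sweep max-free by using `relax_loop_diff_ne` from above, and compute the maximum difference in a separate pass only every `check_every` sweeps. Distances.jl's max-distance could play the role of `maxdiff` if it turns out to be faster.
# Hypothetical helper: separate max-abs-difference pass, run only occasionally,
# so its (possibly non-SIMD) speed matters much less.
function maxdiff(u, unew)
    du = zero(eltype(u))
    @inbounds for i in eachindex(u, unew)
        du = max(du, abs(u[i] - unew[i]))
    end
    du
end

# Hypothetical driver: fast max-free sweeps, convergence check every `check_every` sweeps.
function relax!(u, unew; tol=sqrt(eps(eltype(u))), maxiter=10_000, check_every=10)
    for iter in 1:maxiter
        relax_loop_diff_ne(u, unew)                       # fast sweep from above
        if iter % check_every == 0 && maxdiff(u, unew) < tol
            return iter                                   # converged
        end
        u, unew = unew, u                                 # swap buffers for the next sweep
    end
    return maxiter
end
Checking only every `check_every` sweeps trades a few extra iterations for keeping the inner loop free of the branchy max-reduction; note that the final state ends up in whichever buffer was written last.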