Two notes:

First, think long and hard about whether you really need `Float64`. If your `@simd` was successful, then `Float32` should be twice as fast. Problems such as yours often come from discretizing a continuous (non-finitary) problem, and the discretization error is often larger than both the floating-point errors and the error incurred by using only a finite number of iterations. So extract the discretization error from whatever PDE theorem you were using, don't set a tighter tolerance than warranted, and switch to `Float32` whenever possible.
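To put rough numbers on that, here is a back-of-the-envelope check (assuming, purely for illustration, a second-order scheme on a 200×200 grid; your actual bound comes from your PDE theorem):

```julia
# Hypothetical sanity check: compare single-precision machine epsilon
# against a typical O(h^2) discretization error on a 200×200 grid.
h = 1/200                 # assumed grid spacing
disc_err = h^2            # assumed second-order discretization error
println(disc_err)         # 2.5e-5
println(eps(Float32))     # ≈ 1.19e-7, two orders of magnitude smaller
```

If your discretization error really looks like this, `Float32` round-off is nowhere near the bottleneck.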

Second, use `Float32` to test whether your code actually uses SIMD (you should also read the `@code_native` output, but that takes some time to get used to). You want an almost 2x speed-up; otherwise something is rotten.

```
julia> using BenchmarkTools

julia> function relax_loop_diff(u::Array{T, 2}, unew::Array{T, 2}) where T
           nx, ny = size(u)
           du = zero(T)
           @inbounds for j in 2:ny-1
               @simd for i in 2:nx-1
                   unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                   dif = abs(u[i, j] - unew[i, j])
                   dif > du && (du = dif)
               end
           end
           du
       end;

julia> u64 = rand(200, 200); u64n = copy(u64); u32 = Float32.(u64); u32n = copy(u32);

julia> @btime relax_loop_diff(u64, u64n)
  90.744 μs (1 allocation: 16 bytes)
0.911639555429068

julia> @btime relax_loop_diff(u32, u32n)
  86.308 μs (1 allocation: 16 bytes)
0.9116396f0

julia> 90*2000/(200*200)
4.5
```

I'm running this on a 2 GHz machine, so the loop takes 4.5 cycles per iteration. This feels absurdly bad. The `Float32` code is only insignificantly faster than the `Float64` code. This tells us that something went wrong.

So let's do better. It is known that LLVM has problems figuring out that `max` is associative, and therefore all `max`-reductions suck. Further, there are sometimes issues with aliasing analysis: the compiler needs to take into account the possibility that `u` and `unew` alias, which may prevent the use of some vectorized constructions.

Let's first check the maximum problem:

```
julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
           nx, ny = size(u)
           du = zero(T)
           @inbounds for j in 2:ny-1
               @simd for i in 2:nx-1
                   unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                   #dif = abs(u[i, j] - unew[i, j])
                   #dif > du && (du = dif)
               end
           end
           du
       end;

julia> @btime relax_loop_diff_ne(u64, u64n)
  35.499 μs (1 allocation: 16 bytes)
0.0

julia> @btime relax_loop_diff_ne(u32, u32n)
  17.127 μs (1 allocation: 16 bytes)
0.0f0

julia> @btime broadcast!(identity, u32n, u32);
  9.433 μs (0 allocations: 0 bytes)

julia> @btime broadcast!(identity, u64n, u64);
  23.927 μs (0 allocations: 0 bytes)
```

First, this is way faster. Second, we see that SIMD is used (the `Float32` version is now roughly twice as fast as the `Float64` one). Third, we are within 2x of memcpy speed.

We can next check for the second problem (aliasing) by benchmarking

```
julia> function relax_loop_diff_ne(u::Array{T, 2}, unew::Array{T, 2}) where T
           nx, ny = size(u)
           du = zero(T)
           @inbounds for j in 2:ny-1
               @simd ivdep for i in 2:nx-1
                   unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                   #dif = abs(u[i, j] - unew[i, j])
                   #dif > du && (du = dif)
               end
           end
           du
       end;
```

This gives no significant additional advantage, so there is no reason to make a fuss about aliasing here.

So, what should you do about it? For example, check the error only every couple of iterations. In addition, you could try to make the `max` faster, possibly using SIMD.jl. In principle `Distances.jl` should have a reasonably fast implementation of the max-distance (and if not, then open an issue there).
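To illustrate the first suggestion, here is a minimal, hypothetical sketch (the function names, tolerance, and `check_every` value are made up): a plain sweep kernel, a separate `max`-reduction written with `ifelse` (a branchless form that may vectorize better than a branch, though this is not guaranteed), and a driver that only pays for the error estimate every few sweeps:

```julia
# Plain Jacobi sweep, no reduction (the fast kernel from above).
function sweep!(unew::Matrix{T}, u::Matrix{T}) where T
    nx, ny = size(u)
    @inbounds for j in 2:ny-1
        @simd for i in 2:nx-1
            unew[i, j] = T(0.25)*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
        end
    end
end

# Separate max-difference pass; `ifelse` avoids a branch in the reduction.
function maxdiff(u::Matrix{T}, unew::Matrix{T}) where T
    du = zero(T)
    @inbounds for j in 2:size(u, 2)-1
        @simd for i in 2:size(u, 1)-1
            dif = abs(u[i, j] - unew[i, j])
            du = ifelse(dif > du, dif, du)
        end
    end
    du
end

# Driver: run cheap sweeps, and only compute the (slower) error estimate
# every `check_every` iterations. Buffers are swapped between sweeps.
function relax!(u::Matrix{T}, unew::Matrix{T};
                tol = T(1e-4), check_every = 10, maxiter = 100_000) where T
    for iter in 1:maxiter
        sweep!(unew, u)
        if iter % check_every == 0 && maxdiff(u, unew) < tol
            return unew, iter
        end
        u, unew = unew, u
    end
    return u, maxiter
end
```

With `check_every = 10` you amortize the cost of the `max`-reduction over ten cheap sweeps, at the price of possibly doing up to nine extra sweeps after convergence, which is usually a good trade.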