Since not exiting the loop exactly on time in `mandelbrot_kernel`

, does not change the output values (because of `inactive`

mask), perhaps checking the condition less than every time can improve performance. For example, check condition when `i`

is a power of 2.

…or sampling a point in a bunch of vectors to get a guess as to initial loop run without exit checks of closeby points.

…or some sort of function informed branch prediction of this sort.

yeah, I was thinking that. basically do an internal loop:

```
@inline function mandelbrot_kernel(c)
z = c
iesc = 0
for i = 1:8:MAX_ITERS
@turbo for j in 1:8
z = z * z + c
if iesc == 0 && abs2(z) > 4
iesc = i+j # or i+j-1 ? off by one error?
end
end
if iesc > 0
return iesc
end
end
return MAX_ITERS
end
```

Now you’re doing the inner j loop in chunks of exactly 8, and it should be able to turbo/unroll/etc that?

if the branch is still a problem, you could in the inner loop fill a vector of 8 values, and then just compute all 8 and check to see which is the first one that escapes.

something like:

```
@inline function mandelbrot_kernel(c)
z = c
iesc = 0
absvals = fill(0.0,8)
for i = 1:8:MAX_ITERS
@turbo for j in 1:8
z = z * z + c
absvals[j] = abs2(z)
end
jfirst = findfirst(>(4.0),absvals)
if !isnothing(jfirst)
return i + jfirst
end
end
return MAX_ITERS
end
```

It is much better to parallelize an outer loop, and perform multiple `mandelbrot_kernel`

s in parallel like both my code and sdanisch’s code does, than it is to try and parallelize the `mandelbrot_kernel`

’s iterations.

Oh right, the inner_turbo, moving along a scanline makes good sense, each pixel is totally independent.

Almost the only case where `@simd`

does anything is where a loop is computing a floating-point reduction and allowing reassociation of `+`

or `*`

lets the compiler convert a left-to-right reduction into a vectorized reduction. But `sum`

and `prod`

and `reduce`

already do this in a cleverer and faster way, so you’re often better off writing a functional version that’s also easier to read.

It’s going to be machine and size-dependent, but `Base.sum`

lags behind for larger arrays that are still small enough to fit in cache:

```
Benchmarking over lengths 1:2 in a random order:
Base.sum: 15.415 ns (0 allocations: 0 bytes)
@simd sum: 17.703 ns (0 allocations: 0 bytes)
@turbo sum: 31.655 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:4 in a random order:
Base.sum: 25.153 ns (0 allocations: 0 bytes)
@simd sum: 27.272 ns (0 allocations: 0 bytes)
@turbo sum: 38.181 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:8 in a random order:
Base.sum: 46.939 ns (0 allocations: 0 bytes)
@simd sum: 60.607 ns (0 allocations: 0 bytes)
@turbo sum: 67.981 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:16 in a random order:
Base.sum: 106.627 ns (0 allocations: 0 bytes)
@simd sum: 135.112 ns (0 allocations: 0 bytes)
@turbo sum: 116.559 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:32 in a random order:
Base.sum: 391.493 ns (0 allocations: 0 bytes)
@simd sum: 408.817 ns (0 allocations: 0 bytes)
@turbo sum: 240.957 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:64 in a random order:
Base.sum: 1.193 μs (0 allocations: 0 bytes)
@simd sum: 1.038 μs (0 allocations: 0 bytes)
@turbo sum: 492.569 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:128 in a random order:
Base.sum: 2.879 μs (0 allocations: 0 bytes)
@simd sum: 2.239 μs (0 allocations: 0 bytes)
@turbo sum: 1.231 μs (0 allocations: 0 bytes)
Benchmarking over lengths 1:256 in a random order:
Base.sum: 7.309 μs (0 allocations: 0 bytes)
@simd sum: 5.779 μs (0 allocations: 0 bytes)
@turbo sum: 2.959 μs (0 allocations: 0 bytes)
Benchmarking over lengths 1:512 in a random order:
Base.sum: 19.213 μs (0 allocations: 0 bytes)
@simd sum: 15.818 μs (0 allocations: 0 bytes)
@turbo sum: 8.323 μs (0 allocations: 0 bytes)
Benchmarking over lengths 1:1024 in a random order:
Base.sum: 51.967 μs (0 allocations: 0 bytes)
@simd sum: 44.387 μs (0 allocations: 0 bytes)
@turbo sum: 27.572 μs (0 allocations: 0 bytes)
```

This is with a 10980XE Intel CPU.

If we get larger than this, they’ll all be the same slow and memory bound.

Script:

```
using Random, BenchmarkTools, LoopVectorization
x = rand(1024); y = similar(x);
@noinline function sum_simd(x)
s = zero(eltype(x))
@simd for i in eachindex(x)
s += x[i]
end
s
end
@noinline function sum_turbo(x)
s = zero(eltype(x))
@turbo for i in eachindex(x)
s += x[i]
end
s
end
function map_randlengths!(f, y, x, perm)
@assert length(perm) <= min(length(y),length(x))
for i = eachindex(perm)
y[i] = f(@view(x[1:perm[i]]))
end
end
for n = 5:10
l = 2^n
print("Benchmarking over lengths 1:$l in a random order:\nBase.sum: ")
perm = randperm(l);
@btime map_randlengths!(sum, $y, $x, $perm)
print("@simd sum: ")
@btime map_randlengths!(sum_simd, $y, $x, $perm)
print("@turbo sum:")
@btime map_randlengths!(sum_turbo, $y, $x, $perm)
end
```

Can we improve `reduce`

/`sum`

/`prod`

then? After all they should have full license to associate operations however they want. Of course there’s also recursive pairwise summation for accuracy, which may be contributing some overhead.

I think we can improve them without loss of accuracy.

It’s worth asking what the best approach at an implementation is. E.g., we don’t have details about vector length at compile time at the Julia level within Base (and LoopVectorization’s approach is hacky and causes problems for static compilation; we want to cut down how much that shows up in the package ecosystem rather than increase it).

That’s information that LLVM has and uses that’d be important for an optimal implementation.

I don’t think that should be a problem.

In my example on a system with AVX512, for arrays of length >= 64, LoopVectorization.jl was using 64 parallel accumulators.

It then recursively combines these 64 separate sums at the end.

(For smaller arrays, it uses fewer parallel accumulators; it never `foldl`

s. This is why performance was bad for very small arrays, e.g. lengths `1:4`

. But by the time we’re summing all arrays of lengths `1:64`

, it was the fastest approach by a wide margin on my computer.)

- this is similar to a particular recursive pairwise summation algorithm
- it is very fast.

Thus, we can pick a point (that’s a multiple of cacheline size) just below the intersection of where the number of parallel accumulators this algorithm uses, and however many the current pairwise summation implementation uses.

That’d then be the new point of switching from the recursion to the base case.

I don’t know the rate at which the number of parallel accumulators grow for our pairwise summation. It has too many layers of abstraction for me to feel like combing through them at the moment.

However, I suspect it will take a fairly large array before our loop exceeds the maximum number of parallel accumulators that `LoopVectorization.jl`

will use (which is 8*SIMD vector width).

I am guessing that this base case will be very large, much too large to have much overhead.

Especially because it might be that we already end up in the realm of being memory bandwidth dominated by that point.

Furthermore, we should modify the base case to not do the full reduction to a scalar, but to return SIMD vectors of the parallel accumulators. Only at the very end do we do a reduction to a scalar.

EDIT: Do you think something like this would make sense as a short gsoc for next year?

New faster implementation:

```
function compute_mandelbrot_polyturbo_stride!(result)
x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, xn)
@batch stride=true for i = 1:xn
c = 13
m = xn
a = m + 1
x = x_range[i]
@turbo for j = 1:yn
y = y_range[j]
result[j, i] = mandelbrot_kernel(x, y)
end
end
return result
end
```

I added the `stride=true`

option to Polyester, which performs better here:

```
julia> @benchmark compute_mandelbrot_polyturbo_stride!($result)
BenchmarkTools.Trial: 1181 samples with 1 evaluation.
Range (min … max): 721.232 μs … 5.919 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 728.151 μs ┊ GC (median): 0.00%
Time (mean ± σ): 844.556 μs ± 360.679 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▃▄▁▁▃ ▁
███████▆▁▃▄▅▁▁▃▅▇▅▁▁▁▇█▇▅▄▃▆▄▆▇▆▇▅▄▄▅▅▄▅▄▁▁▁▁▁▁▁▃▅▅▆▇▅▅▄▃▃▁▁▄ ▇
721 μs Histogram: log(frequency) by time 1.81 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark compute_mandelbrot_polyturbo!($result)
BenchmarkTools.Trial: 645 samples with 1 evaluation.
Range (min … max): 1.491 ms … 5.557 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.517 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.550 ms ± 202.571 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▅▅▂▂
██████▆▆█▇█▅▆▁▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▇
1.49 ms Histogram: log(frequency) by time 2.26 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mandelbrot!($result, Val(8))
BenchmarkTools.Trial: 253 samples with 1 evaluation.
Range (min … max): 3.793 ms … 13.201 ms ┊ GC (min … max): 0.00% … 67.68%
Time (median): 3.865 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.957 ms ± 719.601 μs ┊ GC (mean ± σ): 1.79% ± 6.30%
▁ ▂█▆▆▅█ ▄▆
▄▃█▇██████▇██▇▅▄▄▆▄▅▄▁▄▃▄▃▃▁▃▁▃▄▃▃▁▁▃▁▁▃▃▃▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▃ ▃
3.79 ms Histogram: frequency by time 4.25 ms <
Memory estimate: 598.83 KiB, allocs estimate: 5027.
julia> @benchmark mandelbrot_turbo!($result, Val(8))
BenchmarkTools.Trial: 796 samples with 1 evaluation.
Range (min … max): 1.115 ms … 4.934 ms ┊ GC (min … max): 0.00% … 70.76%
Time (median): 1.202 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.253 ms ± 429.366 μs ┊ GC (mean ± σ): 3.80% ± 8.28%
█
██▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
1.11 ms Histogram: frequency by time 4.83 ms <
Memory estimate: 610.58 KiB, allocs estimate: 5403.
```

It’ll be released as part of Polyester 0.7.6 shortly; PR: Strided pattern support by chriselrod · Pull Request #118 · JuliaSIMD/Polyester.jl · GitHub

On my laptop with a 4-core Intel i5 I get 40ms on one thread, 19 ms with `polyturbo!`

and 11 ms with :

```
function compute_mandelbrot_polyturbo_slice!(result)
x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, xn)
@batch per = thread for istart = 1:160
for islice = 1:div(xn, 160)
i = (islice - 1) * 160 + istart
x = x_range[i]
@turbo for j = 1:yn
y = y_range[j]
result[j, i] = mandelbrot_kernel(x, y)
end
end
end
return result
end
```

which distributes the workload more evenly on threads, maybe like `stride=true`

does (not tried).

Now I can get down to *5ms* (which is what Mojo achieves on my machine) with SIMD.jl

```
using BenchmarkTools
using Plots
using SIMD: Vec, VecRange, vifelse
@inline mulsub(x, y, z) = muladd(x, y, -z) # LLVM should optimize that to a single instruction
@inline mandelbrot_function((zr, zi), (cr, ci)) =
(mulsub(zr, zr, mulsub(zi, zi, cr)), muladd(2zr, zi, ci))
#============= SIMD ================#
function mandelbrot(cr::Real, ci::Vec{N}, max_iters, max_abs2) where {N}
vec(value::T) where {T} = Vec{N,T}(value)
max_abs2 = typeof(ci)(max_abs2)
cr, count, mask_out = map(vec, (cr, Int32(1), false))
zr, zi = cr, ci
for _ = 1:max_iters
all(mask_out) && break
mask_out |= @fastmath zr * zr + zi * zi > max_abs2
zr, zi = mandelbrot_function((zr, zi), (cr, ci))
count = vifelse(mask_out, count, count + 1)
end
return count
end
function sample_line!(::Val{vlen}, nj, i, result, x, y, max_iters, max_abs2) where {vlen}
for jstart = 1:vlen:nj
j = VecRange{vlen}(jstart)
result[j, i] = mandelbrot(x[i], y[j], max_iters, max_abs2)
end
return nothing
end
function sample!(result, x, y, max_iters, max_abs2, vlen::Val = Val(1), stripe_size = 0)
if stripe_size > 0 # distribute each stripe over threads
Threads.@threads :static for ii = 1:stripe_size
nstripe = div(size(result, 2), stripe_size) # divide `1:ni` into `nstripe` stripes of size `stripe_size`
for ss = 1:nstripe
i = ii + (ss - 1) * stripe_size # ss = stripe index
sample_line!(vlen, size(result, 1), i, result, x, y, max_iters, max_abs2)
end
end
else
for i = 1:size(result, 2)
sample_line!(vlen, size(result, 1), i, result, x, y, max_iters, max_abs2)
end
end
end
xn, yn = 960, 960
xmin, xmax = -2.0, 0.6
ymin, ymax = -1.5, 1.5
abs2_max, iter_max = 4.0, 200
x = Array(range(xmin, xmax, xn))
y = Array(range(ymin, ymax, yn))
result = zeros(Int32, yn, xn)
display(@benchmark sample!(result, x, y, iter_max, abs2_max, Val(16), 160))
display(heatmap(@. log(1 + result)))
```

Notice the vector length of 16, which is 4x my native vector length for Float64 (avx256).

So what is the current comparison on Mojo vs Julia’s best implementation?

We don’t know, most people here haven’t installed Mojo because it’s a pain in the butt to install. If you have a copy of Mojo installed and a copy of Julia installed though, I’d encourage you to run the benchmarks and report back.

Mojo is not so hard to install but you need to register with Modular. On my unremarkable laptop Mojo is about 10% faster than the best Julia variant.

your SIMD.jl version is bit faster than Mojo for me (unremarkable laptop as well).

Mojo is around 5.5ms and your version is around 4.1-4.8ms

Same here: @dubosipsl 's takes 5ms and mojo takes 5.3ms. Both of which are much faster than other codes in this thread.

Having said that Mojo wins because you can use the emoji as a file extension.

I wonder if we can update the benchmark games code? I guess that we would need to get rid of the SIMD dependency.

If I were to try and build a complex valued or Quaternion valued neural network, which approach would be the better one? (array of Quaternion valued structs, or the other way around Quaternion Valued Struct of arrays).

I find it surprising people seldom try these type of networks. The search space is pruned nicer and many problems elegantly fit into these paradigms. PyTorch has complex valued neural network support (and I tried it for my project, though didn’t have time to optimize for convergence - these networks don’t have the strong heuristics developed for their convergence).