Julia & Mojo Mandelbrot Benchmark

Since not exiting the loop exactly on time in mandelbrot_kernel does not change the output values (because of the inactive mask), perhaps checking the exit condition less often than every iteration can improve performance. For example, check the condition only when i is a power of 2.
…or sample a point from a bunch of vectors to get a guess at how many iterations nearby points can initially run without any exit checks.
…or some sort of function-informed branch prediction along those lines.
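Something like this, maybe, as a rough sketch of the idea against a masked SIMD.jl-style kernel (the function and variable names here are made up for illustration, not the benchmark's actual code):

using SIMD: Vec, vifelse

function kernel_sparse_exit(cr::Vec{N,Float64}, ci::Vec{N,Float64}, max_iters) where {N}
    zr, zi = cr, ci
    four = Vec{N,Float64}(4.0)
    escaped = Vec{N,Bool}(false)
    count = Vec{N,Int32}(0)
    for i = 1:max_iters
        # Only the horizontal exit test is skipped; the per-lane mask and the
        # counts are still updated every iteration, so the output is unchanged,
        # only the amount of wasted work after all lanes escape varies.
        ispow2(i) && all(escaped) && break
        escaped |= zr * zr + zi * zi > four
        zr, zi = zr * zr - zi * zi + cr, 2 * zr * zi + ci
        count = vifelse(escaped, count, count + Int32(1))
    end
    return count
end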

Yeah, I was thinking that. Basically do an inner loop:

@inline function mandelbrot_kernel(c)  # assumes `using LoopVectorization` for @turbo
    z = c
    iesc = 0
    for i = 1:8:MAX_ITERS
        @turbo for j in 1:8
            z = z * z + c
            if iesc == 0 && abs2(z) > 4
                iesc = i + j - 1  # i + j would over-count by one, since i starts at 1, 9, 17, …
            end
        end
        if iesc > 0
            return iesc
        end
    end
    return MAX_ITERS
end

Now you’re doing the inner j loop in chunks of exactly 8, and it should be able to turbo/unroll/etc that?

If the branch is still a problem, you could instead fill a vector of 8 values in the inner loop, compute all 8, and then check which is the first one that escapes.

something like:

@inline function mandelbrot_kernel(c)
    z = c
    absvals = fill(0.0, 8)  # note: this allocates on every call
    for i = 1:8:MAX_ITERS
        @turbo for j in 1:8
            z = z * z + c
            absvals[j] = abs2(z)
        end
        jfirst = findfirst(>(4.0), absvals)
        if !isnothing(jfirst)
            return i + jfirst - 1
        end
    end
    return MAX_ITERS
end

It is much better to parallelize an outer loop and perform multiple mandelbrot_kernel calls in parallel, like both my code and sdanisch's code do, than it is to try to parallelize mandelbrot_kernel's iterations.

2 Likes

Oh right, the inner @turbo moving along a scanline makes good sense; each pixel is totally independent.

Almost the only case where @simd does anything is where a loop is computing a floating-point reduction and allowing reassociation of + or * lets the compiler convert a left-to-right reduction into a vectorized reduction. But sum and prod and reduce already do this in a cleverer and faster way, so you’re often better off writing a functional version that’s also easier to read.
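For example (a made-up illustration, not from this thread), these two compute the same reduction, and the functional form leaves the reassociation entirely to sum:

function sumsq_loop(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)   # @simd allows reassociating the additions
        s += x[i] * x[i]
    end
    return s
end

sumsq_functional(x) = sum(abs2, x)   # Base.sum reassociates internally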

7 Likes

It’s going to be machine and size-dependent, but Base.sum lags behind for larger arrays that are still small enough to fit in cache:

Benchmarking over lengths 1:2 in a random order:
Base.sum:    15.415 ns (0 allocations: 0 bytes)
@simd sum:   17.703 ns (0 allocations: 0 bytes)
@turbo sum:  31.655 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:4 in a random order:
Base.sum:    25.153 ns (0 allocations: 0 bytes)
@simd sum:   27.272 ns (0 allocations: 0 bytes)
@turbo sum:  38.181 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:8 in a random order:
Base.sum:    46.939 ns (0 allocations: 0 bytes)
@simd sum:   60.607 ns (0 allocations: 0 bytes)
@turbo sum:  67.981 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:16 in a random order:
Base.sum:    106.627 ns (0 allocations: 0 bytes)
@simd sum:   135.112 ns (0 allocations: 0 bytes)
@turbo sum:  116.559 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:32 in a random order:
Base.sum:    391.493 ns (0 allocations: 0 bytes)
@simd sum:   408.817 ns (0 allocations: 0 bytes)
@turbo sum:  240.957 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:64 in a random order:
Base.sum:    1.193 μs (0 allocations: 0 bytes)
@simd sum:   1.038 μs (0 allocations: 0 bytes)
@turbo sum:  492.569 ns (0 allocations: 0 bytes)
Benchmarking over lengths 1:128 in a random order:
Base.sum:    2.879 μs (0 allocations: 0 bytes)
@simd sum:   2.239 μs (0 allocations: 0 bytes)
@turbo sum:  1.231 μs (0 allocations: 0 bytes)
Benchmarking over lengths 1:256 in a random order:
Base.sum:    7.309 μs (0 allocations: 0 bytes)
@simd sum:   5.779 μs (0 allocations: 0 bytes)
@turbo sum:  2.959 μs (0 allocations: 0 bytes)
Benchmarking over lengths 1:512 in a random order:
Base.sum:    19.213 μs (0 allocations: 0 bytes)
@simd sum:   15.818 μs (0 allocations: 0 bytes)
@turbo sum:  8.323 μs (0 allocations: 0 bytes)
Benchmarking over lengths 1:1024 in a random order:
Base.sum:    51.967 μs (0 allocations: 0 bytes)
@simd sum:   44.387 μs (0 allocations: 0 bytes)
@turbo sum:  27.572 μs (0 allocations: 0 bytes)

This is with a 10980XE Intel CPU.

If we go larger than this, they'll all be equally slow and memory-bound.

Script:

using Random, BenchmarkTools, LoopVectorization
x = rand(1024); y = similar(x);

@noinline function sum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        s += x[i]
    end
    s
end
@noinline function sum_turbo(x)
    s = zero(eltype(x))
    @turbo for i in eachindex(x)
        s += x[i]
    end
    s
end
function map_randlengths!(f, y, x, perm)
  @assert length(perm) <= min(length(y),length(x))
  for i = eachindex(perm)
    y[i] = f(@view(x[1:perm[i]]))
  end
end
for n = 1:10  # l = 2, 4, …, 1024, matching the results above
  l = 2^n
  print("Benchmarking over lengths 1:$l in a random order:\nBase.sum:  ")
  perm = randperm(l);
  @btime map_randlengths!(sum, $y, $x, $perm)
  print("@simd sum: ")
  @btime map_randlengths!(sum_simd, $y, $x, $perm)
  print("@turbo sum:")
  @btime map_randlengths!(sum_turbo, $y, $x, $perm)
end
7 Likes

Can we improve reduce/sum/prod then? After all they should have full license to associate operations however they want. Of course there’s also recursive pairwise summation for accuracy, which may be contributing some overhead.

7 Likes

I think we can improve them without loss of accuracy.
It’s worth asking what the best approach to an implementation is. E.g., we don’t have details about vector length at compile time at the Julia level within Base (and LoopVectorization’s approach is hacky and causes problems for static compilation; we want to cut down how much that shows up in the package ecosystem rather than increase it).
That’s information that LLVM has and uses, and it would be important for an optimal implementation.

I don’t think that should be a problem.
In my example on a system with AVX512, for arrays of length >= 64, LoopVectorization.jl was using 64 parallel accumulators.
It then recursively combines these 64 separate sums at the end.

(For smaller arrays, it uses fewer parallel accumulators; it never foldls. This is why performance was bad for very small arrays, e.g. lengths 1:4. But by the time we’re summing all arrays of lengths 1:64, it was the fastest approach by a wide margin on my computer.)

  1. This is similar to a particular recursive pairwise summation algorithm.
  2. It is very fast.
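To make the idea concrete, here is a hand-written sketch with 4 plain scalar accumulators (LoopVectorization actually uses SIMD vectors and many more accumulators; the name sum_4acc is made up):

function sum_4acc(x::AbstractVector{T}) where {T}
    s1 = s2 = s3 = s4 = zero(T)
    n = length(x)
    i = 1
    @inbounds while i + 3 <= n        # four independent dependency chains
        s1 += x[i]
        s2 += x[i + 1]
        s3 += x[i + 2]
        s4 += x[i + 3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)         # combine the partial sums pairwise at the end
    @inbounds while i <= n            # remainder loop
        s += x[i]
        i += 1
    end
    return s
end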

Thus, we can pick a point (that’s a multiple of the cache-line size) just below where the number of parallel accumulators this algorithm uses crosses the number the current pairwise summation implementation uses.
That’d then be the new point of switching from the recursion to the base case.

I don’t know the rate at which the number of parallel accumulators grows for our pairwise summation. It has too many layers of abstraction for me to feel like combing through them at the moment.
However, I suspect it will take a fairly large array before our loop exceeds the maximum number of parallel accumulators that LoopVectorization.jl will use (which is 8 * the SIMD vector width).

I am guessing that this base case will be very large, large enough that its overhead is negligible.
Especially since we may already be memory-bandwidth bound by that point.

Furthermore, we should modify the base case to not do the full reduction to a scalar, but to return SIMD vectors of the parallel accumulators. Only at the very end do we do a reduction to a scalar.

EDIT: Do you think something like this would make sense as a short GSoC project for next year?

6 Likes

New faster implementation:

function compute_mandelbrot_polyturbo_stride!(result)  # needs Polyester (@batch) and LoopVectorization (@turbo)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)
    @batch stride=true for i = 1:xn
        x = x_range[i]
        @turbo for j = 1:yn
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(x, y)
        end
    end
    return result
end

I added the stride=true option to Polyester, which performs better here:

julia> @benchmark compute_mandelbrot_polyturbo_stride!($result)
BenchmarkTools.Trial: 1181 samples with 1 evaluation.
 Range (min … max):  721.232 μs …   5.919 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     728.151 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   844.556 μs ± 360.679 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▄▁▁▃                ▁                                        
  ███████▆▁▃▄▅▁▁▃▅▇▅▁▁▁▇█▇▅▄▃▆▄▆▇▆▇▅▄▄▅▅▄▅▄▁▁▁▁▁▁▁▃▅▅▆▇▅▅▄▃▃▁▁▄ ▇
  721 μs        Histogram: log(frequency) by time       1.81 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark compute_mandelbrot_polyturbo!($result)
BenchmarkTools.Trial: 645 samples with 1 evaluation.
 Range (min … max):  1.491 ms …   5.557 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.517 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.550 ms ± 202.571 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▅▅▂▂                                                       
  ██████▆▆█▇█▅▆▁▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▇
  1.49 ms      Histogram: log(frequency) by time      2.26 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mandelbrot!($result, Val(8))
BenchmarkTools.Trial: 253 samples with 1 evaluation.
 Range (min … max):  3.793 ms …  13.201 ms  ┊ GC (min … max): 0.00% … 67.68%
 Time  (median):     3.865 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.957 ms ± 719.601 μs  ┊ GC (mean ± σ):  1.79% ±  6.30%

    ▁ ▂█▆▆▅█ ▄▆                                                
  ▄▃█▇██████▇██▇▅▄▄▆▄▅▄▁▄▃▄▃▃▁▃▁▃▄▃▃▁▁▃▁▁▃▃▃▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▃ ▃
  3.79 ms         Histogram: frequency by time        4.25 ms <

 Memory estimate: 598.83 KiB, allocs estimate: 5027.

julia> @benchmark mandelbrot_turbo!($result, Val(8))
BenchmarkTools.Trial: 796 samples with 1 evaluation.
 Range (min … max):  1.115 ms …   4.934 ms  ┊ GC (min … max): 0.00% … 70.76%
 Time  (median):     1.202 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.253 ms ± 429.366 μs  ┊ GC (mean ± σ):  3.80% ±  8.28%

   █                                                           
  ██▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  1.11 ms         Histogram: frequency by time        4.83 ms <

 Memory estimate: 610.58 KiB, allocs estimate: 5403.

It’ll be released as part of Polyester 0.7.6 shortly; PR: Strided pattern support by chriselrod · Pull Request #118 · JuliaSIMD/Polyester.jl · GitHub
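For intuition, the general shape of a strided distribution looks something like this with plain Base threads (just a sketch of the pattern, not Polyester's actual implementation): thread t takes every nt-th column, so the expensive columns near the set are spread across all threads instead of landing in one contiguous block.

function compute_strided!(result, kernel, x_range, y_range)
    nt = Threads.nthreads()
    Threads.@threads :static for t = 1:nt
        for i = t:nt:length(x_range)      # thread t handles columns t, t+nt, t+2nt, …
            x = x_range[i]
            for j = eachindex(y_range)
                result[j, i] = kernel(x, y_range[j])
            end
        end
    end
    return result
end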

16 Likes

On my laptop with a 4-core Intel i5 I get 40 ms on one thread, 19 ms with polyturbo!, and 11 ms with:

function compute_mandelbrot_polyturbo_slice!(result)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)

    @batch per = thread for istart = 1:160
        for islice = 1:div(xn, 160)
            i = (islice - 1) * 160 + istart
            x = x_range[i]
            @turbo for j = 1:yn
                y = y_range[j]
                result[j, i] = mandelbrot_kernel(x, y)
            end
        end
    end
    return result
end

which distributes the workload more evenly across threads, perhaps like stride=true does (not tried).

Now I can get down to 5 ms (which is what Mojo achieves on my machine) with SIMD.jl:

using BenchmarkTools
using Plots
using SIMD: Vec, VecRange, vifelse

@inline mulsub(x, y, z) = muladd(x, y, -z) # LLVM should optimize that to a single instruction
@inline mandelbrot_function((zr, zi), (cr, ci)) =
    (mulsub(zr, zr, mulsub(zi, zi, cr)), muladd(2zr, zi, ci))

#============= SIMD ================#

function mandelbrot(cr::Real, ci::Vec{N}, max_iters, max_abs2) where {N}
    vec(value::T) where {T} = Vec{N,T}(value)
    max_abs2 = typeof(ci)(max_abs2)
    cr, count, mask_out = map(vec, (cr, Int32(1), false))
    zr, zi = cr, ci
    for _ = 1:max_iters
        all(mask_out) && break
        mask_out |= @fastmath zr * zr + zi * zi > max_abs2
        zr, zi = mandelbrot_function((zr, zi), (cr, ci))
        count = vifelse(mask_out, count, count + 1)
    end
    return count
end

function sample_line!(::Val{vlen}, nj, i, result, x, y, max_iters, max_abs2) where {vlen}
    for jstart = 1:vlen:nj
        j = VecRange{vlen}(jstart)
        result[j, i] = mandelbrot(x[i], y[j], max_iters, max_abs2)
    end
    return nothing
end

function sample!(result, x, y, max_iters, max_abs2, vlen::Val = Val(1), stripe_size = 0)
    if stripe_size > 0 # distribute each stripe over threads
        Threads.@threads :static for ii = 1:stripe_size
            nstripe = div(size(result, 2), stripe_size) # divide `1:ni` into `nstripe` stripes of size `stripe_size`
            for ss = 1:nstripe
                i = ii + (ss - 1) * stripe_size # ss = stripe index
                sample_line!(vlen, size(result, 1), i, result, x, y, max_iters, max_abs2)
            end
        end
    else
        for i = 1:size(result, 2)
            sample_line!(vlen, size(result, 1), i, result, x, y, max_iters, max_abs2)
        end
    end
end

xn, yn = 960, 960
xmin, xmax = -2.0, 0.6
ymin, ymax = -1.5, 1.5
abs2_max, iter_max = 4.0, 200

x = Array(range(xmin, xmax, xn))
y = Array(range(ymin, ymax, yn))
result = zeros(Int32, yn, xn)

display(@benchmark sample!(result, x, y, iter_max, abs2_max, Val(16), 160))
display(heatmap(@. log(1 + result)))

Notice the vector length of 16, which is 4x my native vector length for Float64 (256-bit AVX, i.e. 4 Float64 lanes).

11 Likes

So what is the current comparison on Mojo vs Julia’s best implementation?

3 Likes

We don’t know; most people here haven’t installed Mojo because it’s a pain in the butt to install. If you have a copy of Mojo and a copy of Julia installed, though, I’d encourage you to run the benchmarks and report back.

8 Likes

Mojo is not so hard to install, but you need to register with Modular. On my unremarkable laptop Mojo is about 10% faster than the best Julia variant.

6 Likes

Your SIMD.jl version is a bit faster than Mojo for me (unremarkable laptop as well).
Mojo is around 5.5 ms and your version is around 4.1-4.8 ms.

7 Likes

Same here: @dubosipsl's version takes 5 ms and Mojo takes 5.3 ms, both of which are much faster than the other codes in this thread.

Having said that, Mojo wins because you can use the :fire: emoji as a file extension.

I wonder if we can update the benchmarks game code? I guess we would need to get rid of the SIMD.jl dependency.

4 Likes

[image]

41 Likes

If I were to try and build a complex-valued or quaternion-valued neural network, which approach would be the better one: an array of quaternion-valued structs, or the other way around, a quaternion-valued struct of arrays?

I find it surprising that people seldom try these kinds of networks. The search space is pruned more nicely, and many problems fit elegantly into these paradigms. PyTorch has complex-valued neural network support (and I tried it for my project, though I didn’t have time to optimize for convergence; these networks don’t yet have strong heuristics developed for their convergence).
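For what it's worth, in Julia the usual way to try both layouts without rewriting the model is StructArrays.jl; a small hypothetical illustration with ComplexF64 elements:

using StructArrays

aos = randn(ComplexF64, 1024)   # array of structs: re/im interleaved in one buffer
soa = StructArray(aos)          # struct of arrays: soa.re and soa.im are separate contiguous vectors
sum(abs2, soa)                  # still behaves like an AbstractVector{ComplexF64}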

2 Likes

12 posts were split to a new topic: Why doesn’t Julia fma automatically?