Julia & Mojo Mandelbrot Benchmark

Well, you have to define a ComplexSIMD. It would be good if this kind of reasoning were accessible to newcomers, or better yet, hidden from them.
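For reference, the type in question is small. Here is a minimal sketch of its shape (modeled on the solution discussed above; treat the details as illustrative, not authoritative):

struct ComplexSIMD{N,T}
  re::NTuple{N,T}
  im::NTuple{N,T}
end

# Lane-wise complex arithmetic written as tuple broadcasts, which LLVM can
# turn into SIMD instructions. Note abs2 returns per-lane values.
Base.:*(a::ComplexSIMD, b::ComplexSIMD) =
  ComplexSIMD(a.re .* b.re .- a.im .* b.im, a.re .* b.im .+ a.im .* b.re)
Base.:+(a::ComplexSIMD, b::ComplexSIMD) = ComplexSIMD(a.re .+ b.re, a.im .+ b.im)
Base.abs2(z::ComplexSIMD) = z.re .* z.re .+ z.im .* z.im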

6 Likes

In the Mojo case that’s part of a library, which is loaded on top of the code. That could go into a library in Julia as well. But what I find more interesting here is that the reasoning and the code pattern apply to any composite type of similar structure, so we can use them to optimize other code completely unrelated to complex-number arithmetic.

12 Likes

Isn’t this almost the typical “array-of-structs to struct-of-arrays” transformation, just with the added help to the compiler of pre-chunking to length 4? I guess you might get better memory locality when the im fields aren’t all in one separate vector and the re fields in another.
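For anyone unfamiliar with the terms, here is an illustrative sketch of the layouts being compared (my example, reusing the ComplexSIMD sketch above):

# Illustrative only: three ways to lay out 1024 complex numbers in memory.
N = 1024
aos     = Vector{ComplexF64}(undef, N)         # array-of-structs: re,im,re,im,...
soa_re  = Vector{Float64}(undef, N)            # struct-of-arrays: all re in one vector,
soa_im  = Vector{Float64}(undef, N)            #   all im in another
chunked = Vector{ComplexSIMD{4,Float64}}(undef, N ÷ 4)  # pre-chunked: 4 re's, then 4 im's, repeating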

julia> @benchmark compute_mandelbrot!($result)
BenchmarkTools.Trial: 68 samples with 1 evaluation.
 Range (min … max):  14.030 ms …  15.901 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.824 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.836 ms ± 576.959 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ██        ▅                        ▂          ▂▂
  █▁████▁▅█▅▁██▁▁▁▁█▁▅▁▁█▅▅▁█▁▅▅▁▅▅█▁█▅█▁▁█▅▁█▁▅▅▁██▅▅█▁▅▁▁▁▅▅ ▁
  14 ms           Histogram: frequency by time         15.9 ms <

 Memory estimate: 18.69 KiB, allocs estimate: 177.

julia> @benchmark compute_mandelbrot_turbo!($result)
BenchmarkTools.Trial: 404 samples with 1 evaluation.
 Range (min … max):  2.460 ms …  2.946 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.462 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.477 ms ± 77.247 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

julia> versioninfo()
Julia Version 1.9.3-DEV.0
Commit 6fc1be04ee (2023-07-06 14:55 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 28 on 28 virtual cores

Code:

using LoopVectorization, VectorizationBase

# using Plots 

const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200

@inline function mandelbrot_kernel(c)
  z = c
  for i = 1:MAX_ITERS
    z = z * z + c
    if abs2(z) > 4
      return i
    end
  end
  return MAX_ITERS
end
@inline function mandelbrot_kernel(r::Vec{W}, i::Vec{W}) where {W}
  z = c = complex(r, i)  # `i` is the imaginary part here; the loop below reuses the name
  inactive = VectorizationBase.Mask(VectorizationBase.zero_mask(Val(W)))
  imask = VectorizationBase.vbroadcast(Val(W), 0)
  for i = 1:MAX_ITERS
    z = muladd(z, z, c)
    imask = ifelse(inactive, imask, i)  # lanes still active record the current iteration
    inactive |= abs2(z) > 4             # lanes that escape go inactive
    VectorizationBase.vall(inactive) && return imask  # all lanes escaped: done
  end
  return imask
end
@inline function mandelbrot_kernel(r::VecUnroll, i::VecUnroll)
  VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r), VectorizationBase.data(i)))
end

function compute_mandelbrot!(result)

  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)

  Threads.@threads for j = 1:yn
    for i = 1:xn
      x = x_range[i]
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(complex(x, y))
    end
  end
  return result
end
compute_mandelbrot() = compute_mandelbrot!(zeros(yn, xn))
function compute_mandelbrot_turbo!(result)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)
  
  @tturbo for j = 1:yn
    for i = 1:xn
      x = x_range[i]
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(x, y)
    end
  end
  return result
end
compute_mandelbrot_turbo() = compute_mandelbrot_turbo!(zeros(yn, xn))
result = zeros(yn, xn);

@benchmark compute_mandelbrot!($result)
@benchmark compute_mandelbrot_turbo!($result)

EDIT:
I get

julia> @benchmark mandelbrot($result, $x_range, $y_range, Val(4))
BenchmarkTools.Trial: 312 samples with 1 evaluation.
 Range (min … max):  3.094 ms …   7.088 ms  ┊ GC (min … max): 0.00% … 52.95%
 Time  (median):     3.146 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.199 ms ± 438.872 μs  ┊ GC (mean ± σ):  1.49% ±  5.96%

  █▅
  ██▇▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
  3.09 ms      Histogram: log(frequency) by time      6.98 ms <

 Memory estimate: 593.08 KiB, allocs estimate: 4843.

from @sdanisch’s example.

I also see better performance with more threads, but LV is limiting itself to one thread per core and not using hyperthreads.

8 Likes

New best:

julia> @benchmark compute_mandelbrot_polyturbo!($result)
BenchmarkTools.Trial: 652 samples with 1 evaluation.
 Range (min … max):  1.494 ms …   4.784 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.520 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.533 ms ± 147.023 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▃        ▂▄▆██▆▃  ▁         ▂
  ██▇▄▁▁▆▄▁▇███████▇███▇▆▇███▆▆█████▇▅▇▅▁▇▄▅▇▆▅▁▄▅▁▁▁▁▁▁▁▁▁▁▄ █
  1.49 ms      Histogram: log(frequency) by time       1.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Code:

using LoopVectorization, VectorizationBase, Polyester

# using Plots 

const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200

@inline function mandelbrot_kernel(c)
  z = c
  for i = 1:MAX_ITERS
    z = z * z + c
    if abs2(z) > 4
      return i
    end
  end
  return MAX_ITERS
end
@inline function mandelbrot_kernel(r, i::Vec{W}) where {W}
  z = c = complex(r,i)
  inactive = VectorizationBase.Mask(VectorizationBase.zero_mask(Val(W)))
  imask = VectorizationBase.vbroadcast(Val(W), 0)
  for i = 1:MAX_ITERS
    z = muladd(z, z, c)
    imask = ifelse(inactive, imask, i)
    inactive |= abs2(z) > 4
    VectorizationBase.vall(inactive) && return imask
  end
  return imask
end
@inline function mandelbrot_kernel(r::VecUnroll, i::VecUnroll)
  VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r), VectorizationBase.data(i)))
end
@inline function mandelbrot_kernel(r::Vec, i::VecUnroll)
  VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, r, VectorizationBase.data(i)))
end
@inline function mandelbrot_kernel(r::VecUnroll, i::Vec)
  VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r), i))
end

function compute_mandelbrot!(result)

  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)

  Threads.@threads for j = 1:yn
    for i = 1:xn
      x = x_range[i]
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(complex(x, y))
    end
  end
  return result
end
compute_mandelbrot() = compute_mandelbrot!(zeros(yn, xn))
function compute_mandelbrot_turbo!(result)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)
  
  @tturbo for j = 1:yn
    for i = 1:xn
      x = x_range[i]
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(x, y)
    end
  end
  return result
end

function compute_mandelbrot_polyturbo!(result)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)
  
  @batch per=thread for i = 1:xn
    x = x_range[i]
    @turbo for j = 1:yn
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(x, y)
    end
  end
  return result
end
compute_mandelbrot_turbo() = compute_mandelbrot_turbo!(zeros(yn, xn))
compute_mandelbrot_polyturbo() = compute_mandelbrot_polyturbo!(zeros(yn, xn))
result = zeros(yn, xn);

@benchmark compute_mandelbrot!($result)
@benchmark compute_mandelbrot_turbo!($result)
@benchmark compute_mandelbrot_polyturbo!($result)
4 Likes

so that’s 3x faster on top of 8x faster (than the original Julia solution)? Which would mean this is 6-7x faster than Mojo?

2 Likes

I was curious and wanted to try the benchmark out myself – turns out you can only install Mojo on Ubuntu currently :laughing:

5 Likes

…but Mojo is new and not yet as optimised as a long-standing language like Julia… :rofl:

7 Likes

Interesting!
On my PC, my version is still fastest…
I get:

compute_mandelbrot!:           18.218 ms (72 allocations: 9.03 KiB)
compute_mandelbrot_turbo!:     4.224 ms (0 allocations: 0 bytes)
compute_mandelbrot_polyturbo!: 3.799 ms (0 allocations: 0 bytes)
My version:                    3.024 ms (4824 allocations: 682.48 KiB)

Although I’m surprised that my PC seems to be pretty slow compared to the other posters’ machines here…
I have a Ryzen 9 7900X 12-core, which I thought was one of the faster CPUs on the market.

Edit:
The original version seems to benefit much more from 24 threads…
The first values were with 12 threads, while these are with 24:

@btime compute_mandelbrot!($result);
   9.763 ms (142 allocations: 17.97 KiB)
@btime compute_mandelbrot_turbo!($result);
   4.617 ms (0 allocations: 0 bytes)
@btime compute_mandelbrot_polyturbo!($result);
   2.866 ms (0 allocations: 0 bytes)
@btime mandelbrot(result, x_range, y_range, Val(4));
   2.372 ms (4836 allocations: 682.86 KiB)
3 Likes

Interesting. This is the fastest for me so far (a modification of your version to use @turbo):

julia> @benchmark mandelbrot_turbo!($result, Val(8))
BenchmarkTools.Trial: 804 samples with 1 evaluation.
 Range (min … max):  988.264 μs …   4.945 ms  ┊ GC (min … max): 0.00% … 72.48%
 Time  (median):       1.187 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.240 ms ± 425.295 μs  ┊ GC (mean ± σ):  3.74% ±  8.22%

    ▇█▂
  ▄▄███▆▄▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄ ▇
  988 μs        Histogram: log(frequency) by time       4.77 ms <

Memory estimate: 610.55 KiB, allocs estimate: 5402.

It uses the same mandelbrot_kernel definition from above, and:

function inner_turbo(result, x_range, y_range, x, ::Val{N}) where N
  ci = x_range[x]
  @turbo for j = eachindex(y_range)
    y = y_range[j]
    result[j, x] = mandelbrot_kernel(ci, y)  # the coordinate, not the loop index
  end
end
function mandelbrot_turbo!(result, x_range, y_range, N)
    @sync for x in eachindex(x_range)
        # Threads.@spawn allows dynamic scheduling instead of the static
        # scheduling of the Threads.@threads macro.
        # See <https://github.com/JuliaLang/julia/issues/21017>.
        Threads.@spawn @inbounds begin
          inner_turbo(result, x_range, y_range, x, N)
        end
    end
    return result
end
# Convenience wrapper matching the `mandelbrot_turbo!(result, Val(8))` call
# above, assuming the globally defined extent constants.
mandelbrot_turbo!(result, N::Val) = mandelbrot_turbo!(result, range(xmin, xmax, xn), range(ymin, ymax, yn), N)

How does this do for you?

Mine is a 9940X, a 14-core CPU with AVX512. That is also why I used Val(8) above.

I don’t have an Ubuntu computer at the moment.
@jling, relative results often differ from one computer to another (like how mine differed from @sdanisch’s), so I wouldn’t be confident in the relative performance versus Mojo without trying it on the same computer.

Seems like this really benefits from the load balancing you get by spawning all those tasks. I guess there is substantial variation in times across rows.

Iterating over the outer (column) loop in a random order also seems to help me (perm is a permutation vector):

julia> @benchmark compute_mandelbrot_polyturbo!($result, $perm)
BenchmarkTools.Trial: 964 samples with 1 evaluation.
 Range (min … max):  859.147 μs …   5.992 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     866.572 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.035 ms ± 443.858 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▄▃
  █████▇▆▁▅▅▇▇▆▄▄▅▆▅▄▄▆▆▆▆▆▆▅▆▆▅▄▄▅▄▆▅▆▄▄▅▆▄▄▄▅▁▄▁▄▅▄▁▄▁▄▁▁▁▄▄▄ ▇
  859 μs        Histogram: log(frequency) by time       2.81 ms <

Memory estimate: 0 bytes, allocs estimate: 0.

Code:

function compute_mandelbrot_polyturbo!(result, perm)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)
  
  @batch per=thread for _i = 1:xn
    i = perm[_i]
    x = x_range[i]
    @turbo for j = 1:yn
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(x, y)
    end
  end
  return result
end

using Random
perm = randperm(xn);
@benchmark compute_mandelbrot_polyturbo!($result)
@benchmark compute_mandelbrot_polyturbo!($result, $perm)
7 Likes

On a related note, one thing that might make the code faster is to use AVX vectors that cover a rectangular block of pixels rather than a line (since the closer together the points in a vector are, the more correlated their iteration counts will be).
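For concreteness, a hypothetical sketch of gathering a 2-row × 4-column block of coordinates into one 8-wide chunk (the names and block shape are mine):

# Lanes k = 1..8 cover a 2×4 pixel block starting at (j0, i0), instead of
# 8 consecutive pixels along a single row.
function block_coords(x_range, y_range, i0, j0)
  block_x = ntuple(k -> x_range[i0 + (k - 1) % 4], Val(8))  # column varies fastest
  block_y = ntuple(k -> y_range[j0 + (k - 1) ÷ 4], Val(8))  # two rows per chunk
  return block_x, block_y
end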

2 Likes

In this code I don’t see any @simd or other similar macro… Just @inbounds - is that enough to tell Julia to automatically apply SIMD here?

I’m just trying to understand how this example works, thanks for creating it!

The compiler is just being smart enough to turn broadcasted NTuple operations into SIMD here. LLVM is good at its job.
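A small illustration of what that means (my example, not from the thread; check @code_llvm on your own machine):

# Tuple broadcasting like this typically compiles down to a single vector
# instruction, e.g. a `fadd <4 x double>` on an AVX-capable CPU.
add4(a::NTuple{4,Float64}, b::NTuple{4,Float64}) = a .+ b

# @code_llvm add4((1.0, 2.0, 3.0, 4.0), (0.5, 0.5, 0.5, 0.5))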

4 Likes

The @simd macro is not particularly useful. In most cases where I see people use it, it’s superfluous: they’re either applying it to a loop that has no hope of being vectorized, or to a loop that would already get vectorized automatically.

8 Likes

I tried to install Mojo on an Ubuntu 20.04 server, but the installation was still going after a couple of hours, so I stopped it. I think it’s best to wait a bit for this to get ironed out.

4 Likes

If this were about efficiently computing the Mandelbrot set, rather than comparing with Mojo, then using a different algorithm would give even more speed. Checking for the cardioid and the period-2 bulb for starters, then periodicity checking, then Mariani-Silver followed by border tracing seems a good way to go. I’m not a Julia expert, so I can’t contribute code, but Wikipedia has a great overview of possible algorithms (Plotting algorithms for the Mandelbrot set - Wikipedia) and Rico Mariani explains his algorithm here: The Mariani-Silver Algorithm for Drawing the Mandelbrot Set | by Rico Mariani | Medium
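For reference, the cardioid and period-2-bulb checks are only a few lines; a sketch using the standard formulas from the linked Wikipedia article (the function name is mine):

# Points inside the main cardioid or the period-2 bulb never escape, so a
# kernel can return MAX_ITERS for them without iterating at all.
function in_cardioid_or_bulb(c::Complex)
  x, y = real(c), imag(c)
  q = (x - 0.25)^2 + y^2
  in_cardioid = q * (q + (x - 0.25)) <= 0.25 * y^2  # main cardioid test
  in_bulb = (x + 1)^2 + y^2 <= 0.0625               # period-2 bulb test
  return in_cardioid | in_bulb
end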

6 Likes

To expand on what others have said, @simd does one thing: it allows reassociating the operations in reductions. Normally, vectorizing something like a sum would be illegal, because it changes the order of the additions; @simd makes that reordering legal.
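For example (my illustration, not from the thread), this reduction only vectorizes because @simd permits the accumulator to be reordered:

# Floating-point addition is not associative, so without @simd LLVM must
# keep the serial order of the additions, which blocks vectorization.
function mysum(x::Vector{Float64})
  s = 0.0
  @simd for i in eachindex(x)
    @inbounds s += x[i]
  end
  return s
end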

@simd ivdep does a little more, promising to LLVM that the loop iterations may be executed in parallel. When vectorization fails because of possible aliasing, @simd ivdep helps.

Neither of those was a problem for @sdanisch’s code, so neither @simd nor @simd ivdep was needed.

8 Likes

Back in the day I had a screensaver for BeOS that did border checking and splitting; it showed off the (then new) multiprocessing capabilities of BeOS. The algorithm was to start with a square in the plane, divide it into 4 quarters, and check each quarter’s edge pixels: if they were all the same color, fill the interior with that color; if not, split that square into 4 squares and repeat, until you were checking fewer than some small number of pixels, at which point you’d just check all the pixels in the square (I think maybe 8x8, but on modern hardware it might be worth using 16x16 or 32x32; at the time the machine was clocked at about 200 MHz or less).

The process operated with a queue (the equivalent of a Julia Channel) and N worker threads, one per core. Each worker thread would read a rectangle from the queue, check it, and then either fill it with a constant color, split it and push the new rectangles onto the work queue, or, once it was small enough, fill the whole thing by checking each pixel. Drawing was done into one pixel buffer, and a separate worker thread would occasionally blit that pixel buffer to the screensaver window, so you’d see the algorithm proceed in real time.
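A hedged Julia sketch of that work-queue pattern (the names and the `process` placeholder are mine, not the screensaver’s actual code):

# A Channel plays the role of the queue; one task per thread consumes
# rectangles. Here the work list is fixed up front; the real algorithm also
# pushes new quarters back onto the queue, which needs extra termination
# bookkeeping.
function run_workers(process, rects)
  queue = Channel{eltype(rects)}(Inf)
  foreach(r -> put!(queue, r), rects)
  close(queue)
  @sync for _ in 1:Threads.nthreads()
    Threads.@spawn for rect in queue
      process(rect)  # fill uniformly, brute-force, or split, per the rules above
    end
  end
end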

It was called MandelSaver, this would have been about 1995 or so. Good times.

Anyway, the point of all that was to say that the real advantage of Julia isn’t that you can extract every tiny bit of raw speed out of the CPU; it’s that you can express more complex algorithms in a way your brain can still understand, without sacrificing most of the speed. If you can get 80% of the raw speed but that lets you use a 200x faster algorithm, that’s a huge win. Sometimes raw speed is needed, of course, when you’re already using the fastest known algorithm, etc.

7 Likes

Am I right in thinking that the main obstacle to speed here is the branch abs2(z) > 4?
If so, could we make a package to handle this kind of branching in parallel loops?

1 Like

Yes, the main issue is that the branch exits the loop.

Also, the optimal loop to SIMD is not the innermost one. LLVM is really bad whenever that’s the case, so it needs help.
SIMD.jl and LoopVectorization.jl can both help there, but LoopVectorization.jl also doesn’t handle early-exit loops like that branch, which is why mandelbrot_kernel needs a similar workaround.
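To make that workaround concrete without the VectorizationBase types, here is a sketch of the same masking idea written with plain tuples (the kernel_lanes name and tuple formulation are mine):

# Run W pixels in lockstep: every lane records the iteration at which it
# escaped, and the loop exits only once all lanes have escaped. This mirrors
# the masked mandelbrot_kernel shown earlier in the thread.
function kernel_lanes(cs::NTuple{W,ComplexF64}) where {W}
  zs = cs
  counts = ntuple(_ -> MAX_ITERS, Val(W))
  done = ntuple(_ -> false, Val(W))
  for i = 1:MAX_ITERS
    zs = zs .* zs .+ cs
    escaped = abs2.(zs) .> 4
    counts = ifelse.(done, counts, ifelse.(escaped, i, counts))
    done = done .| escaped
    all(done) && return counts
  end
  return counts
end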

3 Likes