Hard to beat Numba / JAX loop for generating Mandelbrot

Changing the algorithm isn't a fair comparison against Numba and JAX, and from my benchmark it didn't make much difference anyway, given my memory speed.

Complex will use muladd with `@fastmath` soon: make `@fastmath` `Complex` multiplication use `muladd` by oscardssmith · Pull Request #45676 · JuliaLang/julia · GitHub
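Roughly the idea behind that PR (a sketch for illustration, not the actual Base implementation): write each component of the complex product with `muladd`, so that LLVM is free to fuse it into a hardware fma.

# Illustrative sketch of a muladd-based complex multiply (not Base's exact code)
mul_muladd(z::Complex{T}, w::Complex{T}) where {T<:Base.IEEEFloat} =
    Complex(muladd(real(z), real(w), -(imag(z) * imag(w))),   # re: zr*wr - zi*wi
            muladd(real(z), imag(w), imag(z) * real(w)))      # im: zr*wi + zi*wr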

7 Likes

Can we let it use fma by default, without `@fastmath`? I don't know the exact nuance, but the `muladd` docs say it is sometimes implemented as `fma`.

I also remember that if you use `muladd` directly instead of `fma`, and the hardware doesn't have it, it would be very slow?
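For concreteness, here is the semantic difference (my own illustration in Float64 with standard rounding):

julia> a = 1.0 + 2.0^-30; b = 1.0 - 2.0^-30; c = -1.0;

julia> a * b + c        # two roundings: the product rounds to 1.0, the -2^-60 is lost
0.0

julia> fma(a, b, c)     # single rounding: the exact -2^-60 survives
-8.673617379884035e-19

julia> muladd(a, b, c)  # free to give either result; with hardware fma, typically:
-8.673617379884035e-19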

julia> @benchmark run4(2000,3000)
BenchmarkTools.Trial: 187 samples with 1 evaluation.
 Range (min … max):  25.048 ms … 29.405 ms  ┊ GC (min … max): 0.00% … 2.95%
 Time  (median):     27.816 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.838 ms ±  1.716 ms  ┊ GC (mean ± σ):  2.25% ± 3.49%

  ▁█                                    ▅    ▄              ▅
  ██▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▄█▆▅▁▁▄▁▁▁▁▁▁▁▁▄█ ▄
  25 ms        Histogram: log(frequency) by time      29.3 ms <

 Memory estimate: 22.89 MiB, allocs estimate: 2.

julia> @benchmark run_julia(2000,3000)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  273.904 ms … 285.103 ms  ┊ GC (min … max): 1.36% … 0.26%
 Time  (median):     276.711 ms               ┊ GC (median):    1.35%
 Time  (mean ± σ):   277.268 ms ±   2.543 ms  ┊ GC (mean ± σ):  1.30% ± 0.25%

           ▃ ▃█     █ ▃
  ▇▁▁▁▁▁▁▁▁█▇██▁▁▇▁▇█▇█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  274 ms           Histogram: frequency by time          285 ms <

 Memory estimate: 68.66 MiB, allocs estimate: 4.

This is single-threaded. Code:

julia> macro vp(expr)
           # attach LLVM loopinfo metadata requesting a predicated (masked) vectorized loop
           nodes = (Symbol("llvm.loop.vectorize.predicate.enable"), 1)
           if expr.head != :for
               error("Syntax error: loopinfo needs a for loop")
           end
           push!(expr.args[2].args, Expr(:loopinfo, nodes))
           return esc(expr)
       end

julia> function run4(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           fractal = fill(Int32(20), height, width)  # default: still bounded after 20 iterations
           @inbounds @fastmath for w in 1:width
               @vp for h in 1:height
                   z_re = _c_re = x[w]
                   z_im = _c_im = y[h]
                   m = true  # this pixel has not escaped yet
                   # fully unroll the 20 iterations so the h loop stays branch-free
                   Base.Cartesian.@nexprs 20 i -> begin
                       z_re, z_im = _c_re + z_re*z_re - z_im*z_im, _c_im + 2*z_re*z_im
                       az4 = (z_re*z_re + z_im*z_im) > 4f0  # |z|^2 > 4 means escaped
                       fractal[h, w] = ifelse(m & az4, i%Int32, fractal[h, w])
                       m &= (!az4)  # record only the first escape
                   end
               end
           end
           return fractal
       end
This also works quite well with awkward sizes with respect to the vector length:
julia> @benchmark run_julia(10,10)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.316 μs … 189.292 μs  ┊ GC (min … max): 0.00% … 95.85%
 Time  (median):     3.414 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.580 μs ±   2.479 μs  ┊ GC (mean ± σ):  0.95% ±  1.36%

   ▅ ▆█                ▁
  ▅█▆██▇▄▃▂▂▂▂▂▂▄▆▃▃▂▃▇█▆▄▃▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  3.32 μs         Histogram: frequency by time        4.46 μs <

 Memory estimate: 1.36 KiB, allocs estimate: 2.

julia> @benchmark run4(10,10)
BenchmarkTools.Trial: 10000 samples with 57 evaluations.
 Range (min … max):  896.930 ns …  23.460 μs  ┊ GC (min … max): 0.00% … 92.93%
 Time  (median):     905.561 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   936.896 ns ± 423.935 ns  ┊ GC (mean ± σ):  0.97% ±  2.06%

  ▅██▅▄▃   ▁▃▃▁                ▃▃▁ ▂▄▂▁▁▁                       ▂
  ████████▇█████▆▅▅▃▃▃▃▁▃▄▁▆█▆▄█████████████▇▇▅▆▄▅▅▅▃▄▃▄▅▅▄▆▆▇▇ █
  897 ns        Histogram: log(frequency) by time       1.12 μs <

 Memory estimate: 496 bytes, allocs estimate: 1.

julia> @benchmark run_julia(16,16)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.377 μs … 960.083 μs  ┊ GC (min … max): 0.00% … 93.93%
 Time  (median):     8.663 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.933 μs ±   9.612 μs  ┊ GC (mean ± σ):  1.01% ±  0.94%

     █▃   ▁▁
  ▁▂████▅▇██▆▄▃▂▂▂▂▂▄▆▇▆▆▅▄▄▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁ ▂
  8.38 μs         Histogram: frequency by time        9.96 μs <

 Memory estimate: 3.19 KiB, allocs estimate: 2.

julia> @benchmark run4(16,16)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.236 μs … 147.251 μs  ┊ GC (min … max): 0.00% … 97.03%
 Time  (median):     1.306 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.438 μs ±   1.935 μs  ┊ GC (mean ± σ):  1.86% ±  1.38%

     ▅█▆▃▂▂▁      ▅▄▁    ▅▆▃▂   ▃▃▂▁                          ▂
  ▄▁▁███████▇▅▄▁▁▁███▆▇▅▇███████████████▇▇▆▆▆▆▆▅▆▅▆▆▆▆▅▅▄▅▃▄▅ █
  1.24 μs      Histogram: log(frequency) by time         2 μs <

 Memory estimate: 1.06 KiB, allocs estimate: 1.
8 Likes

This is with `@fastmath` so we can use fma. The reason I used `muladd` instead of `fma` is so that hardware without fma won't be slow.

1 Like

OK, so I remembered it backward: we should always use `muladd` instead of `fma`.

Well, that is cheating: the full unroll only works because the cutoff is 20. In reality, what if they bump the cutoff to 200? I mean, this is just compiler "cheating".

I learned just today that it is now possible to use multithreading with Numba, by using the decorator option @jit(parallel = True) and by replacing the range function in the loops with numba.prange. I wonder how this would compare against Julia's Threads.@threads.
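For reference, a minimal sketch of the Julia side (untested here; it reuses the `@vp` macro and the `run4` kernel from above, and requires starting Julia with threads, e.g. `julia -t auto`):

# Hypothetical threaded variant of run4: threads split the columns, so each
# thread writes to disjoint parts of `fractal` and there is no data race.
function run4_threaded(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    fractal = fill(Int32(20), height, width)
    Threads.@threads for w in 1:width
        @inbounds @fastmath begin
            @vp for h in 1:height
                z_re = _c_re = x[w]
                z_im = _c_im = y[h]
                m = true
                Base.Cartesian.@nexprs 20 i -> begin
                    z_re, z_im = _c_re + z_re*z_re - z_im*z_im, _c_im + 2*z_re*z_im
                    az4 = (z_re*z_re + z_im*z_im) > 4f0
                    fractal[h, w] = ifelse(m & az4, i%Int32, fractal[h, w])
                    m &= (!az4)
                end
            end
        end
    end
    return fractal
end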

3 Likes

The compiler is not smart enough to figure out what to do. Ideally, we wouldn't unroll that loop at all, but if we leave it as a loop, the compiler fails to optimize and vectorize the outer loops.

The dependencies are also too complicated for the old LoopVectorization to understand, but the rewrite should be able to optimize this correctly.
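For reference, the non-unrolled kernel under discussion would look like the sketch below (illustrative only, names hypothetical); the data-dependent early exit gives each pixel a different trip count, which is what blocks vectorization across pixels:

# Plain escape-time loop for one pixel (illustrative sketch).
function escape_time(c_re::Float32, c_im::Float32, maxiter::Int32 = Int32(20))
    z_re, z_im = c_re, c_im
    i = Int32(0)
    # the while condition depends on the data, so trip counts diverge per pixel
    while i < maxiter && (z_re*z_re + z_im*z_im) <= 4f0
        z_re, z_im = c_re + z_re*z_re - z_im*z_im, c_im + 2*z_re*z_im
        i += Int32(1)
    end
    return i
end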

3 Likes

There are times when you really need an `fma`, but if you just care about performance, `muladd` is slightly safer (that said, other than M1 + Rosetta, pretty much any modern computer has fma).

1 Like

Right, and also that's beside the point; benchmarking single-thread performance is perfectly fine.

1 Like

So, to correct my question: can we use `muladd` for the case where `@fastmath` is off? Because we're also encouraging people not to use `@fastmath`.

The one downside of doing so is that the current algorithm does a better job of correctly returning 0.0 for the real part when the real part of the answer should be zero. I'm not sure overall how I would feel about changing it.
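To illustrate (my own numbers; any value whose square is not exactly representable works):

julia> x = 1.0 + 2.0^-27; z = x + x*im;   # real(z*z) is exactly zero in exact arithmetic

julia> real(z * z)          # current algorithm: the two rounded products cancel exactly
0.0

julia> fma(x, x, -(x * x))  # an fma-based real part would keep the rounding error
5.551115123125783e-17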

2 Likes

tl;dr so far: fma/muladd helps, and we can probably beat Numba with reasonable care and a direct translation, without using `@fastmath`.

But can we beat JAX, which seems to be much faster?

Are you willing to use SIMD.jl?

Yeah, I tried to think of a way to use LoopVectorization, but couldn't.

BenchmarksGame.jl/mandelbrot-fast.jl at master · JuliaPerf/BenchmarksGame.jl · GitHub might have something
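For flavor, explicit SIMD.jl code for this kernel would look roughly like the sketch below (my sketch with an assumed width of 8 lanes; it is not the BenchmarksGame implementation):

using SIMD

# Iterate 8 pixels at once; returns per-lane escape counts.
function mandel8(c_re::Vec{8,Float32}, c_im::Vec{8,Float32}, maxiter::Int)
    z_re, z_im = c_re, c_im
    count = Vec{8,Int32}(Int32(0))
    for _ in 1:maxiter
        active = (z_re*z_re + z_im*z_im) <= 4f0   # lanes still bounded
        any(active) || break                      # every lane has escaped
        count += vifelse(active, Vec{8,Int32}(Int32(1)), Vec{8,Int32}(Int32(0)))
        z_re, z_im = c_re + z_re*z_re - z_im*z_im, c_im + 2f0*z_re*z_im
    end
    return count
end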

2 Likes

I already commented on this. Both benchmarks should be fixed. That allocation is ugly, confusing, and pointless. It's an abomination, even if it doesn't impact performance much.

You can tell that to NumPy, because typical NumPy usage dictates a lot of intermediate allocation, and it also doesn't help that we can't beat their "abomination" version :slight_smile:

1 Like

Thanks for the pointer, but I'm trying to do an apples-to-apples comparison, in terms of readability. Like, we shouldn't have 5x the code that Numba has, right?

1 Like