Hard to beat Numba / JAX loop for generating Mandelbrot

Changing the algorithm isn't a fair comparison against Numba and JAX, and from my benchmark it didn't make much difference anyway, given my memory speed.

Complex will use muladd with `@fastmath` soon: make `@fastmath` `Complex` multiplication use `muladd` by oscardssmith · Pull Request #45676 · JuliaLang/julia · GitHub
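Roughly the idea behind that PR (a sketch for illustration, not the actual Base implementation): write each component of the complex product with `muladd`, so that LLVM is free to fuse it into a hardware fma.

# Illustrative sketch of a muladd-based complex multiply (not Base's exact code)
mul_muladd(z::Complex{T}, w::Complex{T}) where {T<:Base.IEEEFloat} =
    Complex(muladd(real(z), real(w), -(imag(z) * imag(w))),   # re: zr*wr - zi*wi
            muladd(real(z), imag(w), imag(z) * real(w)))      # im: zr*wi + zi*wr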

7 Likes

Can we let it use fma by default, without `@fastmath`? I don't know the exact nuance, but the `muladd` docs say it is sometimes implemented as `fma`.

I also remember that if you use `muladd` directly instead of `fma`, and the hardware doesn't have it, it would be very slow?
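For concreteness, here is the semantic difference (my own illustration in Float64 with standard rounding):

julia> a = 1.0 + 2.0^-30; b = 1.0 - 2.0^-30; c = -1.0;

julia> a * b + c        # two roundings: the product rounds to 1.0, the -2^-60 is lost
0.0

julia> fma(a, b, c)     # single rounding: the exact -2^-60 survives
-8.673617379884035e-19

julia> muladd(a, b, c)  # free to give either result; with hardware fma, typically:
-8.673617379884035e-19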

julia> @benchmark run4(2000,3000)
BenchmarkTools.Trial: 187 samples with 1 evaluation.
 Range (min … max):  25.048 ms … 29.405 ms  ┊ GC (min … max): 0.00% … 2.95%
 Time  (median):     27.816 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.838 ms ±  1.716 ms  ┊ GC (mean ± σ):  2.25% ± 3.49%

  ▁█                                    ▅    ▄              ▅
  ██▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▄█▆▅▁▁▄▁▁▁▁▁▁▁▁▄█ ▄
  25 ms        Histogram: log(frequency) by time      29.3 ms <

 Memory estimate: 22.89 MiB, allocs estimate: 2.

julia> @benchmark run_julia(2000,3000)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  273.904 ms … 285.103 ms  ┊ GC (min … max): 1.36% … 0.26%
 Time  (median):     276.711 ms               ┊ GC (median):    1.35%
 Time  (mean ± σ):   277.268 ms ±   2.543 ms  ┊ GC (mean ± σ):  1.30% ± 0.25%

           ▃ ▃█     █ ▃
  ▇▁▁▁▁▁▁▁▁█▇██▁▁▇▁▇█▇█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  274 ms           Histogram: frequency by time          285 ms <

 Memory estimate: 68.66 MiB, allocs estimate: 4.

This is single-threaded. Code:

julia> macro vp(expr)
           # attach LLVM loopinfo metadata requesting a predicated (masked) vectorized loop
           nodes = (Symbol("llvm.loop.vectorize.predicate.enable"), 1)
           if expr.head != :for
               error("Syntax error: loopinfo needs a for loop")
           end
           push!(expr.args[2].args, Expr(:loopinfo, nodes))
           return esc(expr)
       end

julia> function run4(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           fractal = fill(Int32(20), height, width)  # default: still bounded after 20 iterations
           @inbounds @fastmath for w in 1:width
               @vp for h in 1:height
                   z_re = _c_re = x[w]
                   z_im = _c_im = y[h]
                   m = true  # this pixel has not escaped yet
                   # fully unroll the 20 iterations so the h loop stays branch-free
                   Base.Cartesian.@nexprs 20 i -> begin
                       z_re, z_im = _c_re + z_re*z_re - z_im*z_im, _c_im + 2*z_re*z_im
                       az4 = (z_re*z_re + z_im*z_im) > 4f0  # |z|^2 > 4 means escaped
                       fractal[h, w] = ifelse(m & az4, i%Int32, fractal[h, w])
                       m &= (!az4)  # record only the first escape
                   end
               end
           end
           return fractal
       end
This also works quite well with awkward sizes with respect to the vector length:
julia> @benchmark run_julia(10,10)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.316 μs … 189.292 μs  ┊ GC (min … max): 0.00% … 95.85%
 Time  (median):     3.414 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.580 μs ±   2.479 μs  ┊ GC (mean ± σ):  0.95% ±  1.36%

   ▅ ▆█                ▁
  ▅█▆██▇▄▃▂▂▂▂▂▂▄▆▃▃▂▃▇█▆▄▃▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  3.32 μs         Histogram: frequency by time        4.46 μs <

 Memory estimate: 1.36 KiB, allocs estimate: 2.

julia> @benchmark run4(10,10)
BenchmarkTools.Trial: 10000 samples with 57 evaluations.
 Range (min … max):  896.930 ns …  23.460 μs  ┊ GC (min … max): 0.00% … 92.93%
 Time  (median):     905.561 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   936.896 ns ± 423.935 ns  ┊ GC (mean ± σ):  0.97% ±  2.06%

  ▅██▅▄▃   ▁▃▃▁                ▃▃▁ ▂▄▂▁▁▁                       ▂
  ████████▇█████▆▅▅▃▃▃▃▁▃▄▁▆█▆▄█████████████▇▇▅▆▄▅▅▅▃▄▃▄▅▅▄▆▆▇▇ █
  897 ns        Histogram: log(frequency) by time       1.12 μs <

 Memory estimate: 496 bytes, allocs estimate: 1.

julia> @benchmark run_julia(16,16)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.377 μs … 960.083 μs  ┊ GC (min … max): 0.00% … 93.93%
 Time  (median):     8.663 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.933 μs ±   9.612 μs  ┊ GC (mean ± σ):  1.01% ±  0.94%

     █▃   ▁▁
  ▁▂████▅▇██▆▄▃▂▂▂▂▂▄▆▇▆▆▅▄▄▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁ ▂
  8.38 μs         Histogram: frequency by time        9.96 μs <

 Memory estimate: 3.19 KiB, allocs estimate: 2.

julia> @benchmark run4(16,16)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.236 μs … 147.251 μs  ┊ GC (min … max): 0.00% … 97.03%
 Time  (median):     1.306 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.438 μs ±   1.935 μs  ┊ GC (mean ± σ):  1.86% ±  1.38%

     ▅█▆▃▂▂▁      ▅▄▁    ▅▆▃▂   ▃▃▂▁                          ▂
  ▄▁▁███████▇▅▄▁▁▁███▆▇▅▇███████████████▇▇▆▆▆▆▆▅▆▅▆▆▆▆▅▅▄▅▃▄▅ █
  1.24 μs      Histogram: log(frequency) by time         2 μs <

 Memory estimate: 1.06 KiB, allocs estimate: 1.
8 Likes

This is with `@fastmath` so we can use fma. The reason I used `muladd` instead of `fma` is so that hardware without fma won't be slow.

1 Like

OK, so I remembered it backward: we should always use `muladd` instead of `fma`.

Well, that is cheating: the full unroll only works because the cutoff is 20. In reality, what if they bump the cutoff to 200? I mean, this is just compiler "cheating".

I learned just today that it is now possible to use multithreading with Numba, by using the decorator option @jit(parallel = True) and by replacing the range function in the loops with numba.prange. I wonder how this would compare against Julia's Threads.@threads.
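For reference, a minimal sketch of the Julia side (untested here; it reuses the `@vp` macro and the `run4` kernel from above, and requires starting Julia with threads, e.g. `julia -t auto`):

# Hypothetical threaded variant of run4: threads split the columns, so each
# thread writes to disjoint parts of `fractal` and there is no data race.
function run4_threaded(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    fractal = fill(Int32(20), height, width)
    Threads.@threads for w in 1:width
        @inbounds @fastmath begin
            @vp for h in 1:height
                z_re = _c_re = x[w]
                z_im = _c_im = y[h]
                m = true
                Base.Cartesian.@nexprs 20 i -> begin
                    z_re, z_im = _c_re + z_re*z_re - z_im*z_im, _c_im + 2*z_re*z_im
                    az4 = (z_re*z_re + z_im*z_im) > 4f0
                    fractal[h, w] = ifelse(m & az4, i%Int32, fractal[h, w])
                    m &= (!az4)
                end
            end
        end
    end
    return fractal
end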

3 Likes

The compiler is not smart enough to figure out what to do. Ideally, we wouldn't unroll that loop at all, but if we leave it as a loop, the compiler fails to optimize and vectorize the outer loops.

The dependencies are also too complicated for the old LoopVectorization to understand, but the rewrite should be able to optimize this correctly.
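For reference, the non-unrolled kernel under discussion would look like the sketch below (illustrative only, names hypothetical); the data-dependent early exit gives each pixel a different trip count, which is what blocks vectorization across pixels:

# Plain escape-time loop for one pixel (illustrative sketch).
function escape_time(c_re::Float32, c_im::Float32, maxiter::Int32 = Int32(20))
    z_re, z_im = c_re, c_im
    i = Int32(0)
    # the while condition depends on the data, so trip counts diverge per pixel
    while i < maxiter && (z_re*z_re + z_im*z_im) <= 4f0
        z_re, z_im = c_re + z_re*z_re - z_im*z_im, c_im + 2*z_re*z_im
        i += Int32(1)
    end
    return i
end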

3 Likes

There are times when you really need an `fma`, but if you just care about performance, `muladd` is slightly safer (that said, other than M1 + Rosetta, pretty much any modern computer has fma).

1 Like

Right, and also that's beside the point; benchmarking single-thread performance is perfectly fine.

1 Like

So, to correct my question: can we use `muladd` for the case where `@fastmath` is off? Because we're also encouraging people not to use `@fastmath`.

The one downside of doing so is that the current algorithm does a better job of correctly returning 0.0 for the real part when the real part of the answer should be zero. I'm not sure overall how I would feel about changing it.
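To illustrate (my own numbers; any value whose square is not exactly representable works):

julia> x = 1.0 + 2.0^-27; z = x + x*im;   # real(z*z) is exactly zero in exact arithmetic

julia> real(z * z)          # current algorithm: the two rounded products cancel exactly
0.0

julia> fma(x, x, -(x * x))  # an fma-based real part would keep the rounding error
5.551115123125783e-17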

2 Likes

tl;dr so far: fma/muladd helps, and we can probably beat Numba with reasonable care and a direct translation, without using `@fastmath`.

But can we beat JAX, which seems to be much faster?

Are you willing to use SIMD.jl?

Yeah, I tried to think of a way to use LoopVectorization, but couldn't.

BenchmarksGame.jl/mandelbrot-fast.jl at master · JuliaPerf/BenchmarksGame.jl · GitHub might have something
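For flavor, explicit SIMD.jl code for this kernel would look roughly like the sketch below (my sketch with an assumed width of 8 lanes; it is not the BenchmarksGame implementation):

using SIMD

# Iterate 8 pixels at once; returns per-lane escape counts.
function mandel8(c_re::Vec{8,Float32}, c_im::Vec{8,Float32}, maxiter::Int)
    z_re, z_im = c_re, c_im
    count = Vec{8,Int32}(Int32(0))
    for _ in 1:maxiter
        active = (z_re*z_re + z_im*z_im) <= 4f0   # lanes still bounded
        any(active) || break                      # every lane has escaped
        count += vifelse(active, Vec{8,Int32}(Int32(1)), Vec{8,Int32}(Int32(0)))
        z_re, z_im = c_re + z_re*z_re - z_im*z_im, c_im + 2f0*z_re*z_im
    end
    return count
end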

2 Likes

I already commented on this. Both benchmarks should be fixed. That allocation is ugly, confusing, and pointless. It's an abomination, even if it doesn't impact performance much.

You can tell that to NumPy, because typical NumPy usage dictates a lot of intermediate allocation, and it also doesn't help that we can't beat their "abomination" version :slight_smile:

1 Like

Thanks for the pointer, but I'm trying to do an apples-to-apples comparison, in terms of readability. Like, we shouldn't have 5x the code that Numba has, right?

1 Like