Changing the algorithm isn’t fair for a comparison against Numba and JAX, and in my benchmark it didn’t make much difference anyway given my memory speed.
`Complex` multiplication will use `muladd` with `@fastmath` soon: make `@fastmath` `Complex` multiplication use `muladd` by oscardssmith · Pull Request #45676 · JuliaLang/julia · GitHub
Can we let it use `fma` by default, without `@fastmath`? I don’t know exactly what the nuance is, but the `muladd` docs say it is sometimes implemented as `fma`. I also remember that if you use `muladd` directly instead of `fma` and the hardware doesn’t have it, it would be very slow?
julia> @benchmark run4(2000,3000)
BenchmarkTools.Trial: 187 samples with 1 evaluation.
Range (min … max): 25.048 ms … 29.405 ms ┊ GC (min … max): 0.00% … 2.95%
Time (median): 27.816 ms ┊ GC (median): 0.00%
Time (mean ± σ): 26.838 ms ± 1.716 ms ┊ GC (mean ± σ): 2.25% ± 3.49%
▁█ ▅ ▄ ▅
██▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▄█▆▅▁▁▄▁▁▁▁▁▁▁▁▄█ ▄
25 ms Histogram: log(frequency) by time 29.3 ms <
Memory estimate: 22.89 MiB, allocs estimate: 2.
julia> @benchmark run_julia(2000,3000)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 273.904 ms … 285.103 ms ┊ GC (min … max): 1.36% … 0.26%
Time (median): 276.711 ms ┊ GC (median): 1.35%
Time (mean ± σ): 277.268 ms ± 2.543 ms ┊ GC (mean ± σ): 1.30% ± 0.25%
▃ ▃█ █ ▃
▇▁▁▁▁▁▁▁▁█▇██▁▁▇▁▇█▇█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
274 ms Histogram: frequency by time 285 ms <
Memory estimate: 68.66 MiB, allocs estimate: 4.
This is single threaded. Code:
julia> macro vp(expr)
           # Attach LLVM loopinfo metadata requesting predicated (masked)
           # vectorization, so trip counts that aren't a multiple of the
           # SIMD width don't fall back to a scalar remainder loop.
           nodes = (Symbol("llvm.loop.vectorize.predicate.enable"), 1)
           if expr.head != :for
               error("Syntax error: loopinfo needs a for loop")
           end
           push!(expr.args[2].args, Expr(:loopinfo, nodes))
           return esc(expr)
       end
julia> function run4(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           fractal = fill(Int32(20), height, width)
           @inbounds @fastmath for w in 1:width
               @vp for h in 1:height
                   z_re = _c_re = x[w]
                   z_im = _c_im = y[h]
                   m = true
                   # Fully unroll the 20 escape-time iterations, branch-free,
                   # so the h loop stays vectorizable.
                   Base.Cartesian.@nexprs 20 i -> begin
                       z_re, z_im = _c_re + z_re*z_re - z_im*z_im, _c_im + 2*z_re*z_im
                       az4 = (z_re*z_re + z_im*z_im) > 4f0
                       # Record i the first time |z|^2 exceeds 4, then keep m false.
                       fractal[h, w] = ifelse(m & az4, i % Int32, fractal[h, w])
                       m &= !az4
                   end
               end
           end
           return fractal
       end
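For reference, `run_julia` is the plain direct translation; it isn't reproduced in this post, but it looks roughly like this (a sketch from memory, details may differ; the allocation of a complex grid is consistent with the 68.66 MiB estimate above):

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ im .* y                      # ComplexF32 grid: the extra allocation
    fractal = fill(Int32(20), height, width)
    @inbounds for w in 1:width, h in 1:height
        z = c[h, w]
        for i in 1:20
            z = z * z + c[h, w]
            # Data-dependent early exit: this branch is what blocks SIMD.
            if abs2(z) > 4f0
                fractal[h, w] = Int32(i)
                break
            end
        end
    end
    return fractal
end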
The predicated `run4` version also handles awkward sizes with respect to the vector length quite well:
julia> @benchmark run_julia(10,10)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 3.316 μs … 189.292 μs ┊ GC (min … max): 0.00% … 95.85%
Time (median): 3.414 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.580 μs ± 2.479 μs ┊ GC (mean ± σ): 0.95% ± 1.36%
▅ ▆█ ▁
▅█▆██▇▄▃▂▂▂▂▂▂▄▆▃▃▂▃▇█▆▄▃▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
3.32 μs Histogram: frequency by time 4.46 μs <
Memory estimate: 1.36 KiB, allocs estimate: 2.
julia> @benchmark run4(10,10)
BenchmarkTools.Trial: 10000 samples with 57 evaluations.
Range (min … max): 896.930 ns … 23.460 μs ┊ GC (min … max): 0.00% … 92.93%
Time (median): 905.561 ns ┊ GC (median): 0.00%
Time (mean ± σ): 936.896 ns ± 423.935 ns ┊ GC (mean ± σ): 0.97% ± 2.06%
▅██▅▄▃ ▁▃▃▁ ▃▃▁ ▂▄▂▁▁▁ ▂
████████▇█████▆▅▅▃▃▃▃▁▃▄▁▆█▆▄█████████████▇▇▅▆▄▅▅▅▃▄▃▄▅▅▄▆▆▇▇ █
897 ns Histogram: log(frequency) by time 1.12 μs <
Memory estimate: 496 bytes, allocs estimate: 1.
julia> @benchmark run_julia(16,16)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min … max): 8.377 μs … 960.083 μs ┊ GC (min … max): 0.00% … 93.93%
Time (median): 8.663 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.933 μs ± 9.612 μs ┊ GC (mean ± σ): 1.01% ± 0.94%
█▃ ▁▁
▁▂████▅▇██▆▄▃▂▂▂▂▂▄▆▇▆▆▅▄▄▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁ ▂
8.38 μs Histogram: frequency by time 9.96 μs <
Memory estimate: 3.19 KiB, allocs estimate: 2.
julia> @benchmark run4(16,16)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.236 μs … 147.251 μs ┊ GC (min … max): 0.00% … 97.03%
Time (median): 1.306 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.438 μs ± 1.935 μs ┊ GC (mean ± σ): 1.86% ± 1.38%
▅█▆▃▂▂▁ ▅▄▁ ▅▆▃▂ ▃▃▂▁ ▂
▄▁▁███████▇▅▄▁▁▁███▆▇▅▇███████████████▇▇▆▆▆▆▆▅▆▅▆▆▆▆▅▅▄▅▃▄▅ █
1.24 μs Histogram: log(frequency) by time 2 μs <
Memory estimate: 1.06 KiB, allocs estimate: 1.
This is with `@fastmath` so we can use fma. The reason I used `muladd` instead of `fma` is so that hardware without FMA won’t be slow.
OK, so I remembered it backwards: we should always use `muladd` instead of `fma`.
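To spell out the difference with a toy example of my own (not from the PR): `fma` guarantees the single-rounding fused result at any cost, so it falls back to slow software emulation on hardware without an FMA unit, while `muladd` merely permits fusion when it's fast:

0.1 * 0.1 - 0.01          # two roundings: ≈ 1.73e-18 (the product is rounded first)
fma(0.1, 0.1, -0.01)      # one rounding: ≈ 9.0e-19, software-emulated without FMA hardware
muladd(0.1, 0.1, -0.01)   # whichever of the two is fastest on the target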
Well, that is cheating: in reality, what if they bump the cutoff to 200? I mean, this is just compiler-level “cheating”, since the unroll count is hardcoded to match the iteration cutoff of 20 (see the expansion below).
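To make that concrete: `@nexprs N` pastes N copies of the body at macro-expansion time, so the cutoff is baked into the generated code rather than being a runtime parameter (printed form may vary slightly across Julia versions):

julia> @macroexpand Base.Cartesian.@nexprs 3 i -> y_i = a[i]
quote
    y_1 = a[1]
    y_2 = a[2]
    y_3 = a[3]
end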
I learned just today that it is now possible to use multithreading with Numba, by using the decorator option `@jit(parallel=True)` and replacing the `range` function in the loops with `numba.prange`. I wonder how this would compare against Julia’s `Threads.@threads` (see the sketch below).
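A minimal way to thread `run4` from above, assuming the outer loop is safe to parallelize because each task writes a disjoint column (my sketch, not benchmarked; I've dropped `@vp` here for simplicity, it composes the same way as in `run4`; start Julia with e.g. `julia -t 8`):

function run4_threaded(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    fractal = fill(Int32(20), height, width)
    Threads.@threads for w in 1:width       # one task per chunk of columns
        @inbounds @fastmath for h in 1:height
            z_re = _c_re = x[w]
            z_im = _c_im = y[h]
            m = true
            Base.Cartesian.@nexprs 20 i -> begin
                z_re, z_im = _c_re + z_re*z_re - z_im*z_im, _c_im + 2*z_re*z_im
                az4 = (z_re*z_re + z_im*z_im) > 4f0
                fractal[h, w] = ifelse(m & az4, i % Int32, fractal[h, w])
                m &= !az4
            end
        end
    end
    return fractal
end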
The compiler is not smart enough to figure out what to do. Ideally we wouldn’t unroll that loop at all, but if we leave it as a loop, the compiler fails to optimize and vectorize the outer loops. The dependencies are also too complicated for the old LoopVectorization to understand, but the rewrite should be able to optimize this correctly.
There are times when you really need an `fma`, but if you just care about performance, `muladd` is slightly safer (that said, other than M1 + Rosetta, pretty much any modern computer has FMA).
Right, and that’s beside the point anyway: benchmarking single-threaded performance is perfectly fine.
So, to correct my question: can we use `muladd` for the case where `@fastmath` is off? Because we’re also encouraging people not to use `@fastmath`.
The one downside with doing so is that the current algorithm does a better job of correctly returning `0.0` for the real part when the real part of the answer should be zero. I’m not sure overall how I would feel about changing it.
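To illustrate that downside with my own toy numbers: the real part of `(a + b*im) * (c + d*im)` is `a*c - b*d`, and fusing one multiply into an fma exposes the rounding error of the other, so an exact cancellation can turn into a tiny nonzero value:

a = b = c = d = 0.1
a*c - b*d            # == 0.0: both products round identically, then cancel
fma(a, c, -(b*d))    # ≈ -8.3e-19: the fused product is exact, so the
                     # rounding error of b*d survives as the real part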
tl;dr so far: `fma`/`muladd` helps, and we can probably beat Numba with reasonable carefulness plus a direct translation, without using `@fastmath`.
But can we beat JAX, which seems to be much faster?
Are you willing to use SIMD.jl?
yeah, I tried to think of a way to use LoopVectorization but couldn’t
BenchmarksGame.jl/mandelbrot-fast.jl at master · JuliaPerf/BenchmarksGame.jl · GitHub might have something
I already commented on this. Both benchmarks should be fixed. That allocation is ugly, confusing, and pointless. It’s an abomination, even if it doesn’t impact performance much.
You can tell that to NumPy, because the entire NumPy usage model dictates that there will be a lot of intermediate allocations; it also doesn’t help that we can’t beat their “abomination” version.
Thanks for the pointer, but I’m trying to do an apples-to-apples comparison in terms of readability; we shouldn’t need 5x as much code as Numba, right?