Well, you have to define a ComplexSIMD. It would be good for this kind of reasoning to be accessible to newcomers, or better yet, hidden from them.
In the Mojo case that's part of a library, which is loaded on top of the code. That could go into a library in Julia as well. But what I find more interesting here is that the reasoning and code pattern apply to any composite type of a similar structure, so we can use them to optimize other code completely unrelated to complex-number calculations.
Isn't this almost like the typical "array-of-structs to struct-of-arrays" transformation, just with the added help to the compiler of pre-chunking to length 4? I guess you might get better memory locality than if all the im fields were in one separate vector and all the re fields in another.
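For concreteness, here is a rough sketch of what such a type could look like in Julia. The name ComplexSIMD and the NTuple layout are illustrative (mirroring the shape of the Mojo example, not the thread's actual library code), and it is exactly the chunked struct-of-arrays layout described above: each value packs N real parts next to N imaginary parts, and LLVM auto-vectorizes the element-wise tuple operations.

# Illustrative sketch: a length-N chunk of complex numbers stored
# struct-of-arrays style, with lane-wise arithmetic on plain NTuples.
struct ComplexSIMD{N,T}
    re::NTuple{N,T}
    im::NTuple{N,T}
end

Base.:+(a::ComplexSIMD, b::ComplexSIMD) =
    ComplexSIMD(map(+, a.re, b.re), map(+, a.im, b.im))

Base.:*(a::ComplexSIMD{N}, b::ComplexSIMD{N}) where {N} =
    ComplexSIMD(ntuple(k -> a.re[k] * b.re[k] - a.im[k] * b.im[k], N),
                ntuple(k -> a.re[k] * b.im[k] + a.im[k] * b.re[k], N))

# per-lane |z|^2, returned as an NTuple so a caller can keep a per-lane mask
Base.abs2(z::ComplexSIMD) = map((r, i) -> muladd(r, r, i * i), z.re, z.im)

Nothing in it is specific to complex numbers: the same lane-splitting works for any small composite type, which is the point made above.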
julia> @benchmark compute_mandelbrot!($result)
BenchmarkTools.Trial: 68 samples with 1 evaluation.
 Range (min … max):  14.030 ms … 15.901 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.824 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.836 ms ± 576.959 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 14 ms           Histogram: frequency by time          15.9 ms <
Memory estimate: 18.69 KiB, allocs estimate: 177.
julia> @benchmark compute_mandelbrot_turbo!($result)
BenchmarkTools.Trial: 404 samples with 1 evaluation.
 Range (min … max):  2.460 ms … 2.946 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.462 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.477 ms ± 77.247 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
julia> versioninfo()
Julia Version 1.9.3-DEV.0
Commit 6fc1be04ee (2023-07-06 14:55 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
Threads: 28 on 28 virtual cores
Code:
using LoopVectorization, VectorizationBase
# using Plots
const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200
@inline function mandelbrot_kernel(c)
    z = c
    for i = 1:MAX_ITERS
        z = z * z + c
        if abs2(z) > 4
            return i
        end
    end
    return MAX_ITERS
end
# SIMD version: advance all W lanes together, using a mask to freeze the
# iteration counts of lanes that have already escaped, and stop only once
# every lane is done.
@inline function mandelbrot_kernel(r::Vec{W}, i::Vec{W}) where {W}
    z = c = complex(r, i)
    inactive = VectorizationBase.Mask(VectorizationBase.zero_mask(Val(W)))
    imask = VectorizationBase.vbroadcast(Val(W), 0)
    for iter = 1:MAX_ITERS
        z = muladd(z, z, c)
        imask = ifelse(inactive, imask, iter) # escaped lanes keep their count
        inactive |= abs2(z) > 4
        VectorizationBase.vall(inactive) && return imask
    end
    return imask
end
@inline function mandelbrot_kernel(r::VecUnroll, i::VecUnroll)
    VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r), VectorizationBase.data(i)))
end
function compute_mandelbrot!(result)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)
    Threads.@threads for j = 1:yn
        for i = 1:xn
            x = x_range[i]
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(complex(x, y))
        end
    end
    return result
end
compute_mandelbrot() = compute_mandelbrot!(zeros(yn, xn))
function compute_mandelbrot_turbo!(result)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)
    @tturbo for j = 1:yn
        for i = 1:xn
            x = x_range[i]
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(x, y)
        end
    end
    return result
end
compute_mandelbrot_turbo() = compute_mandelbrot_turbo!(zeros(yn, xn))
result = zeros(yn, xn);
@benchmark compute_mandelbrot!($result)
@benchmark compute_mandelbrot_turbo!($result)
EDIT:
I get
julia> @benchmark mandelbrot($result, $x_range, $y_range, Val(4))
BenchmarkTools.Trial: 312 samples with 1 evaluation.
 Range (min … max):  3.094 ms … 7.088 ms   ┊ GC (min … max): 0.00% … 52.95%
 Time  (median):     3.146 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.199 ms ± 438.872 μs ┊ GC (mean ± σ):  1.49% ± 5.96%

 [histogram omitted]
 3.09 ms        Histogram: log(frequency) by time        6.98 ms <
Memory estimate: 593.08 KiB, allocs estimate: 4843.
from @sdanisch's example.
I also see better performance with more threads, but LV (LoopVectorization) is limiting itself to one thread per core and not using hyperthreads.
New best:
julia> @benchmark compute_mandelbrot_polyturbo!($result)
BenchmarkTools.Trial: 652 samples with 1 evaluation.
 Range (min … max):  1.494 ms … 4.784 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.520 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.533 ms ± 147.023 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 1.49 ms        Histogram: log(frequency) by time        1.6 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Code:
using LoopVectorization, VectorizationBase, Polyester
# using Plots
const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200
@inline function mandelbrot_kernel(c)
    z = c
    for i = 1:MAX_ITERS
        z = z * z + c
        if abs2(z) > 4
            return i
        end
    end
    return MAX_ITERS
end
# SIMD version: advance all W lanes together, using a mask to freeze the
# iteration counts of lanes that have already escaped, and stop only once
# every lane is done.
@inline function mandelbrot_kernel(r, i::Vec{W}) where {W}
    z = c = complex(r, i)
    inactive = VectorizationBase.Mask(VectorizationBase.zero_mask(Val(W)))
    imask = VectorizationBase.vbroadcast(Val(W), 0)
    for iter = 1:MAX_ITERS
        z = muladd(z, z, c)
        imask = ifelse(inactive, imask, iter) # escaped lanes keep their count
        inactive |= abs2(z) > 4
        VectorizationBase.vall(inactive) && return imask
    end
    return imask
end
@inline function mandelbrot_kernel(r::VecUnroll, i::VecUnroll)
    VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r), VectorizationBase.data(i)))
end
@inline function mandelbrot_kernel(r::Vec, i::VecUnroll)
    VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, r, VectorizationBase.data(i)))
end
@inline function mandelbrot_kernel(r::VecUnroll, i::Vec)
    VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r), i))
end
function compute_mandelbrot!(result)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)
    Threads.@threads for j = 1:yn
        for i = 1:xn
            x = x_range[i]
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(complex(x, y))
        end
    end
    return result
end
compute_mandelbrot() = compute_mandelbrot!(zeros(yn, xn))
function compute_mandelbrot_turbo!(result)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)
    @tturbo for j = 1:yn
        for i = 1:xn
            x = x_range[i]
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(x, y)
        end
    end
    return result
end
function compute_mandelbrot_polyturbo!(result)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)
    @batch per=thread for i = 1:xn
        x = x_range[i]
        @turbo for j = 1:yn
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(x, y)
        end
    end
    return result
end
compute_mandelbrot_turbo() = compute_mandelbrot_turbo!(zeros(yn, xn))
compute_mandelbrot_polyturbo() = compute_mandelbrot_polyturbo!(zeros(yn, xn))
result = zeros(yn, xn);
@benchmark compute_mandelbrot!($result)
@benchmark compute_mandelbrot_turbo!($result)
@benchmark compute_mandelbrot_polyturbo!($result)
so that's 3x faster on top of 8x faster (than the original Julia solution)? Which means this should be 6-7x faster than Mojo?
I was curious and wanted to try the benchmark out myself; it turns out you can only install Mojo on Ubuntu currently.
…but Mojo is new and not yet as optimised as a long-standing language like Julia…
Interesting!
On my PC, my version is still fastest…
I get:
compute_mandelbrot!: 18.218 ms (72 allocations: 9.03 KiB)
compute_mandelbrot_turbo!: 4.224 ms (0 allocations: 0 bytes)
compute_mandelbrot_polyturbo!: 3.799 ms (0 allocations: 0 bytes)
My version: 3.024 ms (4824 allocations: 682.48 KiB)
Although I'm surprised that my PC seems to be pretty slow in comparison to the other posters here… I have a Ryzen 9 7900X 12-core, which I thought was one of the faster CPUs on the market.
Edit:
The original version seems to benefit much more from 24 threads…
The first values were with 12 threads, while these are with 24:
@btime compute_mandelbrot!($result);
9.763 ms (142 allocations: 17.97 KiB)
@btime compute_mandelbrot_turbo!($result);
4.617 ms (0 allocations: 0 bytes)
@btime compute_mandelbrot_polyturbo!($result);
2.866 ms (0 allocations: 0 bytes)
@btime mandelbrot(result, x_range, y_range, Val(4));
2.372 ms (4836 allocations: 682.86 KiB)
Interesting. This is the fastest for me so far (a modification of your version to use @turbo):
julia> @benchmark mandelbrot_turbo!($result, Val(8))
BenchmarkTools.Trial: 804 samples with 1 evaluation.
 Range (min … max):  988.264 μs … 4.945 ms  ┊ GC (min … max): 0.00% … 72.48%
 Time  (median):     1.187 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.240 ms ± 425.295 μs  ┊ GC (mean ± σ):  3.74% ± 8.22%

 [histogram omitted]
 988 μs        Histogram: log(frequency) by time        4.77 ms <
Memory estimate: 610.55 KiB, allocs estimate: 5402.
It uses the same mandelbrot_kernel definition from above, and:
# the Val{N} argument is accepted for the call signature but unused here
function inner_turbo(result, x_range, y_range, x, ::Val{N}) where N
    ci = x_range[x]  # real coordinate of column x
    @turbo for j = eachindex(y_range)
        y = y_range[j]
        result[j, x] = mandelbrot_kernel(ci, y)
    end
end
function mandelbrot_turbo!(result, x_range, y_range, N)
    @sync for x in eachindex(x_range)
        # Threads.@spawn allows dynamic scheduling instead of the static
        # scheduling of the Threads.@threads macro.
        # See <https://github.com/JuliaLang/julia/issues/21017>.
        Threads.@spawn @inbounds begin
            inner_turbo(result, x_range, y_range, x, N)
        end
    end
    return result
end
# convenience method matching the `mandelbrot_turbo!(result, Val(8))` call above
mandelbrot_turbo!(result, N::Val) =
    mandelbrot_turbo!(result, range(xmin, xmax, xn), range(ymin, ymax, yn), N)
How does this do for you?
Mine is a 9940X, a 14-core with AVX512. That is also why I used Val(8) above.
I don't have an Ubuntu computer at the moment.
@jling, relative results often differ from one computer to another (like how mine differed from @sdanisch's), so I wouldn't be confident in relative performance versus Mojo without trying it on the same computer.
Seems like this really likes the load balancing from spamming all the tasks. I guess there is substantial variation in times.
Iterating over x in a random order also seems to help me (perm is a permutation vector):
julia> @benchmark compute_mandelbrot_polyturbo!($result, $perm)
BenchmarkTools.Trial: 964 samples with 1 evaluation.
 Range (min … max):  859.147 μs … 5.992 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     866.572 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.035 ms ± 443.858 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 859 μs        Histogram: log(frequency) by time        2.81 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Code:
function compute_mandelbrot_polyturbo!(result, perm)
    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)
    @batch per=thread for _i = 1:xn
        i = perm[_i]  # visit columns in permuted order for load balancing
        x = x_range[i]
        @turbo for j = 1:yn
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(x, y)
        end
    end
    return result
end
using Random
perm = randperm(xn);
@benchmark compute_mandelbrot_polyturbo!($result)
@benchmark compute_mandelbrot_polyturbo!($result, $perm)
On a related note, one thing that might make the code faster is to use AVX vectors that cover rectangular tiles rather than a line of pixels (since the closer the points in a vector are, the more correlated their iteration counts will be).
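Something like this lane-to-pixel mapping, say for W = 8 on AVX512 (tile_vecs is a hypothetical helper, not part of the code above):

using VectorizationBase: Vec

# map the 8 lanes to a 2-row × 4-column tile of pixels instead of an
# 8×1 strip, so lanes cover nearby points with similar escape times
function tile_vecs(x_range, y_range, i, j)
    re = Vec(ntuple(k -> x_range[i + (k - 1) ÷ 2], 8)...)
    im = Vec(ntuple(k -> y_range[j + (k - 1) % 2], 8)...)
    return re, im  # feed these into the Vec method of mandelbrot_kernel
end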
In this code I don't see any @simd or other similar macro… just @inbounds. Is that enough to tell Julia to automatically apply SIMD here? I'm just trying to understand how this example works, thanks for creating it!
The compiler is just being smart enough to turn broadcasted NTuple instructions into SIMD here. LLVM is good at its job.
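A tiny illustration of that (a hypothetical snippet, not from the example above):

# element-wise NTuple arithmetic with no @simd annotation; inspecting it
# with @code_llvm typically shows <4 x double> vector instructions
add4(a::NTuple{4,Float64}, b::NTuple{4,Float64}) = map(+, a, b)
# @code_llvm add4((1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0))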
The @simd macro is not particularly useful. In most cases where I see people use it, it's superfluous: they're either applying it to a loop that has no hope of being vectorized, or to a loop that would already get vectorized automatically.
I tried to install Mojo on an Ubuntu 20.04 server, but the installation was still going after a couple of hours, so I stopped it. I think it's best to wait a bit for this to iron out.
If this were about efficiently computing the Mandelbrot set, rather than a comparison with Mojo, then using a different algorithm would give even more speed. Checking for the cardioid or the bulb for starters, then periodicity checking, then Mariani-Silver followed by border tracing seems a good way to go. I'm not a Julia expert, so I can't contribute code, but Wikipedia has a great overview of possible algorithms (Plotting algorithms for the Mandelbrot set - Wikipedia), and Rico Mariani explains his algorithm here: The Mariani-Silver Algorithm for Drawing the Mandelbrot Set | by Rico Mariani | Medium
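To make the first of those concrete, the cardioid and period-2 bulb tests are only a few lines. A sketch using the standard formulas from that Wikipedia page (in_cardioid_or_bulb is an illustrative name):

# points inside the main cardioid or the period-2 bulb provably never
# escape, so a kernel can return MAX_ITERS for them without iterating
function in_cardioid_or_bulb(c::Complex)
    x, y = real(c), imag(c)
    q = (x - 0.25)^2 + y^2
    q * (q + (x - 0.25)) <= 0.25 * y^2 && return true  # main cardioid
    return (x + 1)^2 + y^2 <= 0.0625                   # period-2 bulb
end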
To expand on what others have said, @simd does one thing: it allows reassociating instructions in reductions. Normally, vectorizing something like a sum would be illegal, because that requires changing the order of your additions. @simd makes that legal.
@simd ivdep does a little more, promising to LLVM that the loop iterations may be executed in parallel. When vectorization fails because of possible aliasing, @simd ivdep helps.
Neither of those was a problem for @sdanisch's code, so neither @simd nor @simd ivdep was needed.
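As a minimal illustration of the reduction point (mysum is an illustrative name, not from the thread):

# without @simd, the strict left-to-right evaluation order of the `+`
# chain blocks vectorization; @simd permits reassociation, letting LLVM
# accumulate in vector registers
function mysum(x::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end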
Back in the day I had a screensaver for BeOS that did border checking and splitting; it showed off the (then new) multiprocessing capabilities of BeOS. The algorithm was to start with a square in the plane, divide it into 4 quarters, and check the edge pixels. If they were all the same color, color the interior a constant color; if not, split the square into 4 squares and repeat, until you were checking fewer than some small number of pixels, and then just check all the pixels in the square (I think maybe 8x8, but on modern hardware it might be worth doing 16x16 or 32x32 or something; at the time the machine was clocked at about 200 MHz or less).
The process operated with a queue (the equivalent of a Julia Channel) and N worker threads, one for each core. Each worker thread would read a rectangle to check from the queue and either fill it with a constant color, split it and push the new rectangles into the work queue, or, when it was small enough, just fill the whole thing by checking each pixel. Drawing was done into one pixel buffer, and a separate worker thread would occasionally blit that pixel buffer to the screensaver window… so you'd see the algorithm proceed in real time.
It was called MandelSaver, this would have been about 1995 or so. Good times.
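For reference, the subdivision scheme described above could look roughly like this in Julia. This is a sketch only: pixel_to_complex, border_value, and fill_or_split! are illustrative names, reusing the scalar mandelbrot_kernel and the grid constants from earlier posts.

# Sketch of the subdivision scheme: compute a tile's border; if it is one
# uniform value, fill the interior, otherwise split into quadrants.
pixel_to_complex(i, j) = complex(xmin + (i - 1) * (xmax - xmin) / (xn - 1),
                                 ymin + (j - 1) * (ymax - ymin) / (yn - 1))

function border_value(x0, y0, w, h)
    v = mandelbrot_kernel(pixel_to_complex(x0, y0))
    for i in x0:x0+w-1, j in (y0, y0 + h - 1)
        mandelbrot_kernel(pixel_to_complex(i, j)) == v || return nothing
    end
    for j in y0:y0+h-1, i in (x0, x0 + w - 1)
        mandelbrot_kernel(pixel_to_complex(i, j)) == v || return nothing
    end
    return v
end

function fill_or_split!(img, x0, y0, w, h)
    if w * h <= 64                             # small tile: compute every pixel
        for j in y0:y0+h-1, i in x0:x0+w-1
            img[j, i] = mandelbrot_kernel(pixel_to_complex(i, j))
        end
    elseif (v = border_value(x0, y0, w, h)) !== nothing
        img[y0:y0+h-1, x0:x0+w-1] .= v         # uniform border: fill interior
    else                                        # split into 4 quadrants, recurse
        hw, hh = w ÷ 2, h ÷ 2
        fill_or_split!(img, x0,      y0,      hw,     hh)
        fill_or_split!(img, x0 + hw, y0,      w - hw, hh)
        fill_or_split!(img, x0,      y0 + hh, hw,     h - hh)
        fill_or_split!(img, x0 + hw, y0 + hh, w - hw, h - hh)
    end
    return img
end

# usage: fill_or_split!(zeros(yn, xn), 1, 1, xn, yn)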
Anyway, the point of all that was to say that the real advantage of Julia isn't that you can extract every tiny bit of raw speed out of the CPU; it's that you can express more complex algorithms in a way your brain can still understand, without sacrificing most of the speed. If you can get 80% of the raw speed but it lets you use a 200x faster algorithm, that's a huge win. Sometimes raw speed is needed, of course, when you're already using the fastest known algorithm.
Am I right in thinking that the main issue for speed here is the branch abs2(z) > 4?
In this case, could we make a package to handle this branching in parallel loops?
Yes, the main issue is that the branch exits the loop.
Also, the optimal loop to SIMD is not the innermost one. LLVM is really bad whenever that's the case, so it needs help.
SIMD.jl and LoopVectorization.jl can both help there, but LoopVectorization.jl also doesn't handle early-stopping loops like that branch, which is why it needs a similar workaround for mandelbrot_kernel.
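To make that workaround concrete, here is a minimal scalar model of the same trick in plain Julia (kernel_lanes is an illustrative name; the real code above uses VectorizationBase's Mask and vall instead of tuples of Bools):

# four "lanes" advance together; `done` freezes a lane's iteration count,
# and the loop exits only once every lane has escaped, mirroring the
# masked Vec method of mandelbrot_kernel above
function kernel_lanes(cs::NTuple{4,ComplexF64})
    zs = cs
    done = ntuple(_ -> false, 4)
    iters = ntuple(_ -> 0, 4)
    for iter in 1:MAX_ITERS
        zs = map((z, c) -> muladd(z, z, c), zs, cs)
        iters = map((d, n) -> d ? n : iter, done, iters)
        done = map((d, z) -> d | (abs2(z) > 4), done, zs)
        all(done) && return iters
    end
    return iters
end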