This was actually concerning MATLAB syntax: I recommend avoiding [0:N] in MATLAB, since the square brackets just perform a redundant concatenation of the already-complete vector 0:N.
But one of my points, as I show, is that JADekker's allocating code is faster than a zero-allocation sum call, so general principles aren't universal. While reducing allocations does save work, allocating dense values can enable performance benefits; base Julia acknowledges this by making indexing syntax do the copying getindex instead of the non-copying view. In this case, parallelization by SIMD or multiple threads/cores likely explains the initial performance gap. It's not exactly documented, but I have it on good authority that MATLAB's sum, cos, and exp have been multithreaded for almost two decades.
Automatic multithreading does have its complications. For one, you often can't multithread a scalar operation, so it's really the higher-order map or reduction that you want to multithread. For another, you don't really want to multithread an already-multithreaded operation, because the multiplied threading overhead won't be compensated by more cores. That's partially why base MATLAB has inherently vectorized operations; it's more readable and less complicated to multithread separate matrix operations. Julia, however, doesn't have many of those: even the similar-looking dotted operation syntax lowers to a higher-order function calling scalar operations, similar to MATLAB's bsxfun. Similar readability to dotted syntax can also be achieved by macros.
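For illustration, a dotted expression lowers to a single fused broadcast over the scalar kernel, much like a higher-order map (a minimal demonstration; the exact lowered form can be inspected with Meta.@lower):
julia> x = [0.25, 0.5, 0.75];
julia> cos.(2 .* pi .* x) == broadcast(xi -> cos(2 * pi * xi), x)  # dots become one broadcast call over a scalar kernel
true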
Can we multithread easily in Julia? Let's give ThreadsX.jl a shot, as yolhan_mannes suggested:
julia> using ThreadsX; Threads.nthreads()
8
julia> @time f_noalloc(10_000) # same as my earlier implementation
3.937923 seconds
julia> @time f_threaded(10_000) # switches _noalloc code to ThreadsX.sum, allocates
1.191136 seconds (5.08 M allocations: 433.655 MiB, 15.10% gc time)
julia> @time f_noalloc_cispi(10_000)
2.064730 seconds
julia> @time f_cispi_threaded(10_000)
0.945237 seconds (5.08 M allocations: 433.655 MiB, 19.76% gc time)
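For context, the f_noalloc and f_threaded implementations aren't shown in this thread. A plausible sketch of their shape, reconstructed from the fe_cispi_threaded definition further down (the names, kernel constants, and exp form are assumptions, not the exact original code):
julia> function f_noalloc(N)   # assumed shape: serial outer loop, zero-allocation inner sum
           for K in 1:N
               sum(x -> cos(2*pi*x/N) * exp(-im*pi*12*K/N*x), 0:N)/N
           end
       end;
julia> function f_threaded(N)  # same kernel, inner reduction swapped to ThreadsX.sum
           for K in 1:N
               ThreadsX.sum(x -> cos(2*pi*x/N) * exp(-im*pi*12*K/N*x), 0:N)/N
           end
       end;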
Changing sum to ThreadsX.sum gave a 3.31x or 2.18x speedup on 8 available Intel cores, depending on whether I used cispi, on par with the initial observed discrepancy between MATLAB and Julia and far outpacing the speed costs of the introduced allocations. By comparison, the M2 Max has 8 performance cores and 4 efficiency cores, and MATLAB users have suggested all 12 are used. In the original example, though, N=100_000, so let's also check that:
julia> @time f_cispi_threaded(100_000)
62.908176 seconds (50.80 M allocations: 4.236 GiB, 3.99% gc time)
Yes, this is what I referred to regarding other "subtleties". But with some extra work you can both avoid allocations and get the other optimizations.
Are there any optimizations that require doing two passes?
I wonder if ThreadsX can cut down on the millions of allocations, 80ish bytes on average. I don't expect zero, because we need something to hold the intermediate values of the parallel reduction, but I naively expect some possible improvement.
There are some programs where multiple passes help a lot, but in this case, where we can fuse the elementwise operations into one kernel, I'm leaning toward no. I may be wrong, but I also don't expect computing that kernel alongside the reduction step to lose performance. AFAIK MATLAB is technically free to JIT this, but I don't know how to find out whether the JIT is capable of that, and MATLAB user discussions don't seem to know much more. Not gonna buy it again to dig further.
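To make the fusion point concrete: the functional sum evaluates the kernel inside the reduction with no intermediate array, while the dotted version, even though it fuses into a single broadcast, still materializes a length-(N+1) array before summing (a sketch with placeholder arguments; compare the two with @btime or @allocated):
julia> fused(N, K) = sum(x -> cos(2*pi*x/N) * cispi(-12*K/N*x), 0:N)/N;  # kernel runs inside the reduction, no temporaries
julia> unfused(N, K) = sum(cos.(2*pi .* (0:N) ./ N) .* cispi.((-12*K/N) .* (0:N)))/N;  # one fused broadcast, but it materializes an array first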
In a real case you wouldn't use ThreadsX inside the loop but rather on the outermost loop. This was just for benchmark purposes.
Oh, that's a good catch: multithreading an inner loop would let the outer loop multiply the threads, even if the separate iterations stop them from competing for cores. I had to change to ThreadsX.foreach, which would be a problem if K were reassigned, but it's not:
julia> function fe_cispi_threaded(N)
           ThreadsX.foreach(1:N) do K
               sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
           end
       end
fe_cispi_threaded (generic function with 1 method)
julia> @time fe_cispi_threaded(10_000) # first run compiles too
0.502951 seconds (67.76 k allocations: 3.455 MiB, 74.38% compilation time)
julia> @time fe_cispi_threaded(10_000)
0.456593 seconds (395 allocations: 42.234 KiB)
julia> @time fe_cispi_threaded(100_000)
55.192057 seconds (395 allocations: 42.234 KiB)
Allocations have a worse impact in multithreading, so the gains from allocating are more nuanced. The no-allocation sum implementation does outpace the very-allocating all-dotted implementation if we multithread the outer loop (and introduce those allocations). Hoisting the vector allocations out of the loop by preallocation regains the speed advantage and returns the allocated-memory scaling to the expected 10x instead of 100x (a sketch of the hoisting idea follows the timings below).
...
julia> @time fe_dots_threaded(10_000) # ThreadsX.foreach JADekker's f_dots
0.757834 seconds (30.39 k allocations: 1.492 GiB, 14.15% gc time)
julia> @time fe_dots_threaded(100_000)
113.574487 seconds (300.39 k allocations: 149.024 GiB, 15.14% gc time)
...
julia> @time fe_dots_and_prealloc_cispi_threaded(10_000) # ThreadsX f_dots_and_prealloc_cispi
0.322095 seconds (402 allocations: 277.920 KiB)
julia> @time fe_dots_and_prealloc_cispi_threaded(100_000)
37.018206 seconds (402 allocations: 2.331 MiB, 0.00% gc time)
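The preallocated fe_dots_and_prealloc_cispi_threaded is elided above; here is a minimal sketch of the hoisting idea, using Base.@threads :static with per-thread scratch buffers (an assumed shape for illustration only; the actual code used ThreadsX, and dot comes from LinearAlgebra):
julia> using LinearAlgebra
julia> function fe_prealloc_sketch(N)
           x = 0:N
           # per-thread scratch buffers, allocated once; :static scheduling keeps
           # threadid stable within an iteration, so each task owns its buffers
           cb = [Vector{Float64}(undef, N + 1) for _ in 1:Threads.nthreads()]
           eb = [Vector{ComplexF64}(undef, N + 1) for _ in 1:Threads.nthreads()]
           Threads.@threads :static for K in 1:N
               c, e = cb[Threads.threadid()], eb[Threads.threadid()]
               c .= cos.(2 .* pi .* x ./ N)     # dotted kernels write into the buffers in place
               e .= cispi.((-12 * K / N) .* x)
               dot(c, e) / N                    # non-allocating; conj is a no-op on the real c
           end
       end;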
For threading on this level, ThreadsX probably does not have a great advantage relative to Base.@threads (or OhMyThreads.jl alternatives, FWIW).
Perhaps better to avoid the allocations, and maybe pull the divisions out of the loop?
N⁻¹ = 1/N
sum(n -> cospi(2n * N⁻¹) * cispi(-2n * K * N⁻¹), 0:N) * N⁻¹
? (In a function, of course, not with globals.)
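Wrapped into a function as suggested, it might look like this (a sketch that keeps the suggested kernel verbatim and reuses the fe_cispi_threaded loop shape):
julia> function fe_cispi_threaded_nodiv(N)
           N⁻¹ = 1/N   # hoisted reciprocal: multiply inside the loop instead of dividing
           ThreadsX.foreach(1:N) do K
               sum(n -> cospi(2n * N⁻¹) * cispi(-2n * K * N⁻¹), 0:N) * N⁻¹
           end
       end;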
Probably. I mostly stuck to ThreadsX to limit the changes for comparison's sake, because I'm not actually sure how ThreadsX does its multithreading. For example, Polyester.jl does multithreading more cheaply than Base.@threads, but I don't use it and wouldn't know if it could apply to this case. To add to the point about MATLAB's approach being less complicated: it let users write straightforward matrix operations and abstracted away the mental load of choosing multithreading approaches and where to put them, at the cost of less control. But that praise/criticism of MATLAB is nothing new.
Answering my own question:
julia> using ThreadsX, OhMyThreads, Base.Threads
julia> function fe_cispi_threaded(N)
           ThreadsX.foreach(1:N) do K
               sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
           end
       end
fe_cispi_threaded (generic function with 1 method)
julia> function fe_cispi_threaded2(N)
           @threads for K in 1:N
               sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
           end
       end
fe_cispi_threaded2 (generic function with 1 method)
julia> function fe_cispi_threaded3(N)
           tforeach(1:N) do K
               sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
           end
       end
fe_cispi_threaded3 (generic function with 1 method)
julia> @time fe_cispi_threaded(10_000)
0.226913 seconds (68.58 k allocations: 3.492 MiB, 114.83% compilation time)
julia> @time fe_cispi_threaded2(10_000)
0.274059 seconds (68.97 k allocations: 3.501 MiB, 116.73% compilation time)
julia> @time fe_cispi_threaded3(10_000)
0.293808 seconds (22.24 k allocations: 1.066 MiB, 89.68% compilation time)
julia> @time fe_cispi_threaded(100_000)
20.061630 seconds (491 allocations: 44.734 KiB)
julia> @time fe_cispi_threaded2(100_000)
24.175890 seconds (52 allocations: 5.297 KiB)
julia> @time fe_cispi_threaded3(100_000)
23.827874 seconds (76 allocations: 5.906 KiB)
ThreadsX is somewhat faster and allocates somewhat more.
Why cispi but not cospi?
That I simply copied from above.
I know that multithreading throws off the compilation time count, but it's still jarring to see.
I think there was a tacit agreement not to change that part from the original MATLAB/Julia examples, while there was a benchmark indicating that MATLAB was doing something like cis for exp with imaginary input.
In the grand scheme, the choice of these functions wasn't important because the gains are too modest to explain the initial discrepancy.
julia> @btime cos(pi*$3.4)
7.700 ns (0 allocations: 0 bytes)
-0.30901699437494784
julia> @btime cospi($3.4)
7.200 ns (0 allocations: 0 bytes)
-0.3090169943749477
julia> 7.7/7.2 # only 7% gain
1.0694444444444444
julia> @btime exp(im*pi*$3.4)
13.814 ns (0 allocations: 0 bytes)
-0.30901699437494784 - 0.9510565162951534im
julia> @btime cis(pi*$3.4)
10.711 ns (0 allocations: 0 bytes)
-0.30901699437494784 - 0.9510565162951534im
julia> @btime cispi($3.4)
10.200 ns (0 allocations: 0 bytes)
-0.3090169943749477 - 0.9510565162951535im
julia> 13.814/10.711, 13.814/10.2 # 29% or 35% gain
(1.289702175333769, 1.3543137254901962)
Multithreading is the likely difference because its improvements to the Julia code are in the same ballpark. From what I could tell on the MathWorks forums, how MATLAB multithreads is mostly a black box to MATLAB users, too.
It was very tacit, and it stuck out like a sore thumb through multiple posts, to the point where I had to know what was going on. Of course, it's not significant for performance, but I think it merited a mention, because it was getting very noticeable.
But also, according to the same logic, it should be cis, not cispi.
(For some of us, these sorts of things are like the sound of a scratched record, while for others they barely register.)