Simple speed comparison with Matlab

This was actually about MATLAB syntax: I recommend avoiding [0:N] in MATLAB.

But one of my points, as I showed, is that JADekker’s allocating code is faster than a zero-allocation sum call, so general principles aren’t universal. While reducing allocations does save work, allocating dense values can enable performance benefits; base Julia acknowledges this by making indexing syntax call the copying getindex instead of view. In this case, parallelization via SIMD or multiple threads/cores likely explains the initial performance gap. It’s not exactly documented, but I have it on good authority that MATLAB’s sum, cos, and exp have been multithreaded for almost two decades.

Automatic multithreading does have its complications. For one, you often can’t multithread a scalar operation, so it’s really the higher-order map or reduction that you want to multithread. For another, you don’t want to multithread an already multithreaded operation, because the multiplied threading overhead won’t be compensated by more cores. That’s partially why base MATLAB has inherently vectorized operations: it’s more readable and less complicated to multithread separate matrix operations. Julia, however, doesn’t have many of those; even the similar-looking dotted syntax lowers to a higher-order function calling scalar operations, similar to MATLAB’s bsxfun. Similar readability to dotted syntax can also be achieved with macros.
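
For illustration, dotted syntax fuses into one broadcast call over a scalar kernel; a quick REPL check with made-up values:

julia> x = 0:4; N = 4;

julia> cos.(2 .* pi .* x ./ N) == broadcast(xi -> cos(2*pi*xi/N), x) # same fused scalar kernel per element
true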

Can we multithread easily in Julia? Let’s give ThreadsX.jl a shot, as yolhan_mannes suggested:

julia> using ThreadsX; Threads.nthreads()
8

julia> @time f_noalloc(10_000) # same as my earlier implementation
  3.937923 seconds

julia> @time f_threaded(10_000) # switches _noalloc code to ThreadsX.sum, allocates
  1.191136 seconds (5.08 M allocations: 433.655 MiB, 15.10% gc time)

julia> @time f_noalloc_cispi(10_000)
  2.064730 seconds

julia> @time f_cispi_threaded(10_000)
  0.945237 seconds (5.08 M allocations: 433.655 MiB, 19.76% gc time)
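
(For readers skimming: in sketch form, f_noalloc and f_threaded are the sum-based double loop from earlier, roughly like this, with the kernel constants following the cispi variant defined below:

# Rough sketch only; exact kernel as defined earlier in the thread.
function f_noalloc(N)
    for K in 1:N
        sum(x -> cos(2*pi*x/N) * exp(-im*pi*12 * K/N * x), 0:N)/N
    end
end

function f_threaded(N)  # same loop, reduction swapped to ThreadsX.sum
    for K in 1:N
        ThreadsX.sum(x -> cos(2*pi*x/N) * exp(-im*pi*12 * K/N * x), 0:N)/N
    end
end

The _cispi versions replace exp(-im*pi*(...)) with cispi(-(...)).)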

Changing sum to ThreadsX.sum gave a 3.31x or 2.18x speedup on my 8 available Intel cores, depending on whether I used cispi; that’s on par with the initially observed discrepancy between MATLAB and Julia, and it far outpaces the cost of the introduced allocations. By comparison, the M2 Max has 8 performance cores and 4 efficiency cores, and MATLAB users have suggested all 12 are used. In the original example, though, N=100_000, so let’s also check that:

julia> @time f_cispi_threaded(100_000)
 62.908176 seconds (50.80 M allocations: 4.236 GiB, 3.99% gc time)

Yes, this is what I referred to regarding other ‘subtleties’. But with some extra work you can both avoid allocations and keep the other optimizations.

Are there any optimizations that require doing two passes?

I wonder if ThreadsX can cut down on the millions of allocations, 80-ish bytes on average. I don’t expect zero, because we need something to hold the intermediate values of the parallel reduction, but I naively expect there’s room for improvement.
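
If I had to guess where to poke first: ThreadsX reducers take a basesize keyword controlling how finely the input is chunked, and coarser chunks should mean fewer intermediate tasks and allocations. Untested for this case, but something like:

ThreadsX.sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N; basesize = cld(N + 1, Threads.nthreads()))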

There are some programs where multiple passes help a lot, but in this case, where we can fuse the elementwise operations into one kernel, I’m leaning toward no. I may be wrong, but I also don’t expect computing that kernel inline with the reduction step to lose performance. AFAIK MATLAB is technically free to JIT-compile this, but I don’t know how to find out whether its JIT is capable of that, and MATLAB user discussions don’t seem to know much more. Not gonna buy it again to dig further.

In a real case you wouldn’t use ThreadsX inside the loop but rather on the outermost loop. This was just for benchmarking purposes.


Oh, that’s a good catch: multithreading an inner loop lets the outer loop multiply the threads, even if the separate iterations keep them from competing for cores. I had to switch to ThreadsX.foreach, which would be a problem if K were reassigned, but it’s not:

julia> function fe_cispi_threaded(N)
         ThreadsX.foreach(1:N) do K
           sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
         end
       end
fe_cispi_threaded (generic function with 1 method)

julia> @time fe_cispi_threaded(10_000) # first run compiles too
  0.502951 seconds (67.76 k allocations: 3.455 MiB, 74.38% compilation time)

julia> @time fe_cispi_threaded(10_000)
  0.456593 seconds (395 allocations: 42.234 KiB)

julia> @time fe_cispi_threaded(100_000)
 55.192057 seconds (395 allocations: 42.234 KiB)

Allocations have a worse impact under multithreading, so the gains from allocating are more nuanced. The no-allocation-sum implementation does outpace the heavily allocating all-dotted implementation once we multithread the outer loop (and introduce those allocations). Hoisting the vector allocations out of the loop via preallocation regains the speed advantage and returns the allocated-memory scaling to the expected 10x instead of 100x.

...
julia> @time fe_dots_threaded(10_000) # ThreadsX.foreach JADekker's f_dots
  0.757834 seconds (30.39 k allocations: 1.492 GiB, 14.15% gc time)

julia> @time fe_dots_threaded(100_000)
113.574487 seconds (300.39 k allocations: 149.024 GiB, 15.14% gc time)
...
julia> @time fe_dots_and_prealloc_cispi_threaded(10_000) # ThreadsX f_dots_and_prealloc_cispi
  0.322095 seconds (402 allocations: 277.920 KiB)

julia> @time fe_dots_and_prealloc_cispi_threaded(100_000)
 37.018206 seconds (402 allocations: 2.331 MiB, 0.00% gc time)
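
fe_dots_and_prealloc_cispi_threaded isn’t spelled out above; in sketch form (the chunking details are an assumption, and the actual implementation may chunk differently), the pattern gives each chunk of K values its own scratch buffer so tasks never share one:

function fe_dots_and_prealloc_cispi_threaded(N)
    x = 0:N
    chunks = collect(Iterators.partition(1:N, cld(N, Threads.nthreads())))
    ThreadsX.foreach(chunks) do Ks
        buf = Vector{ComplexF64}(undef, N + 1)  # hoisted out of the K loop, one buffer per chunk
        for K in Ks
            @. buf = cos(2 * pi * x / N) * cispi(-12 * K / N * x)
            sum(buf) / N
        end
    end
end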

For threading at this level, ThreadsX probably does not have a great advantage over Base.@threads (or the OhMyThreads.jl alternatives, fwiw).


Perhaps better to avoid the allocations, and maybe pull the divisions out of the loop?

N⁻¹ = 1/N
sum(n -> cospi(2n * N⁻¹) * cispi(-2n * K * N⁻¹), 0:N) * N⁻¹

? (In a function, of course, not with globals.)
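
For instance (the name and the threading wrapper are just for illustration, reusing ThreadsX.foreach from above):

function fe_hoisted_threaded(N)
    N⁻¹ = 1/N
    ThreadsX.foreach(1:N) do K
        sum(n -> cospi(2n * N⁻¹) * cispi(-2n * K * N⁻¹), 0:N) * N⁻¹
    end
end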


Probably. I mostly stuck to ThreadsX to limit the changes for comparison’s sake, because I’m not actually sure how ThreadsX does its multithreading. For example, Polyester.jl does multithreading more cheaply than Base.@threads, but I don’t use it and wouldn’t know if it applies to this case. To add to the point about MATLAB’s approach being less complicated: it lets users write straightforward matrix operations and abstracts away the mental load of choosing multithreading approaches and where to put them, at the cost of less control. But that praise/criticism of MATLAB is nothing new.
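
If someone who does use it wants to try, I’d guess the drop-in shape is Polyester’s @batch loop macro, something like (untested, name hypothetical):

using Polyester

function fe_cispi_batched(N)
    @batch for K in 1:N
        sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
    end
end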


Answering my own question:

julia> using ThreadsX, OhMyThreads, Base.Threads

julia> function fe_cispi_threaded(N)
         ThreadsX.foreach(1:N) do K
           sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
         end
       end
fe_cispi_threaded (generic function with 1 method)

julia> function fe_cispi_threaded2(N)
         @threads for K in 1:N
           sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
         end
       end
fe_cispi_threaded2 (generic function with 1 method)

julia> function fe_cispi_threaded3(N)
         tforeach(1:N) do K
           sum(x -> cos(2*pi*x/N) * cispi(-12 * K/N * x), 0:N)/N
         end
       end
fe_cispi_threaded3 (generic function with 1 method)

julia> @time fe_cispi_threaded(10_000)
  0.226913 seconds (68.58 k allocations: 3.492 MiB, 114.83% compilation time)

julia> @time fe_cispi_threaded2(10_000)
  0.274059 seconds (68.97 k allocations: 3.501 MiB, 116.73% compilation time)

julia> @time fe_cispi_threaded3(10_000)
  0.293808 seconds (22.24 k allocations: 1.066 MiB, 89.68% compilation time)

julia> @time fe_cispi_threaded(100_000)
 20.061630 seconds (491 allocations: 44.734 KiB)

julia> @time fe_cispi_threaded2(100_000)
 24.175890 seconds (52 allocations: 5.297 KiB)

julia> @time fe_cispi_threaded3(100_000)
 23.827874 seconds (76 allocations: 5.906 KiB)

ThreadsX is somewhat faster and allocates somewhat more.


Why cispi but not cospi?

That I simply copied from above.

I know that multithreading throws off the compilation time count, but it’s still jarring to see.

I think there was a tacit agreement not to change that part of the original MATLAB/Julia examples, even though a benchmark indicated that MATLAB was doing something like cis for an exp with imaginary input.

In the grand scheme, the choice of these functions wasn’t important, because the gains are too modest to explain the initial discrepancy:

julia> @btime cos(pi*$3.4)
  7.700 ns (0 allocations: 0 bytes)
-0.30901699437494784

julia> @btime cospi($3.4)
  7.200 ns (0 allocations: 0 bytes)
-0.3090169943749477

julia> 7.7/7.2 # only 7% gain
1.0694444444444444

julia> @btime exp(im*pi*$3.4)
  13.814 ns (0 allocations: 0 bytes)
-0.30901699437494784 - 0.9510565162951534im

julia> @btime cis(pi*$3.4)
  10.711 ns (0 allocations: 0 bytes)
-0.30901699437494784 - 0.9510565162951534im

julia> @btime cispi($3.4)
  10.200 ns (0 allocations: 0 bytes)
-0.3090169943749477 - 0.9510565162951535im

julia> 13.814/10.711, 13.814/10.2 # 29% or 35% gain
(1.289702175333769, 1.3543137254901962)

Multithreading is the likely difference, because its improvements to the Julia code are in the same ballpark. From what I could tell on the MathWorks forums, how MATLAB multithreads is mostly a black box to MATLAB users, too.

It was very tacit, and it stuck out like a sore thumb across multiple posts, to the point where I had to know what was going on. Of course, it’s not significant for performance, but I think it merited a mention because it was getting very noticeable.

But also, by the same logic, it should be cis, not cispi.

(For some of us, these sorts of things are like the sound of a scratched record, while for others they barely register.)
