Threads (including Polyester) question

With threads I get worse timings (even in the best case), and also wrong values: not pi, or pi only if I’m lucky. I had already sped this benchmark up more than 5x, and I’m curious whether threading in some form could make it even faster:

julia> @time f(100000000)
  0.952234 seconds (25.32 M allocations: 397.933 MiB, 29.18% gc time, 27.44% compilation time)

julia> @time f(100000000)
  0.270784 seconds (25.00 M allocations: 381.478 MiB)

julia> @time f(100000000)
  0.689246 seconds (25.00 M allocations: 381.478 MiB, 46.37% gc time)

vs. 0.070 sec without threading (and with no allocations; I guess the allocations with threads are part of the problem, if not all of it):

julia> function f(rounds)
           pi = 1.0
           x  = -1.0
           r2 = rounds + 2
           vend = r2 - r2 % 8
           # NOTE: every thread updates `pi` without synchronization (the race discussed below).
           Threads.@threads for i in 2*2:8*2:(r2*2)
               # Common denominators method, half as many divisions:
               pi += Float64(
                     -2.0f0 / fma(i, i, -1.0f0) +
                     # x / (2.0 * i + 1.0) +
                     -2.0f0 / (fma(i, i, 15.0f0) + 8f0i) +
                     # x / (2.0 * i + 5.0)
                     -2.0f0 / (fma(i, i, 63f0) + 16f0i) +
                     # x / (2.0 * i + 9.0) +
                     -2.0f0 / (fma(i, i, 143f0) + 24f0i))
                     # x / (2.0 * i + 13.0)
           end
           # Remaining terms, summed serially:
           for i in vend+1:r2
               pi += x / (2.0 * (i + 0.0) - 1.0)
               x = -x
           end
           return pi*4
       end

I realize I have a race condition, but that shouldn’t slow things down, just make the answer invalid…

With Polyester.jl threads (replacing Threads.@threads with @batch) I always get slower timings than the best case with Threads.@threads: 0.77 sec, with as many allocations.

Is it unrealistic a) to make either type of threading faster than the non-threaded version (I suppose by not allocating), and b) to get the correct value at the same time?

I’m hit by the accumulator bug, as explained in the excellent Guy Steele talk How to Think about Parallel Programming: Not! (it also has interesting history; begins with his intriguing “ugliest” ever program).

I did try running the loop backwards to see if I get the same (or a more accurate) value, which he states at 1:02:35 is a requirement for (automatic) parallel speedup.

It seemed to me that running the loop backwards (serially, no threads) should give me a more accurate result. There wasn’t much of a difference, but if I recall correctly it was actually a bit less accurate.
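A minimal sketch of the forward-versus-backward comparison, assuming a plain Leibniz series rather than the optimized kernel above (the helper name `leibniz_ordered` is hypothetical). With a series like this, the truncation error (roughly 1/n) usually dwarfs the rounding error, which may be why the summation order makes little visible difference:

```julia
# Hypothetical helper: sum Leibniz terms in exactly the order given by `ks`.
# An explicit loop is used because `sum` on a range may reorder (pairwise) additions.
function leibniz_ordered(ks)
    s = 0.0
    for k in ks
        s += (iseven(k) ? 1.0 : -1.0) / (2k + 1)
    end
    return 4s
end

n = 1_000_000
forward  = leibniz_ordered(0:n-1)       # largest terms first
backward = leibniz_ordered(n-1:-1:0)    # smallest terms first

println("forward error:  ", abs(forward - π))
println("backward error: ", abs(backward - π))
println("difference:     ", abs(forward - backward))
```

If the two results agree to within rounding, the sum behaves as if addition were associative for this input, which is the property the talk asks for before splitting a reduction across threads.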

Is it c) unrealistic that Julia would ever auto-parallelize such code (and, as a bonus, keep it correct), i.e. when Threads.@threads is omitted? I previously had @simd there; it was in the original, and I suppose it was key to making that fast, but after my optimization it actually makes the code a bit slower… I’m not sure why.

I thought a first step to recover correctness would be:

pi = Threads.Atomic{Float64}(1.0)

and it didn’t work, even with the corresponding change to the update line:

pi += Threads.Atomic{Float64}(
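For reference, a minimal sketch of how a `Threads.Atomic` is normally updated. `+=` rebinds the variable instead of modifying the atomic in place, so the update has to go through `Threads.atomic_add!`, and the value is read back with `acc[]`. The function name `leibniz_atomic` and the plain Leibniz series are assumptions for illustration, not the optimized kernel above:

```julia
# Hypothetical example: Leibniz series with an atomic accumulator.
function leibniz_atomic(n)
    acc = Threads.Atomic{Float64}(0.0)
    Threads.@threads for k in 0:n-1
        term = (iseven(k) ? 1.0 : -1.0) / (2k + 1)
        Threads.atomic_add!(acc, term)   # atomic read-modify-write, no lost updates
    end
    return 4 * acc[]                     # acc[] reads the current value
end

println(leibniz_atomic(100_000))
```

This should remove the race, but probably not the slowdown, since every iteration still contends for the same memory location.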

Bonus question, when adding Polyester.jl, I get:

  Downloaded artifact: Qt5Base
    Updating `~/.julia/environments/v1.8/Project.toml`
  [f517fe37] + Polyester v0.6.16
    Updating `~/.julia/environments/v1.8/Manifest.toml`
  [ea2cea3b] ↑ Qt5Base_jll v5.15.3+1 ⇒ v5.15.3+2
Precompiling project...
  Progress [====================================>    ]  8/9
  ◓ Plots

It’s rather strange, since Qt5Base (and Plots) are totally unrelated, so why does Pkg download extra packages…? I suppose it’s because I don’t use environments (or because I have lots of packages in my “global” one).

Thanks for Polyester.jl “The cheapest threads you can find!”, or in case I (someone) gets it to work faster/cheap. It reminded me of in case you like Icelandic music: (maybe more famous is their Ladyshave song, I haven’t kept up with their recent stuff).

The accumulator is shared between threads, causing slow speed and wrong answers.
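One common way around a shared accumulator is to give each task its own local sum and combine the partial results at the end. The sketch below assumes a plain Leibniz series and a hypothetical function name `leibniz_chunked`; it is not the optimized kernel from the question. A likely contributor to the allocations in the threaded version is that `pi` is reassigned inside the closure that `Threads.@threads` creates, which makes Julia box the variable.

```julia
# Hypothetical example: one local accumulator per task, combined at the end.
function leibniz_chunked(n; ntasks = Threads.nthreads())
    # Split 0:n-1 into at most `ntasks` contiguous chunks.
    chunks = Iterators.partition(0:n-1, cld(n, ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            s = 0.0                      # local to this task, so no sharing and no race
            for k in chunk
                s += (iseven(k) ? 1.0 : -1.0) / (2k + 1)
            end
            s
        end
    end
    # Combine the partial sums once all tasks have finished.
    return 4 * sum(t -> fetch(t)::Float64, tasks)
end

println(leibniz_chunked(1_000_000))
```

Because the partial sums are added in a different order than in a serial loop, the result can differ from the serial one in the last few bits.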

Try @tturbo (from LoopVectorization.jl). That should do the correct thing.

julia> @btime f(100000)
  25.320 μs (0 allocations: 0 bytes)

julia> @btime fturbo(100000)
  7.160 μs (0 allocations: 0 bytes)

This is with 4 threads.
Accuracy is similar:

julia> π - fturbo(100000)

julia> π - f(100000)