Default sin / cos functions do not scale on multithreading

I previously created a post where I described issues with my algorithm when running on multiple threads.

I think I finally found the issue:
I am using the default sin / cos functions and this post:
https://github.com/JuliaLang/julia/issues/17395
led me to the idea that sin / cos could be causing the issue.
When I temporarily removed the sin / cos calls (although they are definitely required for the algorithm), the algorithm performed well with multiple threads.

That's why I ran a few tests with this sample code:

function test()
    setprecision(1024)
    Threads.@threads for i=1:10000000
        sin(BigFloat(i))^2+cos(BigFloat(i))^2
    end
end

It confirmed that this function is not threadsafe:
1 Threads: 347.428779 seconds (552.06 M allocations: 48.274 GiB, 0.93% gc time)
4 Threads: 118.963835 seconds (525.86 M allocations: 43.746 GiB, 4.19% gc time)
28 Threads: 137.302679 seconds (211.11 M allocations: 10.939 GiB, 8.41% gc time)
40 Threads: 125.110352 seconds (201.09 M allocations: 10.696 GiB, 10.58% gc time)
56 Threads: 119.466149 seconds (195.01 M allocations: 10.744 GiB, 11.99% gc time)
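For what it's worth, the allocation counts above suggest each call allocates heavily: every MPFR operation returns a freshly allocated BigFloat, so the loop spends much of its time in allocation and garbage collection rather than arithmetic. A quick sketch to see the per-call allocation cost (exact byte counts will vary by Julia version):

```julia
setprecision(1024)
x = BigFloat(12345)

sin(x)  # warm up so compilation is not counted
bytes = @allocated sin(x)  # every MPFR call allocates a new BigFloat result
println("bytes allocated by one sin(::BigFloat) call: ", bytes)
```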

So first of all, it would be nice if you could fix this, or at least add a warning, because I had to spend quite some time finding this issue.

I saw a workaround at the previous post:
https://github.com/JuliaLang/julia/issues/17395#issuecomment-232343762

which does indeed seem to be thread safe.
However, I cannot apply this workaround since I have to work with multiprecision, and when I change Float64 → BigFloat in the code, Julia just crashes.

My question would now be if someone knows another workaround that works with multiprecision as well.

Thanks in advance

Again, that post has nothing to do with BigFloat.

This only suggests that the mpfr ones do not scale well, which also has nothing to do with thread safety.

I know that it is not the BigFloats themselves causing this behavior.
What I was trying to say is that I cannot apply the workaround, since it does not work for BigFloats.

And saying that the mpfr sin / cos "do not scale well" seems to be a bit of an understatement, considering that the problem is not just that they fail to gain linearly.
What I experienced is that they do not really scale at all; they seem to run fastest on 4 threads (for my actual algorithm, the difference between 56 threads and 4 threads is even much bigger than in the sample code above).

I’m saying that you should not mention that thread at all. It’s completely unrelated to your issue. The workaround for a completely independent issue is of course not applicable, so mentioning it will just confuse everyone else who actually has experience with either of the issues.

Well, sure, I’m just saying that what you show is just a scaling issue. No matter how badly they scale, it’s not a threadsafety issue. FWIW, locks usually don’t scale well, and that definitely does not make them not threadsafe.
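To illustrate the distinction (a toy example of my own, unrelated to the MPFR code): a lock-protected counter is perfectly threadsafe, since the count comes out exact no matter how many threads run it, yet the lock serializes every increment, so it cannot scale:

```julia
# Threadsafe but unscalable: the lock forces increments to run one at a time.
function locked_count(n)
    lk = ReentrantLock()
    total = Ref(0)
    Threads.@threads for i in 1:n
        lock(lk) do
            total[] += 1
        end
    end
    return total[]
end
```

`locked_count(10_000)` returns exactly 10_000 with any thread count; the timing just won't improve as threads are added.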

AFAIK thread safety does not specifically mean that the code crashes when multithreaded, but that probably depends on the definition you choose.
Anyway, I changed the title to avoid confusion.

And it might be the case that the original post is about another issue, but the things discussed in it could actually apply to my case as well, and would work if I were not forced to use BigFloats.

Crashing isn’t what I’m talking about either. Getting the right/expected result is, but not performance. See “Thread safety” on Wikipedia. Also note that no one in the issue you linked mentioned thread safety about the original issue (a few off-topic posts about rand did).

Errr, so you are saying that if what you have is a completely different issue, you can use a completely different solution? Sure, that should be obvious, but again, I don’t see why mentioning it helps.
FWIW, the workaround mentioned in that thread is for a glibc bug that has since been fixed. sin and cos are not even the problem in that issue; it could be triggered by an arbitrary number of other functions.

This only suggests that the mpfr ones do not scale well, which also has nothing to do with thread safety.

I wouldn’t say it scales great, but on a computer with 16 physical / 32 total logical cores, the minimum (median) times of:

function test(n = 10^5) # DavidBerghaus used 10^7
    setprecision(1024)
    Threads.@threads for i=1:n
        sin(BigFloat(i))^2+cos(BigFloat(i))^2
    end
end
Threads: Minimum (Median)
1:  2.808 s (2.833 s)
4: 978.811 ms (981.729 ms)
8: 749.010 ms (752.257 ms)
16: 489.371 ms (539.649 ms)
32: 347.542 ms (351.071 ms)

I should look into how work is actually distributed among cores. In earlier tests I saw a pattern when using <= 16 threads that looked like many of the busy logical cores shared physical cores. That may explain the large improvement from 16 → 32 threads.

I only saw an 8x speed improvement instead of the ideal 16x, but performance increased steadily.
Obvious question: how many cores does your computer actually have, @DavidBerghaus ?
Are you testing on a 4-core laptop? You saw a big improvement with 4 threads.

julia> Sys.CPU_CORES
32

How could I test with 56 threads on a 4-core laptop?! :grin:
I am testing it on a cluster with two 14-core Xeon processors, so 56 logical threads in total.

For short runs, like in your test, the scaling seems to work much better for me as well.
Have you also tried letting the code run a bit longer?

Oops, I didn’t realize I couldn’t launch more threads than logical cores the computer had.

Anyway,

julia> @time test(10^7) # 1 thread
289.105270 seconds (443.09 M allocations: 33.089 GiB, 0.35% gc time)

julia> @time test(10^7) # 32 threads
 35.755172 seconds (183.62 M allocations: 9.492 GiB)

So same pattern with 10^7 (and about 100x slower than 10^5).
Threading shouldn’t get slower with more iterations.


Thank you very much for the testing!

Very weird that you get the expected behavior while I do not.
Which OS are you using?

Ubuntu 16.04. My tests were on a 3 day old master.
0.6.4 was slower, taking about 3s / 0.46s with 1 / 32 threads for the 10^5 iterations.

Wow, I did not expect the Julia version to cause performance differences here as well…

I am using Ubuntu 18.04 with Julia 0.6.3.
I will check with Ubuntu 16.04 later and let you know if it makes a difference for me.

@DavidBerghaus, out of curiosity, why does your application require BigFloats? I’ve always wondered what types of applications they are used in.

I am calculating an irrational number to as many digits as possible. For this I have to compute the determinant of a large, ill-conditioned matrix. I have to work with high precision to overcome the ill conditioning (I need a few thousand digits of precision for that).
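As a hypothetical illustration of that effect (the Hilbert matrix here is a textbook ill-conditioned example, not my actual matrix): in Float64, its determinant is swamped by rounding error at sizes where BigFloat still computes it accurately:

```julia
using LinearAlgebra

# Hilbert matrix with element type T; its condition number grows
# exponentially with n, which destroys Float64 accuracy quickly.
hilbert(T, n) = T[1 // (i + j - 1) for i in 1:n, j in 1:n]

setprecision(1024) do
    n = 20  # cond(H_20) far exceeds 1/eps(Float64)
    println("Float64:  ", det(hilbert(Float64, n)))   # no correct digits left
    println("BigFloat: ", det(hilbert(BigFloat, n)))  # tiny, but accurate
end
```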


I did the same tests with Ubuntu 16.04 and Julia 0.6.3 now.
Unfortunately the results are pretty much the same.
I also noticed that the CPU usage doesn't stay constant at 5600% and often drops down to 3300%.
Have you experienced this behavior as well?

That piece of code actually crashes my Julia 0.7 beta on Windows.
Any ideas?

How much memory does each thread get?
I have 18 cores and 128 GB of RAM on this machine.
It is unclear to me why it complains about memory.

I just checked: it also crashes with 2 threads.

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-beta.0 (2018-06-24 01:32 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-w64-mingw32

julia> function test(n = 10^5)
           setprecision(1024)
           Threads.@threads for i=1:n
               sin(BigFloat(i))^2+cos(BigFloat(i))^2
           end
       end
test (generic function with 2 methods)

julia> @show Threads.nthreads()
Threads.nthreads() = 16
16

julia> @time test()

Error thrown in threaded loop on thread 5: OutOfMemoryError()
Error thrown in threaded loop on thread 10: OutOfMemoryError()
Error thrown in threaded loop on thread 7: OutOfMemoryError()
Error thrown in threaded loop on thread 15: OutOfMemoryError()
Error thrown in threaded loop on thread 11: OutOfMemoryError()
Error thrown in threaded loop on thread 2: OutOfMemoryError()
Error thrown in threaded loop on thread 14: OutOfMemoryError()
Error thrown in threaded loop on thread 3: OutOfMemoryError()
Error thrown in threaded loop on thread 12: OutOfMemoryError()
Error thrown in threaded loop on thread 6: OutOfMemoryError()
Error thrown in threaded loop on thread 4: OutOfMemoryError()
Error thrown in threaded loop on thread 8: OutOfMemoryError()
Error thrown in threaded loop on thread 13: OutOfMemoryError()
Error thrown in threaded loop on thread 9: OutOfMemoryError()
C:\julia-0.7.x\bin>

Wow, that is weird; I do not think this function should require a lot of memory.
Have you tried changing the precision to 64 bits?
And does the code work on an older version of Julia for you?

It does work with setprecision(64) on Julia 0.6.3.
With setprecision(512) and 32 threads, Julia crashes without any message though.

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.3 (2018-05-28 20:20 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-w64-mingw32

julia> Threads.nthreads()
32

julia> function test(n = 10^5)
           setprecision(64)
           Threads.@threads for i=1:n
               sin(BigFloat(i))^2+cos(BigFloat(i))^2
           end
       end
test (generic function with 2 methods)

julia> test()

julia> @time test()
  0.625854 seconds (1.56 M allocations: 15.338 MiB, 27.62% gc time)

julia> function test(n = 10^5)
           setprecision(512)
           Threads.@threads for i=1:n
               sin(BigFloat(i))^2+cos(BigFloat(i))^2
           end
       end
test (generic function with 2 methods)

julia> @time test()

C:\julia-0.6.x\bin>


How does it scale when you preallocate a vector to store the results in, and a vector with the inputs to the loop?
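In case it helps, a sketch of that preallocated variant (my naming, in current Julia syntax); note that the per-iteration sin/cos calls still allocate fresh BigFloats internally:

```julia
function test_prealloc(n = 10^5)
    setprecision(1024)
    xs  = BigFloat.(1:n)               # inputs built up front
    out = Vector{BigFloat}(undef, n)   # one result slot per iteration
    Threads.@threads for i in 1:n
        out[i] = sin(xs[i])^2 + cos(xs[i])^2
    end
    return out
end
```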

I’m on Ubuntu 16.04 because of OpenCL graphics drivers. I’ll upgrade when AMD/ROCm supports 18.04 (I think they’ll support all Linux distributions with kernel 4.18), and do-release-upgrade works (I believe that’ll be in a couple of weeks, with the release of 18.04.1).
But if you want to rule out variables: I’m on kernel 4.13.0-45-generic, and Julia was built from source.

@bernhard, the loop isn’t even storing anything, so it should hardly require more than nthreads() times more memory than

julia> setprecision(1024)

julia> @time sin(BigFloat(1))^2+cos(BigFloat(1))^2;
  0.000160 seconds (116 allocations: 9.186 KiB)

julia> @time sin(BigFloat(1))^2+cos(BigFloat(1))^2;
  0.000131 seconds (37 allocations: 3.250 KiB)

does. Actual peak should depend on how often gc triggers.
What happens if you try calling the gc manually?

function test(n = 10^5)
    setprecision(512)
    Threads.@threads for i=1:n
        sin(BigFloat(i))^2+cos(BigFloat(i))^2
        GC.gc()
    end
end

I’d be surprised if that fixes anything, but it would be nice to confirm that it isn’t somehow actually running out of memory. Although it would take a much bigger n than the small values you used to run out of 128 GB…