I previously created a post where I described issues of my algorithm running on multi threading.
I think I finally found the issue:
I am using the default sin / cos functions and this post: https://github.com/JuliaLang/julia/issues/17395
led me to the idea that the sin / cos could cause the issue.
When I temporarly removed the sin / cos functions (but they are definitely required for the algorithm), the algorithm performed well on multi threading.
Thats why I did a few testings with the sample code
function test()
setprecision(1024)
Threads.@threads for i=1:10000000
sin(BigFloat(i))^2+cos(BigFloat(i))^2
end
end
It confirmed that this function is not threadsafe:
1 Threads: 347.428779 seconds (552.06 M allocations: 48.274 GiB, 0.93% gc time)
4 Threads: 118.963835 seconds (525.86 M allocations: 43.746 GiB, 4.19% gc time)
28 Threads: 137.302679 seconds (211.11 M allocations: 10.939 GiB, 8.41% gc time)
40 Threads: 125.110352 seconds (201.09 M allocations: 10.696 GiB, 10.58% gc time)
56 Threads: 119.466149 seconds (195.01 M allocations: 10.744 GiB, 11.99% gc time)
So at first, it would be nice if you could fix this, or at least give a warning because I had to spent quite some time finding this issue.
Which does indeed seem to be thread safe.
However, I cannot apply this workaround since I have to work with multiprecision and when I change Float64 → BigFloat at the code, Julia just crashes.
My question would now be if someone knows another workaround that works with multiprecision as well.
I know that not the BigFloats are causing this behavior.
What I was trying to say is that I cannot apply the workaround since it does not work for BigFloats.
And saying that the mpfr sin / cos do not scale well seems to be a bit of an understatement considiring that it is not that they just not gain linearly.
What I experienced is that they do not really scale at all, and they seem to be running fastest on 4 threads (for my actual algorithm the difference between 56 threads and 4 threads is even much bigger than on the sample code above).
I’m saying that you should not mentioned that thread at all. It’s completely unrelated to your issue. The workaround for a completely independent issue is of course not applicable so mentioned it at all will just confuse everyone else that will actually have experience with either of the issues.
Well, sure, I’m just saying that what you show is just a scaling issue. No matter how bad they scale, it’s not a threadsafety issue. FWIW, locks ususally don’t scale well and that will definately not make them not threadsafe.
AFAIK threadsafety does not particularly mean that the code crashes when being multithreaded, but that probably depends on the definition that you choose.
Anyways, I changed the title to avoid confusions.
And it might be the case that the original post is for another issue, but the things discussed in it can actually be applied for my case as well and would work if I would not be forced to use BigFloats
Crashing isn’t what I’m talking about either. Getting the right/expected result is, but not performance. Thread safety - Wikipedia. Also note that no one in the issue you linked mentioned thread safety about the original issue (a few off topic posts about rand did).
Errr, so you are saying that if what you have is a completely different issue you can use a completely different solution? Sure, that should be obvious, but again, I don’t see why mentioning that will help.
FWIW, the workaround mentioned in that thread is for a glibc bug that has since be fixed. sin and cos are not even the problem at all in that issue, it could be triggered by an arbitrary number of other functions.
This only suggest that the mpfr ones do not scale well, also has nothing to so with thread safety.
I wouldn’t say it scales great, but on a computer with 16 physical / 32 total logical cores, the minimum (median) times of:
function test(n = 10^5) # DavidBerghaus used 10^7
setprecision(1024)
Threads.@threads for i=1:n
sin(BigFloat(i))^2+cos(BigFloat(i))^2
end
end
Threads: Minimum (Median)
1: 2.808 s (2.833 s)
4: 978.811 ms (981.729 ms)
8: 749.010 ms (752.257 ms)
16: 489.371 ms (539.649 ms)
32: 347.542 ms (351.071 ms)
I should look into how work is actually distributed among cores. In earlier tests I saw a pattern when using <= 16 threads that looked like many of the busy logical cores shared physical cores. That may explain the large improvement from 16 → 32 threads.
I only saw an 8x speed improvement instead of the ideal 16x, but performance increased steadily.
Obvious question: how many cores does your computer actually have, @DavidBerghaus ?
Are you testing on a 4-core laptop? You saw a big improvement with 4 threads.
I am calculating an irrational number with as much digits as possible. For this I have to calculate the determinant of a large ill conditioned matrix. I have to work with high precision to overcome the ill conditioning (I need to work with a few thousand digits precision for that)
I did the same testings with Ubuntu 16.04 and Julia 0.6.3 now.
Unfortunately the results are pretty much the same.
I also noticed that the CPU usage doesnt stay constant at 5600% and often drops down to 3300%.
Have you experienced this behavior as well?
Wow, that is weird, I do not think that this function should require lots of memory.
Have you tried to change the precision to 64 bits?
And does the code work on an older version of julia for you?
I’m on Ubuntu 16.04 because of OpenCL graphics drivers. I’ll upgrade when AMD/ROCM supports 18.04 (I think they’ll support all Linux distributions with kernel 4.18), and do-release-upgrade works (I believe that’ll be in a couple weeks, with the release of 18.04.1).
But if you want to strike down variables: I’m on kernel 4.13.0-45-generic, and Julia was built from source.
@bernhard, the loop isn’t even storing anything, so it should hardly require more than nthreads() times more memory than
does. Actual peak should depend on how often gc triggers.
What happens if you try calling the gc manually?
function test(n = 10^5)
setprecision(512)
Threads.@threads for i=1:n
sin(BigFloat(i))^2+cos(BigFloat(i))^2
GC.gc()
end
end
I’d be surprised if that fixes anything, but it’d be nice to confirm that it isn’t somehow actually running out of memory. Although it’d take a much bigger n than some of the small values you used to run out of 128 gigs…