Julia reports it’s using 8 cores, but I can’t detect any improvement from using Threads.@threads. @inbounds doesn’t seem to help either.
I guess @simd is not appropriate here because swapping the two inner lines won’t produce the same result.
I’m far from a threading expert but: the code sample you provided doesn’t look thread-safe; the loop body reads from and modifies the variable S, so different iterations of the loop may simultaneously operate on S.
Maybe you could explain a little what the code is supposed to do? It also looks like the function doesn’t return anything.
All that said, I think you can speed things up significantly by allocating less. Consider two similar functions:
function f_allocate(N)
    s = 0
    for i in 1:N
        range = 1:i
        s += sum(range .* range)   # allocates a new vector on every iteration
    end
    return s
end

function f_fast(N)
    s = 0
    for i in 1:N
        s += sum(j^2 for j in 1:i)   # generator: no allocation
    end
    return s
end
They’re the same except that f_allocate materializes the full vector range .* range. And the fast (non-allocating) version is 100 times faster!
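To check this on your own machine, here is a minimal benchmark sketch (assuming the BenchmarkTools package is installed and the two functions above are defined):

using BenchmarkTools
@btime f_allocate(1000)  # allocates a fresh vector on every iteration
@btime f_fast(1000)      # allocation-free generator version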
Nothing important, it’s just a test.
It creates vectors of increasing size, sums the squares of their elements, and then adds everything together.
I think it’s thread-safe because the sum is commutative.
PS:
I forgot to say “if possible use the same algorithm”, because the idea is to compare with R using the same algorithm. Though maybe you are using the same algorithm; I need to think about it.
(I edited my post above with a performance suggestion.)
I am skeptical of that thread safety. Here’s a test:
function f_thread(n)
    s = 0
    Threads.@threads for i in 1:n
        s += sum(j*j for j in 1:i)
    end
    return s
end
julia> f_thread(1000)
39967308241
julia> f_thread(1000)
4114724460
julia> f_thread(1000)
31381948695
That f_thread function has the same pattern of modifying a variable outside the threaded loop as your f2, and despite having no randomness, its result is nondeterministic - because of race conditions.
You are describing a property of the operation, and the correct conclusion is that the operation is parallelizable (I think you also need associativity, by the way).
Thread safety is a property of the code, and your code is not thread-safe.
For thread safety, each thread needs to accumulate into a different target. In order not to make your CPU weep, the targets need to live in different cache lines. E.g.:
function f_thread(n)
    # slots 8 Ints (64 bytes) apart: one cache line per thread on typical CPUs
    svals = zeros(Int, 8*(1 + Threads.nthreads()))
    Threads.@threads for i in 1:n
        svals[8*(1 + Threads.threadid())] += sum(j*j for j in 1:i)
    end
    return sum(svals)
end
The 1+Threads.threadid() is probably overly paranoid. It is to handle the case that the allocator decides to put the jl_array struct and the first entry of svals into the same cache line, and the compiler also fails to hoist the load of pointer(svals) out of the loop. In that case, the updates of the first thread might dirty the cache entry for pointer(svals) on the other cores.
Further, your example is no good as a benchmark:
julia> function f(N)
           s = 0
           for i = 1:N
               for j = 1:i
                   s += j*j
               end
           end
           return s
       end;
julia> @btime f(1000)
3.067 μs (0 allocations: 0 bytes)
83667083500
julia> @btime f(10000)
30.536 μs (0 allocations: 0 bytes)
833666708335000
The result is in the wrong complexity class (you want to see a quadratic number of operations, not linear!). What happens is that LLVM manages to optimize the inner loop into a closed-form expression: sum(j^p for j=1:N) is actually a polynomial in N; LLVM figures this out and then computes it in O(1).
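For instance, the inner sum has the well-known closed form 1^2 + 2^2 + … + i^2 = i(i+1)(2i+1)/6. A quick sketch of what LLVM effectively does (the helper name inner_closed is my own):

# Closed form of the inner loop: 1^2 + 2^2 + ... + i^2 = i(i+1)(2i+1)/6
inner_closed(i) = i * (i + 1) * (2i + 1) ÷ 6   # exact integer division

inner_closed(1000) == sum(j*j for j in 1:1000)  # true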
IMO this is the wrong end from which to approach multithreading. For several reasons, I’d rather recommend multithreading at as coarse a level as possible, not in your inner loops.
Nonetheless, let’s look into the thread safety of your code. This operation:
S += sum(se .* se)
is not thread safe, since the way this operation works internally is by doing:
1. read current value of S
2. add to this value
3. write new value back to S
Now, if you have two threads doing this in parallel, these operations might interleave to produce the wrong result:
1. thread 1 reads current value of S
2. thread 2 reads current value of S
3. thread 1 adds to its version of S
4. thread 2 adds to its version of S
5. thread 1 writes new value to S
6. thread 2 writes new value to S, overwriting what thread 1 just wrote
This is commonly referred to as a race condition. You can see it in action in the example that @evanfields posted above:
function f_thread(n)
    s = 0
    Threads.@threads for i in 1:n
        s += sum(j*j for j in 1:i)
    end
    return s
end
julia> f_thread(1000)
43483971876
julia> f_thread(1000)
80136425181
julia> f_thread(1000)
57182835925
One solution to this problem is to use atomic operations:
function f_thread_atomic(n)
    s = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:n
        Threads.atomic_add!(s, sum(j*j for j in 1:i))
    end
    return s[]
end
julia> f_thread_atomic(1000)
83667083500
julia> f_thread_atomic(1000)
83667083500
Comparing to the analytical solution, we can see that this yields the correct result.
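For reference, a minimal sketch of that analytical check (the helper name analytic_total is my own), using the standard closed forms for the sums of i, i^2, and i^3:

# Analytical value of sum_{i=1}^{n} sum_{j=1}^{i} j^2
# = sum_{i=1}^{n} (2i^3 + 3i^2 + i)/6 = (2*S3 + 3*S2 + S1)/6,
# where S1 = n(n+1)/2, S2 = n(n+1)(2n+1)/6, S3 = S1^2
function analytic_total(n)
    S1 = n * (n + 1) ÷ 2
    S2 = n * (n + 1) * (2n + 1) ÷ 6
    S3 = S1^2
    return (2S3 + 3S2 + S1) ÷ 6
end

analytic_total(1000)  # 83667083500, matching f_thread_atomic(1000)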
I’ve tried the code with Threads.Atomic and it’s slower. I guess Julia needs a lot of resources to manage the threads.
What kinds of things will improve with multithreading, and what kinds won’t?
Calculations which can be separated into parts that are each significant enough to justify the overhead are more likely to benefit from parallel processing. Your example above is exactly the opposite: small, cheap calculations (+) get swamped by the overhead.
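As an illustration of coarser-grained threading (the strided partitioning and the name f_chunked are my own, not code from this thread), each task can accumulate a large slice of the work locally and combine results only once at the end:

function f_chunked(n)
    nt = Threads.nthreads()
    partials = zeros(Int, nt)
    Threads.@threads for t in 1:nt
        local_s = 0
        for i in t:nt:n              # each task takes every nt-th value of i
            local_s += sum(j*j for j in 1:i)
        end
        partials[t] = local_s        # a single write per task, distinct slots
    end
    return sum(partials)
end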
That’s fine. But then why are you asking for help to speed it up? If you just want to test R vs Julia running identical code, then I guess you have already achieved that.
Speeding up code typically consists of (off the top of my head):
removing type instabilities (see the sketch after this list)
avoiding unnecessary operations
avoiding unnecessary allocations
changing the underlying algorithm
adding performance annotations
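To make the first point concrete, here is a hypothetical example (not from this thread) of a type instability and its fix:

# `s` starts as an Int, but `/` returns a Float64, so `s` changes type
# mid-loop and the compiler cannot infer one concrete type for it.
function unstable(N)
    s = 0
    for i in 1:N
        s += i / 2
    end
    return s
end

# Starting the accumulator at its final type removes the instability.
function stable(N)
    s = 0.0
    for i in 1:N
        s += i / 2
    end
    return s
end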
As I understand it, you are exclusively interested in the last point, is that correct? Unless you allow any of the first four points, multithreading won’t work, @simd will do nothing, and there is nothing to @inbounds. @fastmath doesn’t apply to integer math, so what’s left, then?
One thing, though. Does your code even run? The type assertion 0::Int32 just gives me an error. Are you running on a 32-bit platform? Anyway, I suggest that you remove the type assertion, as well as the ::Int32 in the function signature.
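On a 64-bit machine, for example, the literal 0 is an Int64, so the assertion fails (the exact error wording varies by Julia version):

julia> 0::Int32
ERROR: TypeError: in typeassert, expected Int32, got a value of type Int64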
I meant using in-place functions that implement the same algorithm but run faster, for example by populating existing vectors and arrays instead of allocating new ones. Or adding macros such as @simd, @threads, or @inbounds.
Does that mean that it is OK to change sum(se .* se) into sum(abs2, 1:i) or sum(j^2 for j in 1:i)? Because those two (especially the last) will give an enormous speedup, much more than multithreading ever could, even if it were possible.
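A minimal comparison sketch, assuming BenchmarkTools is installed (the vector se here is a stand-in built from 1:1000):

using BenchmarkTools
se = collect(1:1000)
@btime sum($se .* $se)           # allocates a temporary vector
@btime sum(abs2, 1:1000)         # no allocation
@btime sum(j^2 for j in 1:1000)  # no allocation; LLVM may even find a closed form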