I wanted to know whether Julia can keep up with Elixir/OTP lightweight processes, so I made a comparison by chaining a large number of them, following the example from “Programming Elixir”.
My Julia code is:
# return a closure that takes a value from its own channel,
# increments it and passes it on to the next channel
function counter(next::Channel{Int})
    return ch -> put!(next, take!(ch)+1)
end
# build a chain of n channels, each served by a counter task,
# then put 0 into the head of the chain and take the result from start
function create_processes(start::Channel{Int}, n::Int; spawn=false)
    next = start
    for i in 1:n
        next = Channel{Int}(counter(next), spawn=spawn)
    end
    put!(next, 0)
    take!(start)
end
const start = Channel{Int}()
I create a counter returning a closure over a channel. Then I create a chain of n channels, each with the closure as a task listening to its channel. I put 0 into the last created channel and take back the accumulated value from the channel I started with. It is a strictly sequential operation, running over multiple threads when spawn=true.
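One way to run and time it might look like this (a sketch: the chain length of 10_000 and the use of BenchmarkTools are illustrative choices):

using BenchmarkTools

@btime create_processes(start, 10_000)              # all counter tasks on the current thread
@btime create_processes(start, 10_000, spawn=true)  # counter tasks spawned across threads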
How sure are you that the Elixir code you posted is a reasonable comparison? It looks like the example returns an interpolated string. I also believe Elixir has a JIT compiler, so you’ll want to ensure that you aren’t timing JIT overhead as well.
The string is returned only at the end of the chain. Elixir compiles to byte code, and this creates a noticeable startup time, which may be responsible for Elixir taking longer in the first two measurements. After that, Elixir/OTP and Julia scale nearly in parallel when spawning many tasks and sending and receiving messages. It was a positive surprise for me that Julia keeps up so well with OTP at those scales.
Interesting, your results are more like what I expected. Surprising, though, that communication across threads doesn’t show any overhead.
I investigated why, on my machine (MacBook Pro 2013, 4 cores), it takes so much longer with spawn=false. My first guess was that an application gone rogue (like Atom or Safari, sorry guys) was keeping my first thread busy. So I shut down and restarted my machine, but the problem persisted, with Activity Monitor showing nothing unusual. So I had to assume that macOS keeps part of its kernel in the first core’s cache. Then I wrote a function allowing me to start my chain on other threads as well:
using Base.Threads # for @threads and nthreads

# run create_processes inside a task pinned to thread id
function startonthread(id::Int, start::Channel{Int}, n::Int; spawn=false)
    t = Task(nothing)    # placeholder, replaced below
    @threads for i in 1:nthreads()
        if i == id
            t = @async create_processes(start, n, spawn=spawn)
        end
    end
    fetch(t)
end
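To compare the threads against each other, one can call this for every thread id, e.g. (a sketch; the chain length is illustrative):

using BenchmarkTools

for id in 1:nthreads()
    println("thread ", id, ":")
    @btime startonthread($id, start, 10_000)
end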
Isn’t that stunning? Does this mean that scientific computing on such machines is much faster on threads other than the first one?
edited: Please test this on other Macs and on Windows machines too! Thank you @Pbellive for showing that the problem doesn’t show up on Linux.
You have no idea how many different systems Julia gets tested and benchmarked on. Please file an issue instead of taking the “shame on you for not testing on my system” attitude.
I think this is more a call to action for other users to test this than a complaint, since that behaviour is indeed surprising. I’m currently on my phone, but I’d be happy to test this on my Windows machine later today.
Sorry, this did not come across as I intended, and I didn’t mean to suggest that Julia is not tested enough. I only wanted – and still want – to see whether this is a common problem.
Look, @tisztamo’s result is way better than mine, but it still takes 3.863/(2.331+2.372+2.293)*3 = 1.656 times longer on thread 1 than on his other ones. That seems significant to me. Maybe @tisztamo, you can repeat it with @btime startonthread(x, start, 100_000) to make sure this is not a statistical fluke.
Are we that far along yet? I didn’t think so. Where should the issue go?
Discourse is a good place to work out what’s happening, but once enough digging is done, file an issue on the JuliaLang/julia GitHub repository to make sure that those who develop and maintain the task/threading system know about it and can look into it.
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu) # This is a lie - I'm on windows using WSL with a locally compiled julia
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
It took some time before I could test this, but the effect is really pronounced if my laptop is in power-saving mode / not connected to a power outlet:
julia> @btime startonthread(1, start, 1000)
34.281 ms (33029 allocations: 1.82 MiB)
1000
julia> @btime startonthread(2, start, 1000)
5.699 ms (33030 allocations: 1.82 MiB)
1000
# to make sure the speedup was not just because of caching
julia> @btime startonthread(1, start, 1000)
34.130 ms (33029 allocations: 1.82 MiB)
1000
julia> @btime startonthread(3, start, 1000)
5.626 ms (33030 allocations: 1.82 MiB)
1000
julia> @btime startonthread(4, start, 1000)
5.693 ms (33030 allocations: 1.82 MiB)
1000
In contrast, when in maximum-performance-give-me-all-you-got mode:
I don’t think this is a Julia-specific issue, though it could maybe be “solved” using ThreadPools.jl’s @bthreads to offload work onto anything but thread 1.
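If I read ThreadPools.jl correctly, that would look roughly like this (a sketch, assuming the package is installed and Julia was started with several threads):

using ThreadPools

# @bthreads works like Threads.@threads, but schedules the iterations
# on threads 2:nthreads() only, keeping thread 1 free
@bthreads for i in 1:8
    println("iteration ", i, " ran on thread ", Threads.threadid())
end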
I don’t think so either. I ask myself how we can exploit this: is there a free lunch to be had by transferring work to less-used threads? My first trials show that it is not so easy. Copying data from one thread to another is costly, and for simple mathematical calculations that don’t involve much memory there is no difference between threads, e.g.:
# sum an alternating series for pi (pure arithmetic, no memory traffic)
function machin_series(n::Int)
    qpi = 0.0    # start with a Float64 to keep the accumulation type stable
    for i in 1:n
        qpi += (-1)^(i+1)/(2i-1)
    end
    qpi*4
end
# variant taking an arbitrary function to run as a task on thread id
function startonthread(id::Int, f::F) where {F<:Function}
    t = Task(nothing)
    @threads for i in 1:nthreads()
        if i == id
            t = @async f()
        end
    end
    fetch(t)
end
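For example, to time the pure-compute workload on each thread (the series length of 1_000_000 is an illustrative choice):

using BenchmarkTools

for id in 1:nthreads()
    println("thread ", id, ":")
    @btime startonthread($id, () -> machin_series(1_000_000))
end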
However, do note that the migration from single-threaded concurrency with @async to parallel concurrency with Threads.@spawn is still a work in progress, especially across the larger package ecosystem. Therefore, a lot of stuff (anything started with @async) is going to run on thread 1, and you might just be seeing competition with other tasks doing background work. You could try avoiding some of that by running julia --startup-file=no do_benchmark.jl, with all the benchmark code put into do_benchmark.jl.
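A minimal illustration of that difference: @async tasks are sticky and run on the thread that created them (thread 1 in the REPL), while Threads.@spawn lets the scheduler place the task on any available thread.

using Base.Threads

t1 = @async threadid()          # sticky: runs on the spawning thread
t2 = Threads.@spawn threadid()  # may run on any available thread
@show fetch(t1) fetch(t2)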
BTW if it helps — I read your “Come on guys!” as a friendly call out to investigate; “hey look at this weird/interesting thing I found”
Not sure if it’s related, but at smaller chain lengths top reports CPU usage of up to 340% on the julia process during benchmarking, while at larger lengths it goes lower; at 100_000 it is around 100%. It seems to be independent of the selected thread, and I can see it also when running startonthread in a for loop.