Lightweight tasks, Julia vs Elixir/OTP

I wanted to know whether Julia can keep up with Elixir/OTP lightweight tasks, so I made a comparison chaining a lot of them, following the example from “Programming Elixir”.

My Julia code is:

function counter(next::Channel{Int})
    # closure over `next`: take a value from its own channel,
    # increment it, and forward it toward the start of the chain
    return ch -> put!(next, take!(ch)+1)
end

function create_processes(start::Channel{Int}, n::Int; spawn=false)
    next = start
    for i in 1:n
        # each new channel is bound to a task running the counter closure
        next = Channel{Int}(counter(next), spawn=spawn)
    end

    put!(next, 0)   # inject 0 at the far end of the chain

    take!(start)    # comes back incremented once per link
end

const start = Channel{Int}()

I create a counter returning a closure over a channel. Then I create a chain of n channels, each with the closure as a task listening on it. I put 0 into the last created channel and take the accumulated value back from the channel I started with. It is a strictly sequential operation, potentially spread over multiple threads.
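To see the mechanics in miniature, here is a self-contained sketch of the same chain with n = 3 (same logic as above, just wrapped in a function to avoid global-scope issues; the `chain` name is mine):

```julia
# Sketch of the chain with a small n. Each channel is bound to a task
# running the counter closure: it blocks on take! from its own channel,
# adds 1, and forwards the result toward `start`.
counter(next::Channel{Int}) = ch -> put!(next, take!(ch) + 1)

function chain(n::Int)
    start = Channel{Int}()
    next = start
    for _ in 1:n
        next = Channel{Int}(counter(next))  # one task per link
    end
    put!(next, 0)   # inject 0 at the far end
    take!(start)    # incremented once per link on the way back
end

chain(3)  # == 3
```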

Then:

julia> using BenchmarkTools

julia> @btime create_processes(start, 10, spawn=true)
  24.595 μs (318 allocations: 17.84 KiB)
10

… The results were as follows on my 2013 MacBook Pro with 4 cores:

[results chart: Julia vs Elixir/OTP timings for increasing chain lengths]

Julia scales better, and with fewer tasks it is much faster than Elixir. What I don’t understand is why it takes much longer if I set spawn=false:

julia> @btime create_processes(start, 1000, spawn=true)
  1.955 ms (33319 allocations: 1.77 MiB)
1000

julia> @btime create_processes(start, 1000)
  36.067 ms (32980 allocations: 1.77 MiB)
1000
5 Likes

How sure are you that the Elixir code you posted is a reasonable comparison? It looks like the example returns an interpolated string. I also believe Elixir has a JIT compiler, so you’ll want to ensure that you aren’t timing JIT overhead as well.

Re the spawn = true bit: interesting. I cannot reproduce (julia 1.3.1, linux). Running julia single threaded I get:

julia> @btime create_processes(start, 1000)
  2.777 ms (32980 allocations: 1.82 MiB)
1000

julia> @btime create_processes(start, 1000, spawn=true)
  2.691 ms (31981 allocations: 1.80 MiB)

Then running julia with 10 threads (# of physical cores on my desktop) I get:

julia> @btime create_processes(start, 1000)
  2.749 ms (32980 allocations: 1.82 MiB)
1000

julia> @btime create_processes(start, 1000, spawn=true)
  2.098 ms (33768 allocations: 1.83 MiB)
1000

The string is returned only at the end of the chain. Elixir compiles to bytecode, which adds noticeable startup time and may explain Elixir taking longer in the first two measurements. Beyond that, Elixir/OTP and Julia scale nearly in parallel when spawning many tasks and sending and receiving messages. It was a positive surprise for me that Julia keeps up so well with OTP at those scales.

Interesting, your results are more what I expected. It is notable, though, that communication across threads doesn’t show any overhead.

2 Likes

Very cool. This would make a nice blog post

2 Likes

Now an astounding result, at least for me.

I investigated why on my machine (MacBook Pro 2013, 4 cores) it takes so much longer with spawn=false. My first guess was that an application gone crazy (like Atom or Safari, sorry guys :wink: ) was keeping my first thread busy. So I restarted my machine, but the problem persisted, with Activity Monitor showing nothing strange. So I had to assume that macOS keeps part of its kernel work within the first core’s cache. Then I wrote a function allowing me to start my chain on other threads as well:

using Base.Threads  # for @threads and nthreads

function startonthread(id::Int, start::Channel{Int}, n::Int; spawn=false)
    t = Task(() -> nothing)   # placeholder task, replaced below
    @threads for i in 1:nthreads()
        if i == id
            # start the chain from whichever thread runs iteration i
            t = @async create_processes(start, n, spawn=spawn)
        end
    end
    fetch(t)
end

and …

julia> using BenchmarkTools

julia> @btime startonthread(1, start, 1000)
  33.660 ms (33025 allocations: 1.77 MiB)
1000

julia> @btime startonthread(2, start, 1000)
  2.564 ms (33026 allocations: 1.77 MiB)
1000

julia> @btime startonthread(3, start, 1000)
  2.558 ms (33026 allocations: 1.77 MiB)
1000

julia> @btime startonthread(4, start, 1000)
  2.585 ms (33026 allocations: 1.77 MiB)
1000

Isn’t that stunning? Does this mean that scientific computing on such machines is much faster on threads other than the first one?

edited: Please test this on other Macs as well, and also on Windows machines! Thank you @Pbellive for showing that the problem doesn’t show up on Linux.

5 Likes

Interesting, I was not able to reproduce this on my MacBook Pro (13" 2019, OSX 10.14.6 (18G3020), julia 1.3.1):

julia> @btime startonthread(1, start, 1000)
  3.863 ms (33025 allocations: 1.77 MiB)
1000

julia> @btime startonthread(2, start, 1000)
  2.331 ms (33026 allocations: 1.77 MiB)
1000

julia> @btime startonthread(3, start, 1000)
  2.372 ms (33025 allocations: 1.77 MiB)
1000

julia> @btime startonthread(4, start, 1000)
  2.293 ms (33023 allocations: 1.77 MiB)
1000
1 Like

Thanks for the deep analysis.

You have no idea how many different systems Julia gets tested and benchmarked on. Please file an issue instead of taking the “shame on you for not testing on my system” attitude.

1 Like

I think this is more a call to action for other users to test this than a complaint, since that behaviour is indeed surprising. I’m currently on my phone, but I’d be happy to test this on my Windows machine later today.

This is on a Windows 10 Desktop (Intel® Core™ i5-6600 CPU @ 3.30GHz) with Julia 1.3.1:

julia> @btime startonthread(1, start, 1000)
  3.330 ms (33525 allocations: 3.49 MiB)
1000

julia> @btime startonthread(2, start, 1000)
  2.508 ms (33521 allocations: 3.49 MiB)
1000

:cold_sweat: Sorry, this did not come across as I intended, and I didn’t mean to suggest that Julia is not tested enough. I only wanted, and still want, to see whether this is a common problem.

Look, @tisztamo’s result is way better than mine, but it still takes 3.863/(2.331+2.372+2.293)*3 = 1.656 times longer on thread 1 than on his other ones. That seems significant to me. Maybe @tisztamo, you can repeat with @btime startonthread(x, start, 100_000) to check that this is not a statistical fluke.

Are we that far yet? I don’t think so. Where should the issue go?

Discourse is a good place to work out what’s happening but once enough digging is done, file an issue at https://github.com/JuliaLang/julia/issues/new to make sure that those who develop and maintain the task/threading system know and can look at it.

1 Like

Win10 Enterprise, i7-8550U @ 1.80 GHz

@btime startonthread(1, start, 1000)
  4.544 ms (33556 allocations: 3.55 MiB)
@btime startonthread(2, start, 1000)
  2.460 ms (33556 allocations: 3.54 MiB)
@btime startonthread(3, start, 1000)
  2.539 ms (33556 allocations: 3.54 MiB)
@btime startonthread(4, start, 1000)
  2.441 ms (33556 allocations: 3.54 MiB)

I opened an issue as you suggested and cross-linked this thread and my repo there, so I hope it is enough to work with.

I think it may be helpful to see some further results, either here or under the issue.

Interesting stuff though! Thank you very much!

3 Likes
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu) # This is a lie - I'm on windows using WSL with a locally compiled julia
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

It took some time before I could test this, but the effect is really pronounced if my laptop is in power-saving mode / not connected to a power outlet:

julia> @btime startonthread(1, start, 1000)
  34.281 ms (33029 allocations: 1.82 MiB)
1000

julia> @btime startonthread(2, start, 1000)
  5.699 ms (33030 allocations: 1.82 MiB)
1000

# to make sure the speedup was not just because of caching
julia> @btime startonthread(1, start, 1000)
  34.130 ms (33029 allocations: 1.82 MiB)
1000

julia> @btime startonthread(3, start, 1000)
  5.626 ms (33030 allocations: 1.82 MiB)
1000

julia> @btime startonthread(4, start, 1000)
  5.693 ms (33030 allocations: 1.82 MiB)
1000

In contrast, when in maximum-performance-give-me-all-you-got mode:

julia> @btime startonthread(1, start, 1000)
  14.043 ms (33029 allocations: 1.82 MiB)
1000

julia> @btime startonthread(2, start, 1000)
  3.437 ms (33030 allocations: 1.82 MiB)
1000

julia> @btime startonthread(3, start, 1000)
  3.993 ms (33030 allocations: 1.82 MiB)
1000

julia> @btime startonthread(4, start, 1000)
  3.451 ms (33030 allocations: 1.82 MiB)
1000

I don’t think this is a Julia-specific issue, though it could maybe be “solved” using ThreadPools.jl’s @bthreads to offload work onto anything but thread 1.
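For the record, a sketch of what that could look like (assuming ThreadPools.jl is installed and Julia is started with more than one thread; untested, just to illustrate the idea):

```julia
# Hypothetical sketch (untested): ThreadPools.jl's @bthreads behaves like
# Threads.@threads, but schedules iterations only on the non-primary
# threads, keeping thread 1 free for @async tasks and the scheduler.
# Requires `using Pkg; Pkg.add("ThreadPools")` and JULIA_NUM_THREADS > 1.
using ThreadPools

@bthreads for i in 1:8
    # heavy work goes here; report where each iteration actually ran
    println("iteration $i ran on thread ", Threads.threadid())
end
```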

2 Likes

I don’t think so either. I ask myself how we can exploit it. Is there a free lunch to be had by transferring work to less-used threads? My first trials show that it is not so easy: copying data from one thread to another is costly, and for simple mathematical calculations that don’t involve much memory there is no difference between threads, e.g.:

function machin_series(n::Int)   # actually the Leibniz series for π/4
    qpi = 0.0                    # start from a Float64 for type stability
    for i in 1:n
        qpi += (-1)^(i+1)/(2i-1)
    end
    qpi*4
end

function startonthread(id::Int, f::F) where {F<:Function}
    t = Task(() -> nothing)   # placeholder task, replaced below
    @threads for i in 1:nthreads()
        if i == id
            t = @async f()    # run f on whichever thread hits iteration i
        end
    end
    fetch(t)
end

results:

julia> @btime startonthread(1, ()->machin_series(1_000_000))
  56.837 ms (43 allocations: 4.20 KiB)
3.1415916535897743

julia> @btime startonthread(2, ()->machin_series(1_000_000))
  56.511 ms (45 allocations: 4.23 KiB)
3.1415916535897743

julia> @btime startonthread(3, ()->machin_series(1_000_000))
  57.028 ms (45 allocations: 4.23 KiB)
3.1415916535897743
1 Like

This is an interesting / somewhat strange result.

However, do note that the migration from single-threaded concurrency with @async to parallel concurrency with Threads.@spawn is still a work in progress, especially in the larger package ecosystem. Therefore, a lot of stuff (anything started with @async) is going to run on thread 1, and you might just be seeing competition with other tasks doing background work. You could try avoiding some of that by running julia --startup-file=no do_benchmark.jl, with all the benchmark code put into do_benchmark.jl.
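To be concrete, here is a sketch of what such a do_benchmark.jl could contain (assuming BenchmarkTools is installed; the file layout and the loop over thread ids are my own, not from this thread):

```julia
# do_benchmark.jl -- standalone benchmark script; run with e.g.:
#   JULIA_NUM_THREADS=4 julia --startup-file=no do_benchmark.jl
using BenchmarkTools
using Base.Threads

counter(next::Channel{Int}) = ch -> put!(next, take!(ch) + 1)

function create_processes(start::Channel{Int}, n::Int; spawn=false)
    next = start
    for _ in 1:n
        next = Channel{Int}(counter(next), spawn=spawn)
    end
    put!(next, 0)
    take!(start)
end

function startonthread(id::Int, start::Channel{Int}, n::Int; spawn=false)
    t = Task(() -> nothing)
    @threads for i in 1:nthreads()
        if i == id
            t = @async create_processes(start, n, spawn=spawn)
        end
    end
    fetch(t)
end

const start = Channel{Int}()
# benchmark the chain once per thread id
for id in 1:nthreads()
    @btime startonthread($id, start, 1000)
end
```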

BTW if it helps — I read your “Come on guys!” as a friendly call out to investigate; “hey look at this weird/interesting thing I found” :slight_smile:

4 Likes

Thank you, it was meant like that. But I’ll remove it if it is misleading. Again, sorry, I’m not a native English speaker.

I tried:

➜  chain (master) julia --startup-file=no do_benchmark.jl                     ✱
  32.348 ms (33025 allocations: 1.77 MiB)
  2.679 ms (33026 allocations: 1.77 MiB)
  2.700 ms (33020 allocations: 1.77 MiB)
  2.717 ms (33026 allocations: 1.77 MiB)
1 Like

Smaller but still notable effect on an Intel® Xeon® CPU E7-4870:

julia> @btime startonthread(1, start, 1000)
  7.526 ms (33277 allocations: 1.85 MiB)
1000

julia> @btime startonthread(2, start, 1000)
  4.055 ms (33278 allocations: 1.85 MiB)
1000

julia> @btime startonthread(3, start, 1000)
  4.458 ms (33278 allocations: 1.85 MiB)
1000

julia> @btime startonthread(4, start, 1000)
  4.204 ms (33278 allocations: 1.85 MiB)
1000

Sorry for my mistake yesterday about reproducibility! I had a misconception.

The difference is smaller at 100_000, but still noticeable:

julia> @btime startonthread(1, start, 1000)
  4.121 ms (33025 allocations: 1.77 MiB)
1000

julia> @btime startonthread(2, start, 1000)
  2.332 ms (33026 allocations: 1.77 MiB)
1000

julia> @btime startonthread(1, start, 100_000)
  579.464 ms (3469044 allocations: 314.20 MiB)
100000

julia> @btime startonthread(2, start, 100_000)
  447.345 ms (3469046 allocations: 314.20 MiB)
100000

Not sure if it is related, but at smaller lengths top reports CPU usage of up to 340% on the julia process during benchmarking, while at larger lengths it goes lower; at 100_000 it is around 100%. This seems to be independent of the selected thread, and I can also see it when running startonthread in a for loop.