How to run tasks in parallel?

I am trying to understand how to work properly with multithreading. Currently I am on Julia 1.3.1, started with JULIA_NUM_THREADS=4 julia, and I am trying to simulate a simple embarrassingly parallel task.

f(x) = sum(x)

x1 = rand(10_000_000)
x2 = rand(10_000_000)

function g(x1, x2)
    t1 = Threads.@spawn f(x1)
    t2 = Threads.@spawn f(x2)
    fetch(t1) + fetch(t2)
end

When I benchmark this, I get strange results:

julia> @btime f($x1)
  4.707 ms (0 allocations: 0 bytes)

julia> @btime f($x2)
  4.652 ms (0 allocations: 0 bytes)

julia> @btime g($x1, $x2)
  9.096 ms (20 allocations: 1.56 KiB)

So it looks like the tasks are run sequentially instead of in parallel. I suppose I am missing something simple, but I can't figure out what exactly.

It's not an overhead problem, because I get similar results with different vector sizes:

x1 = rand(1_000_000);
x2 = rand(1_000_000);

@btime g($x1, $x2)
#  673.669 μs (20 allocations: 1.56 KiB)
@btime f($x1)
#  281.192 μs (0 allocations: 0 bytes)

Your usage looks fine to me. Maybe you are hitting the memory bandwidth limit? What if your f used more CPU and less memory?

function f(n)
    res = 0.0
    for i in 1:n
        res += rand()
    end
    return res
end

function g(n1, n2)
    t1 = Threads.@spawn f(n1)
    t2 = Threads.@spawn f(n2)
    fetch(t1) + fetch(t2)
end

julia> @btime f($10^8)
  563.086 ms (0 allocations: 0 bytes)

julia> @btime g($10^8, $10^8)
  770.733 ms (16 allocations: 1.41 KiB)

What does Threads.nthreads() say? nvm

You are absolutely right. Out of curiosity, I implemented the same code in C++ with similar results: almost no performance gain for the simple sum, but a substantial difference with more CPU-intensive calculations. I guess that's the wonderful world of multi-threading for you. So I suppose the rule of thumb is "if you have many lightweight calculations, then you should try to combine them into one big kernel"?
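A minimal sketch of that rule of thumb (the function name threaded_sum is my own, not from the thread): instead of spawning one lightweight task per vector, split a single large reduction into per-thread chunks, so each thread does one big contiguous block of work:

```julia
# Sketch: one big kernel split across threads instead of many tiny tasks.
# Assumes Julia was started with JULIA_NUM_THREADS > 1.
function threaded_sum(x)
    nt = Threads.nthreads()
    partials = zeros(nt)            # one partial sum per thread
    Threads.@threads for t in 1:nt
        lo = 1 + (t - 1) * length(x) ÷ nt
        hi = t * length(x) ÷ nt
        s = 0.0
        for i in lo:hi              # each thread scans its own chunk
            s += x[i]
        end
        partials[t] = s
    end
    return sum(partials)
end
```

For a memory-bound sum this will still be limited by RAM bandwidth, but the pattern pays off once the per-element work grows.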

Oh, now I suddenly understood what broadcast fusing was about…
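To illustrate the connection (my own example, not from the thread): in Julia, dotted calls fuse into a single loop, while a non-dotted operation between two broadcasts forces separate passes and temporary arrays:

```julia
# Non-fused: sin.(x) and cos.(x) each allocate a temporary array,
# then the non-dotted + allocates a third (three passes over memory).
unfused(x) = sin.(x) + cos.(x)

# Fused: all the dots combine into one loop with a single allocation.
fused(x) = sin.(x) .+ cos.(x)

# In-place fused version: reuses a preallocated buffer, zero allocations.
fused!(out, x) = (out .= sin.(x) .+ cos.(x))
```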


Yes. Personally, I really like the functional approach (like transducers or fmap): instead of processing arrays and storing intermediate results, construct the processing flow by lazy iteration and apply the composed functions once.
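A small sketch of the difference, using only Base (the names eager and lazy are mine): the eager version materializes an intermediate array at each step, while the lazy version composes into a single pass with no intermediate allocations:

```julia
# Eager: filter allocates a new array, abs2.() allocates another,
# and only then do we reduce.
eager(x) = sum(abs2.(filter(>(0.5), x)))

# Lazy: Iterators.filter yields elements on demand, and sum(f, itr)
# applies abs2 as it reduces -- one pass, no temporaries.
lazy(x) = sum(abs2, Iterators.filter(>(0.5), x))
```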

Could you elaborate? I am not familiar with it, but I guess it's something similar to transducers.

There is a great article which is definitely worth reading:

Well, one of the ideas is actually the same: if you have many vectorized computations, it's more practical to combine them into a single loop. Of course, in the broadcast case the allocation of temporary arrays was the more important factor, but memory locality played its part as well.


Thanks, the article is great, I like the comparison to map.
I have 2 thoughts related to this topic:

  1. Languages like C# spend some extra cycles during iteration, so things like this execute faster in parallel. Julia does it so fast that there is no room for that: RAM is just too slow to keep the CPU occupied.
  2. Memory is a bottleneck for modern CPUs (32+ cores). I am trying to develop a sound synthesizer that needs to sum a massive number of sinusoids. I discovered it is faster to generate the samples from a differential equation on the CPU than to read and interpolate them from pre-calculated tables.
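A hypothetical sketch of that idea (the function name sine_osc! and the recurrence choice are my assumptions, not necessarily what the poster uses): a sine wave satisfies the two-term recurrence s[n] = 2cos(ω)·s[n-1] - s[n-2], so each sample costs one multiply and one subtract, with no table reads or interpolation:

```julia
# Digital resonator: generates sin(ω*n) for n = 0, 1, 2, ...
# entirely in registers -- no memory traffic beyond the output writes.
function sine_osc!(out, ω)
    c  = 2cos(ω)
    s1 = sin(-ω)    # s[n-1], seeded at n = 0
    s2 = sin(-2ω)   # s[n-2]
    for n in eachindex(out)
        s = c * s1 - s2
        s2, s1 = s1, s
        out[n] = s
    end
    return out
end
```

Rounding error accumulates slowly over long runs, so in practice such an oscillator is typically re-seeded periodically.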