Are certain operations just not good candidates for threading?


#1

Hi all,

I was experimenting with using threading to speed up a loop over vcat that swaps the first and second half of a sequence of vectors. I fire up Julia and verify:

julia> Threads.nthreads()
4

I wrote the following simple test routine which can be cut-and-paste to the REPL:

function f1(N::Int, J::Int)
    x = [ collect(1:N) for j = 1:J ]
    nhalf = floor(Int, N/2)
    for j = 1:J
        x[j][:] = vcat(x[j][nhalf+1:end], x[j][1:nhalf])
    end
    return x
end
function f1_thread(N::Int, J::Int)
    x = [ collect(1:N) for j = 1:J ]
    nhalf = floor(Int, N/2)
    Threads.@threads for j = 1:J
        x[j][:] = vcat(x[j][nhalf+1:end], x[j][1:nhalf])
    end
    return x
end

x1 = f1(10, 100);
x2 = f1_thread(10, 100);

using BenchmarkTools
@btime f1(10000, 10000);
@btime f1_thread(10000, 10000);

The threaded version is about 10% faster.

Is the conclusion here that this just simply isn’t a good operation for threading? Is the problem that the overhead in setting up the threaded loop is greater than the savings made by threading? If so, are there any other options for parallelizing loops like this?

I also tried using @distributed, but the cost of applying @everywhere to x massively outweighted any gains.

I also tried SharedArray, and had a similar issue with the setup costs, and also kept getting errors when I tried larger examples of the form:

ERROR: On worker 3:
SystemError: shm_open() failed for /jl011485a0lWXYifpyIZJdgO1sIM: Too many open files

So, is the conclusion here that this just isn’t an operation that can be sped up by using multiple cores?

Cheers and thanks for all responders.

Colin


#2

Yes and no. (No, as the overhead is not troublesome unless your arrays are small.) Operations like vcat that allocate heap objects (such as ordinary arrays) must interact with the GC, which is blocking (although thread-safe), so they should be avoided in threaded loops. Your problem is a good candidate for multithreading; if you write out the operation as an explicit loop using immutable temporaries you should see dramatic improvement.


#3

Thanks for responding. I think I understand what you mean about heap objects. My new implementation of the two test routines is:

function f1(N::Int, J::Int)
    x = [ collect(1:N) for j = 1:J ]
    nhalf = floor(Int, N/2)
    for j = 1:J
        for n = 1:nhalf
            (a, b) = (x[j][n], x[j][nhalf+n])
            x[j][n] = b
            x[j][nhalf+n] = a
        end
    end
    return x
end
function f1_thread(N::Int, J::Int)
    x = [ collect(1:N) for j = 1:J ]
    nhalf = floor(Int, N/2)
    Threads.@threads for j = 1:J
        for n = 1:nhalf
            (a, b) = (x[j][n], x[j][nhalf+n])
            x[j][n] = b
            x[j][nhalf+n] = a
        end
    end
    return x
end

But I’m still only seeing a small improvement:

julia> @btime f1(1000, 1000);
  1.432 ms (1001 allocations: 7.76 MiB)

julia> @btime f1_thread(1000, 1000);
  1.250 ms (1002 allocations: 7.76 MiB)

Have I misunderstood what you’re suggesting?

Thanks again, and apologies if I’m slow responding as I’m about to head home for the day.

Cheers,

Colin


#4

Your functions spend most of their time building the arrays. For the loop itself, I see acceleration of about 2x with 3 threads.

I shouldn’t have said this was good for threading; with so little computation, memory bandwidth and/or cache competition mean that more threads don’t do any better than 2x (at least for me).


#5

Gah! Rookie mistake.

Yep, I’m getting the same result as you now.

Thanks for walking me through it.

Cheers,

Colin