Threads/Parallel


#1

Trying to understand best way to code loops/calculations for a simple array. Code is below. I would like to be able to active all 4 cpus on my laptop so trying to work out how to do this. I find the manual complicated to understand this simple activity with all seemingly possible options. NB. I’ve not included use of @parallel on the loops since I expected that the dotprallel. would have used all threads.
NB. Have set in shell 4 threads before running Julia

N=1e8
Threads.nthreads()

function trial(N::Int64)
    function dotparallel(x)
        return sin.(x.^2)
    end
    function dotparallel!(x)
        x=sin.(x.^2)
    end
    # Pre-compile before testing
    o1=1.
    o1=dotparallel(o1)
    o2=0.
    o2=dotparallel!(o2)

    # Now run parallel trials
    ## Initialise arrays
    
    ## Run first trial
    ck1=zeros(Int(N))
    @time begin
        for x=1:N
            ck1[Int(x)]=dotparallel(x)
        end
    end
    println(*("Check",@sprintf("%g",ck1[N])))
    
    ## Run second trial
    ck1=zeros(Int(N))
    @time begin
        Threads.@threads for x=1:N
            ck1[Int(x)]=dotparallel(x)
        end
    end
    println(*("Check",@sprintf("%g",ck1[N])))
    
    ## Run third trial - using inplace dot operator
    ck1=[x for x=1:N]
    @time ck1=dotparallel.(ck1)
    println(*("Check",@sprintf("%g",ck1[N])))
    
    ## Run third trial - inplace array
    ck1=[x for x=1:N]
    @time ck1=dotparallel!.(ck1)
    println(*("Check",@sprintf("%g",ck1[N])))
    
end

# Run pre-compilation trial
println("Pre-compilation run")
trial(2)

# Run benchmark case
println("Speed run")
trial(10000)

Results:
Pre-compilation run
0.002612 seconds (28 allocations: 1.688 KB)
Check-0.756802
0.040687 seconds (7.34 k allocations: 298.478 KB)
Check-0.756802
0.067961 seconds (12.17 k allocations: 520.295 KB)
Check-0.756802
0.066354 seconds (12.79 k allocations: 550.111 KB)
Check-0.756802
Speed run
33.462584 seconds (159.47 k allocations: 8.537 MB)
Check0.931639
52.864397 seconds (159.47 k allocations: 8.537 MB, 0.02% gc time)
Check0.931639
117.130010 seconds (160.03 k allocations: 8.622 MB)
Check0.931639
137.695686 seconds (169.52 k allocations: 8.767 MB)
Check0.931639


#2

There are some unusual features in your code:

  • You are defining your functions inside the main function. Not sure if this messes with the precompilation, but it is probably better to define dotparallel etc. outside trial.
  • You are doing ‘double dotting’, having dots inside dotparallel and then also calling it with and without a dot. It’s not necessarily wrong, but it’s confusing.
  • dotparallel! is not working in-place, you are creating a new variable x that overwrites the old one.
  • There is no need to sprinkle your code with Int(N), Int(x), and so on, those values are already Ints, it just visual noise.

Here is some code that does more-or-less what you are trying to accomplish. Note that in-place assignment can be achieved with .=

Oddly, I have to assign a separate array y for the threaded loop, otherwise everything slows down significantly.

f(x) = sin(x^2)
function trial(N::Int64)
    r = rand(N)
    x = similar(r)
    
    @time for k in eachindex(r) 
        x[k] = f(r[k])
    end

    @time x = f.(r)  # broadcasting
    @time x .= f.(r)  # in-place broadcasting

    y = similar(r)  # have to make a new array here for some reason
    @time Threads.@threads for k in eachindex(r) 
        y[k] = f(r[k])
    end
    
    return x == y
end

After warm-up:

julia> test(10^8)
  1.377276 seconds
  1.395597 seconds (4 allocations: 762.940 MB, 4.82% gc time)
  0.991549 seconds (1 allocation: 16 bytes)
  0.451913 seconds (2 allocations: 64 bytes)
true

I’m surprised that threads are actually giving a speed-up here. I would have thought the overhead would drown out any parallel gains.


#3

Oh, another big problem. Here you are taking an array of Ints and you want to change it in-place to an array of Float64s? That is a bad idea. I suggest you read up on “type stability” in the manual.


#4

Thanks your reply is excellent and is helping to understand how Julia is working. A further question. Using broadcasting is really neat but it seems natural that it should automatically use all cores in my laptop. How can the broadcasting method be used but each operation automatically spread on all cores. I’d prefer my library to do this.


#5

. broadcasting is very new. In fact, . fusion of old operators like .+ only started fusing on Julia’s v0.6 master yesterday. Threading is still experimental. However, with threading becoming more stable, and with . broadcasting now very solid, I am pretty sure this is coming soon.

Until then, the easiest way is just to loop and add Threads.@threads, i.e. instead of y.=f.(x), do

Threads.@threads for i in eachindex(x)
  y[i] = f(x[i])
end

Or you can use a multiprocessing construct like pmap.


#6

I was surprised to see that threads were giving a nice speed-up here, since I thought sin(x^2) was such a light-weight calculation that the overhead of threads would drown out the gains. Maybe someone can shed some light on how this works? When can one expect speed-ups from multi-threading?

Perhaps there is not a new thread created for each element in the array?

And one more thing: in my code in post 2, I had to create a new array y to get the speed-ups when threading. If I re-used x everything slowed down considerably, not just the threaded loop.


#7

Wow, no, I think not, that shouldn’t help for any array.

If I recall, you can set how many with an ENV variable, but the default is the number of cores, I think virtual. I guess you most likely don’t have to mess with the default. Except to turn on (it’s experimental now, so default is 1).


#8

Perhaps I misspoke. I realize there can only be a few threads at the same time, but I thought they would maybe be created and destroyed continuously.

So how would it work in a case like this? Why so little overhead?


#9

Usually the creation of a thread/process is more expensive than sending some data/message to a existing thread/process. So in general, it is more efficient to start multiple threads at the beginning (or lazily) and then schedule jobs to those worker threads/processes when necessary.

For a fixed size threaded loops, the simplest and reasonably efficient thing to do is to divide the loop range into multiple sub ranges and let each threads handle one of the sub ranges.This way, you can also take advantage of the single thread loop optimizations.


#10

In the code I wrote there was no explicit subdivision of the range. Does that mean that the @thread macro does that automatically?


#11

Yes.


<20 char limit>


#12

Thanks all - this has been most helpful.


#13

I’m also still trying to learn using the parallel processing constructs of Julia, but I can guess at the problem here:
I suspect that is similar to what was killing my attempt at using @parallel (in that case over processes) a few weeks ago. Didn’t want to try threading until it was stable…

When modifying the existing variable, which is also being used by the parallel threads, it might be a case that you running into synchronisation/locking issues. I’m not sure if the compiler (might even be cache coherence issues at the bottom) is able to determine that each thread is only working on its own part of the array and can go ahead and modify that without having to worry about falling over the feet of the other threads. Working into a separate output array, the one becomes read only and the one write only and it thus becomes easier.

Or is this not really something we need to worry about, that the compiler fully handles it and there is actually another reason that someone can think of?

Thanks


#14

Threading is still an experimental feature in Julia. Once it is stable we can consider introducing implicit parallelism into constructs like broadcast and/or comprehensions.


#15

A very mundane question: How do I start a new thread?

julia> Threads.nthreads()
1

If I run DNF’s trial function I get

julia> trial(10^8)
  1.133145 seconds
  1.283133 seconds (4 allocations: 762.940 MB, 3.83% gc time)
  1.028908 seconds (1 allocation: 16 bytes)
  1.277151 seconds (2 allocations: 64 bytes)
true

#16

http://docs.julialang.org/en/stable/manual/parallel-computing/#multi-threading-experimental

[quote]The number of threads Julia starts up with is controlled by an environment variable
called JULIA_NUM_THREADS. Now, let’s start up Julia with 4 threads:

export JULIA_NUM_THREADS=4

(The above command works on bourne shells on Linux and OSX. Note that if you’re using
a C shell on these platforms, you should use the keyword set instead of export.
If you’re on Windows, start up the command line in the location of julia.exe and
use set instead of export.)[/quote]


#17

Unlike Java or C++, Julia’s threading model does not involve explicitly starting or stopping actual threads. Instead, you express potential concurrency using the @threads macro, and in the future, by starting tasks, which will at some point be possibly scheduled on different kernel threads. This is similar to the Cilk or Go threading models, if you’re familiar with those.


#18

Thanks – I should have RTFM… I’m unfortunately not familiar with Cilk or Go, but I this thread has definitely helped my understanding on how to use them in Julia.


#19

I’m starting to use Threads. I definitely see some speedup, but the initial overhead is significant.
How can I start up the threads as you mention?


#20

That’s already how it’s implemented.