Dagger, No speed increase, parallel computing


#1

I see no speedup when using Dagger. Can anyone offer suggestions? The code and the results are as follows.

addprocs()
using Dagger
@show workers()

@everywhere function f1(x)
    sleep(1)
    rand()+x
end

vs = [rand(Int) for i in 1:10];
cks = Any[]
for v in vs
    push!(cks, delayed(f1)(v))
end

f1(1)

## running time with Dagger
tic()
for k in cks
    collect(k)
end
println( @sprintf("Running time using Dagger: %.2f seconds.", toq()) )

## running time without Dagger
tic()
for v in vs
    f1(v)
end
println( @sprintf("Running time without Dagger: %.2f seconds", toq()) )

The results are as follows.
(Screenshot "Capture"; image not included.)


#2

I think collect blocks until the computation is done and then transfers the data. So the loop executes the tasks serially, and only adds the cost of transferring each result back from the worker process.
I'm not sure exactly how Dagger parallelizes, but this gives a speedup:

# addprocs()
using Dagger
@show workers()

@everywhere function f1(x)
    sleep(1)
    rand()+x
end

vs = [rand(Int) for i in 1:10];
cks = Any[]
for v in vs
    push!(cks, delayed(f1)(v))
end
@everywhere combine(x...) = nothing
c = delayed(combine)(cks...)
@time compute(c)

I guess that is because Dagger builds a DAG and schedules independent nodes in parallel.
You should read the docs to find more ways to schedule work in parallel.
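
The difference between the two versions can be illustrated without Dagger at all. The following is a minimal sketch using plain Julia tasks (`@async` and `fetch` from Base) as a stand-in for Dagger's thunks. It is only an analogy, not a description of how Dagger is implemented, and `slow_add`, `run_serial` and `run_overlapping` are hypothetical helpers with a shortened sleep. Waiting on each result before starting the next gives serial timing, while starting everything first and then waiting lets the sleeps overlap.

```julia
# Hypothetical stand-in for f1, with a shorter sleep to keep the demo quick.
slow_add(x) = (sleep(0.2); x + 1)

# Pattern A: start one task, wait for its result, then start the next one.
function run_serial(inputs)
    return [fetch(@async slow_add(x)) for x in inputs]
end

# Pattern B: start all tasks first, then wait for all of them.
function run_overlapping(inputs)
    tasks = [@async slow_add(x) for x in inputs]
    return fetch.(tasks)
end

inputs = 1:5
t_serial = @elapsed results_serial = run_serial(inputs)
t_overlap = @elapsed results_overlap = run_overlapping(inputs)

println("serial: ", round(t_serial, digits=2), " s")
println("overlapping: ", round(t_overlap, digits=2), " s")
```

Both patterns return the same values; only the waiting order differs. This mirrors calling collect on each chunk in a loop versus handing one combined node to compute.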


#3

Thanks for your help. On my computer with 4 cores, there is no speedup when I run the script from the command line. I have read the Dagger documentation, but it is not clear to me. I will use the Julia standard library for parallel computing instead.

I tested two situations:
(1) Run from Juno
When I run the code in Juno, “addprocs()” has to be run before “using Dagger”. There is no speed increase on the first run. If I then delete “addprocs()” and run the code again, there is a speed increase.
(2) Run from the command line
When I run the code from the command line, with or without “addprocs()”, there is no speed increase.

I think this is a bug. I will submit an issue on the project's GitHub.
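
For reference, the standard-library route mentioned above might look like the sketch below. It assumes Julia 1.x, where `addprocs`, `@everywhere` and `pmap` live in the `Distributed` standard library. The worker count of 2 and the shortened sleep are arbitrary choices for the demo, not values from this thread.

```julia
using Distributed

# Add two local worker processes (an arbitrary count for this sketch).
addprocs(2)

# Define the function on every process; the sleep is shortened for the demo.
@everywhere function f1(x)
    sleep(0.2)
    return rand() + x
end

vs = rand(Int, 4)

# pmap hands each element to a free worker and collects the results in order.
results = pmap(f1, vs)
println("got ", length(results), " results")

# Release the worker processes when done.
rmprocs(workers())
```

The first pmap call also pays for compiling f1 on each worker, so time a second call if you want to compare against the serial loop.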


#4

With 8 processes, using a const c (because we’re timing in the global scope), and precompiling compute:

julia> const constc = c
*22*

julia> @time compute(constc)
  2.280853 seconds (4.35 k allocations: 280.203 KiB)
Dagger.Chunk{Void,MemPool.DRef}(Void, Dagger.UnitDomain(), MemPool.DRef(9, 7, 0x0000000000000000), false)

julia> @time compute(constc)
  2.009915 seconds (4.34 k allocations: 278.891 KiB)
Dagger.Chunk{Void,MemPool.DRef}(Void, Dagger.UnitDomain(), MemPool.DRef(3, 11, 0x0000000000000000), false)

julia> @time compute(constc)
  2.009728 seconds (4.36 k allocations: 279.531 KiB)
Dagger.Chunk{Void,MemPool.DRef}(Void, Dagger.UnitDomain(), MemPool.DRef(2, 34, 0x0000000000000000), false)

Looks like exactly what we were expecting.
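
One way to read the roughly 2-second result is a back-of-the-envelope estimate. It assumes each process runs one f1 call at a time and ignores scheduling overhead; that is a simplification for illustration, not a statement about Dagger's scheduler, and `ideal_time` is a hypothetical helper. With 10 one-second tasks, the work takes as many rounds as it needs to hand every task to some process.

```julia
# Idealized run time: number of rounds needed times the duration of one task.
# cld is ceiling division, so 10 tasks on 8 processes need 2 rounds.
ideal_time(n_tasks, n_procs, task_seconds) = cld(n_tasks, n_procs) * task_seconds

println(ideal_time(10, 8, 1.0))  # 8 processes, as in this post
println(ideal_time(10, 1, 1.0))  # a single process, i.e. the serial loop
```

Under these assumptions the estimate is 2.0 seconds for 8 processes and 10.0 seconds for the serial loop.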


#5

@Elrod Thanks for your swift reply! I ran the code on my machine (Windows 7, 64-bit, 4 cores). There is still no speedup; it is even slower. The code and results are as follows.

file: test_dagger.jl

addprocs()
using Dagger
@show workers()

@everywhere function f1(x)
    sleep(1)
    rand()+x
end

vs = [rand(Int) for i in 1:10];
cks = Any[]
for v in vs
    push!(cks, delayed(f1)(v))
end

@everywhere combine(x...) = nothing
c = delayed(combine)(cks...)
const constc = c
f1(1)

## running time with Dagger
tic()
compute(constc)
println( @sprintf("Running time using Dagger: %.2f seconds.", toq()) )

## running time without Dagger
tic()
for v in vs
    f1(v)
end
println( @sprintf( "Running time without Dagger: %.2f seconds", toq()) )

I run the Julia file from the command line. When I start Julia with several worker processes (“julia -p 3 dagger.jl”), I comment out the line “addprocs()”. The results are as follows.


#6

@zhangliye

You still need to run the function at least once to compile it.
If you put everything into a module, you can precompile it (the Julia frontend anyway; LLVM still has to go the rest of the way, so there will still be a little lag the first time).
There is also overhead from dynamic dispatches when you run things from global scope.

Using constants lets you avoid that when using @time. BenchmarkTools lets you avoid it while benchmarking by interpolating the arguments, e.g. @benchmark f($x).

BenchmarkTools doesn’t print well when called from a script.
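
The compile-once point can be seen with Base alone. The following is a small sketch that assumes nothing beyond `@elapsed`; `work` is a hypothetical workload, not code from this thread. The first call usually takes noticeably longer because it includes compilation, while the second call reuses the compiled code.

```julia
# Hypothetical workload; any function would do.
work(n) = sum(sqrt(i) for i in 1:n)

# First call: the measured time includes JIT compilation of `work`.
t_first = @elapsed work(1_000)

# Second call with the same argument types: the compiled code is reused.
t_second = @elapsed work(1_000)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

This is why a single tic()/toq() pair around the first compute call mostly measures compilation rather than the parallel run itself.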

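Anyway, I got the following. The first six timing lines and the Trial line come from the Dagger runs, despite the "without Dagger" label; the last two lines are the serial runs: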

$ julia -O3 DaggerTests.jl 
workers() = [2, 3, 4, 5, 6, 7]
Compiling time without Dagger: 10.18 seconds
Running time without Dagger: 2.48 seconds
Running time without Dagger: 2.41 seconds
Running time without Dagger: 2.40 seconds
Running time without Dagger: 2.02 seconds
Running time without Dagger: 2.02 seconds
Trial(2.018 s)
Compiling time without Dagger: 10.03 seconds
Running time without Dagger: 10.02 seconds

From:

addprocs()
using Dagger
@show workers()


@everywhere function f1(x)
    sleep(1)
    rand()+x
end

vs = [rand(Int) for i in 1:10];
cks = [delayed(f1)(v) for v in vs]

@everywhere combine(x...) = nothing
c = delayed(combine)(cks...)

##

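## running time with Dagger
## (the label below still prints "without Dagger"; it is kept so the code matches the output shown above)
function dagger_run(c, str = "Running")
    tic()
    compute(c)
    println( @sprintf( "%s time without Dagger: %.2f seconds", str, toq()) )
end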
function repeat_d(c)
    dagger_run(c, "Compiling")
    dagger_run(c, "Running")
    dagger_run(c, "Running")
    dagger_run(c, "Running")
    dagger_run(c, "Running")
    dagger_run(c, "Running")
end
repeat_d(c)

using BenchmarkTools
println( @benchmark compute($c) )

## running time without Dagger
function serial_run(f, str = "Running")
    tic()
    for v in vs
        f(v)
    end
    println( @sprintf( "%s time without Dagger: %.2f seconds", str, toq()) )
end


function repeat_s(f)
    serial_run(f, "Compiling")
    serial_run(f, "Running")
end
repeat_s(f1)

Also, crossposted.