Slower with threads

Hello everybody! My actual code takes 16 hours, so I set out to learn about parallelism. I started with @task and the execution time went down to 10 hours. Very good! Since I have a server with 64 cores, I thought about using @threads with julia --threads 20 prog.jl, but execution took even longer, as in the simple example below. What is wrong? (I’m not interested in the result of the sum itself.)

# Command Line:
# $ time julia --threads 3 test.jl

using Base.Threads

const N = Int(1e8)

function test1()
# time taken: 789.3 ± 2.8 ms
	x = 0
	v = [i for i=1:N]
	for v in v
		x += v
	end
end

function test2()
# time taken: 5407.30 ± 0.13 ms
	x = 0
	v = [i for i=1:N]
	@threads for v in v
		x += v
	end
end

# test1()
# test2()

function my_short_real_code()
	T = [range(0.0, 2.0, 20);]
	data = zeros(0, 2)
	@threads for t in T
		params = some_function(t)
		opt = optimize(params)
		data = vcat(data, [t opt.minimum])
	end
	data # need to sort by first column
end

I think you will need a more representative example (thanks for trying to boil it down to some minimal example by the way!). The problem with this example is that

	@threads for v in v
		x += v
	end

This code has a race condition (every thread tries to read and write the same variable x at the same time), which makes the result unreliable.
Another issue is that the actual operation may not be very costly, so the cost of spawning the threads may be larger than the computation itself (the same goes for the third function, where all the threads write to data).
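
For illustration, one race-free way to write that threaded sum is to give each chunk of v its own accumulator and combine the partial sums at the end. A minimal sketch (not from the original post, Base only):

using Base.Threads

function threaded_sum(v)
	# one chunk per thread; each task writes only to its own slot of `partials`
	chunks = collect(Iterators.partition(v, cld(length(v), nthreads())))
	partials = zeros(eltype(v), length(chunks))
	@threads for i in eachindex(chunks)
		s = zero(eltype(v))
		for x in chunks[i]
			s += x
		end
		partials[i] = s
	end
	return sum(partials)  # serial combination of the per-chunk results
end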

Another thing: if the operations in some_function(t) and optimize(params) are limited by your machine’s memory (capacity or bandwidth), then using more threads can make things worse, because the threads end up competing for the same memory.

3 Likes

Your tests look very different from my_short_real_code. In particular, it is likely that most of the time spent in the tests goes to memory allocation. Additionally, neither the tests nor my_short_real_code is naturally parallel as written, since they all try to update a single value, either x or data.

The overall pattern of your code fits well into a map-reduce paradigm. Essentially, the solution to your problem involves separating the “map” step from the “reduce” step. Preferably, each thread should be able to work independently of the other threads. As those tasks complete, we “reduce” them by aggregating the results of the individual tasks together.
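
To make the split concrete, here is a hand-rolled sketch of that pattern using Threads.@spawn (just an illustration; Folds.jl below packages the same idea):

using Base.Threads

function threaded_mapreduce(f, op, xs)
	# "map": one task per chunk, each reducing its own slice independently
	chunks = collect(Iterators.partition(xs, cld(length(xs), nthreads())))
	tasks = map(chunks) do chunk
		Threads.@spawn mapreduce(f, op, chunk)
	end
	# "reduce": aggregate the per-task results as they complete
	return mapreduce(fetch, op, tasks)
end

threaded_mapreduce(identity, +, 1:10^8)  # same result as sum(1:10^8)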

For the tests we can use an off-the-shelf solution from Folds.jl:

Below, basesize controls the size of the individual partitions of v and thus how many threads are effectively run. The optimum happens to occur somewhere between 8 and 16 threads. Threads have overhead, so adding additional threads is often not better.
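
For reference, the setup the benchmarks below assume (the same v as in your tests; exact timings will of course differ by machine):

using BenchmarkTools
using Folds

N = Int(1e8)
v = [i for i in 1:N];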

julia> @benchmark sum(v)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  53.697 ms … 72.191 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.472 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.054 ms ±  5.438 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> for nthreads in 2 .^(0:6)
           @info "Number of Threads" nthreads
           display(@benchmark(Folds.sum($v; basesize=length($v) ÷ $nthreads)))
       end
┌ Info: Number of Threads
└   nthreads = 1
BenchmarkTools.Trial: 74 samples with 1 evaluation.
 Range (min … max):  64.401 ms … 80.757 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.238 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   67.911 ms ±  4.422 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 16 bytes, allocs estimate: 1.
┌ Info: Number of Threads
└   nthreads = 2
BenchmarkTools.Trial: 137 samples with 1 evaluation.
 Range (min … max):  33.582 ms … 51.062 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.462 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.700 ms ±  4.040 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 720 bytes, allocs estimate: 9.
┌ Info: Number of Threads
└   nthreads = 4
BenchmarkTools.Trial: 196 samples with 1 evaluation.
 Range (min … max):  19.430 ms … 89.707 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.721 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.486 ms ±  7.196 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 2.08 KiB, allocs estimate: 25.
┌ Info: Number of Threads
└   nthreads = 8
BenchmarkTools.Trial: 260 samples with 1 evaluation.
 Range (min … max):  13.052 ms … 90.392 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.805 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.236 ms ±  8.517 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 4.83 KiB, allocs estimate: 57.
┌ Info: Number of Threads
└   nthreads = 16
BenchmarkTools.Trial: 231 samples with 1 evaluation.
 Range (min … max):  11.369 ms … 104.828 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.930 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.624 ms ±  13.880 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 10.39 KiB, allocs estimate: 123.
┌ Info: Number of Threads
└   nthreads = 32
BenchmarkTools.Trial: 154 samples with 1 evaluation.
 Range (min … max):  11.710 ms … 124.747 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.366 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.666 ms ±  18.949 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 21.55 KiB, allocs estimate: 256.
┌ Info: Number of Threads
└   nthreads = 64
BenchmarkTools.Trial: 96 samples with 1 evaluation.
 Range (min … max):  12.607 ms … 168.497 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.065 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.376 ms ±  34.514 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 43.98 KiB, allocs estimate: 525.

Try running your own tests with different numbers of threads to see how well they scale.

For my_short_real_code I would consider reorganizing as follows:

function my_short_real_code()
	T = [range(0.0, 2.0, 20);]
	minima = zeros(length(T))
	@threads for (i, t) in collect(enumerate(T))
		params = some_function(t)
		opt = optimize(params)
		minima[i] = opt.minimum
	end
	return [T minima] # already sorted
end
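
If the collect(enumerate(T)) allocation bothers you, an equivalent formulation indexes T directly; a sketch, assuming the same some_function and optimize as in your code:

function my_short_real_code()
	T = [range(0.0, 2.0, 20);]
	minima = zeros(length(T))
	@threads for i in eachindex(T)
		params = some_function(T[i])   # some_function / optimize as in the original post
		opt = optimize(params)
		minima[i] = opt.minimum        # each iteration writes only its own slot, so no race
	end
	return [T minima]                  # rows already ordered by T
end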

Here’s another way to do test1 and test2:

using BenchmarkTools
using ThreadsX

function test1()
    N = Int(1e8)
    reduce(+, 1:N)
end

function test2()
    N = Int(1e8)
    ThreadsX.reduce(+, 1:N)
end

@btime(test1())
# result
  683.300 μs (0 allocations: 0 bytes)
5000000050000000

@btime(test2())
# result
  152.400 μs (2166 allocations: 172.11 KiB)
5000000050000000

FYI:

julia> using Base.Threads

julia> nthreads()
32
1 Like

Thanks, I’ll see if I can make another example, already taking advantage of some suggestions. The server has a lot of memory.

I made the example different so it would be simpler. Anyway, shouldn’t parallelization work even in this simpler example? Judging by my way of thinking, it seems I still have a lot to learn. I will test your my_short_real_code suggestion in the real code and post the result.

Thanks, I’ll take a look and see how I use it.

Threads are not a free lunch. It takes time to start a thread. It takes time to divide up the problem between the threads. What one has to balance is the overhead of parallelization versus the benefit of parallelization.

In the simple example, note that there is an underlying level of parallelization in the processor’s SIMD instructions. Modern processor instructions can process multiple numbers at once on the same core. Dividing up a problem too much can interfere with this.
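
To make the overhead point concrete, here is a small illustrative sketch (using Folds.sum with basesize as above; the exact numbers are machine-dependent):

using BenchmarkTools, Folds

w = rand(10_000)

@btime sum($w)                         # serial, SIMD-vectorized: a few microseconds
@btime Folds.sum($w; basesize=10)      # ~1000 tiny chunks: task/scheduling overhead dominates
@btime Folds.sum($w; basesize=5_000)   # 2 big chunks: low overhead, but too little work to gain much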

4 Likes

I fixed it using your tip. Now the threads are independent and things got faster, but the threaded run still takes more time overall.

I also have a Task-only version, and yes, it is faster than the single-threaded version.

My actual execution times:
Using @task: 10h
Normal: 15h
Using @threads: 26h

Frankly, what this seems to indicate is that the limiting factor may actually be memory allocation. You may not have enough memory to effectively run more than one job at once.
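
One quick way to check that hypothesis is to look at the allocation and GC figures Julia reports; a generic sketch, where your_function is only a stand-in for the real workload:

using BenchmarkTools

# `your_function` is a placeholder for the real code being measured
@time your_function()       # prints elapsed time, allocation count/size, and % GC time
@benchmark your_function()  # adds a memory estimate and GC statistics over many samples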

Maybe on my notebook, but in this case I’m using an Ubuntu 20.04.4 LTS server with 64 cores and 252 GB of RAM, running only my Julia program, and I’m still not able to take advantage of parallelization.

I can’t yet select one answer as the solution. For now I need to learn more about this. Thank you all for your help.

The best way to progress concretely would be to provide an executable minimum working example demonstrating the issue.

A more representative example.

using LinearAlgebra

if !@isdefined σz
	const 𝕚 = im
	const ket0 = ComplexF64[1, 0]
	const ket1 = ComplexF64[0, 1]
	const σx = ComplexF64[0  1; 1  0]
	const σz = ComplexF64[1  0; 0 -1]
end

# Runge Kutta 4th order
# dy/dt = f(t,y), y(t0) = y0
function rungekutta4(t0, y0, t, f::Function)
	h = 1e-3
	n = round(Int64, (t-t0)/h)
	h = (t - t0)/n
	h_2, h_6 = h/2.0, h/6.0
	tn, yn = t0, copy(y0)
	for i=1:n
		k1 = f(tn, yn)
		k2 = f(tn + h_2, yn + k1*h_2)
		k3 = f(tn + h_2, yn + k2*h_2)
		k4 = f(tn + h, yn + k3*h)
		yn += h_6*(k1 + 2*k2 + 2*k3 + k4)
		tn += h
	end
	yn
end

# Fidelity of quantum states ρ1, ρ2
function fidelity(ρ1, ρ2)
	f = (tr(√( √ρ1 * ρ2 * √ρ1 )))^2
	real(f)
end

# Evolves states ρ during time T
# dρ/dt = -i[H,ρ] = f(t,ρ)
function evol(ρ, T)
	H(t) = σz + sin(2π*t)*σx
	f(t,ρ) = -𝕚*(H(t)*ρ - ρ*H(t))
	rungekutta4(0.0, ρ, T, f)
end

# Fidelity in T
function fid(T)
	ρini = [0.4 0.0; 0.0 0.6] # initial state
	ρtgt = [0.6 0.0; 0.0 0.4] # target state
	ρevo = evol(ρini, T)      # evolved states
	fidelity(ρtgt, ρevo)      # ρevo ≈? ρtgt
end

# using PyPlot
using Base.Threads

# Fidelity versus T
function fid_versus_T()
	T = [0.0:0.01:3.0;]
	F = zeros(length(T))
	@threads for (i,T) in collect(enumerate(T))
		F[i] = fid(T)
	end
	m,i = findmax(F)
	println("max=($(T[i]), $m)")
	# plot(T, F)
	# xlabel("T"); ylabel("fid")
end

fid_versus_T()

# 1 thread: julia
# BenchmarkTools.Trial: 4 samples with 1 evaluation.
# Range (min … max):  1.436 s …   1.480 s  ┊ GC (min … max): 4.20% … 4.18%
# Time  (median):     1.470 s              ┊ GC (median):    4.23%
# Time  (mean ± σ):   1.464 s ± 19.550 ms  ┊ GC (mean ± σ):  4.37% ± 0.33%

# 8 threads: julia -t8
# BenchmarkTools.Trial: 4 samples with 1 evaluation.
# Range (min … max):  1.473 s …   1.501 s  ┊ GC (min … max): 51.69% … 52.77%
# Time  (median):     1.484 s              ┊ GC (median):    52.11%
# Time  (mean ± σ):   1.485 s ± 12.120 ms  ┊ GC (mean ± σ):  52.17% ±  0.47%
1 Like

In this case it was really easy to optimize: you just need to replace your arrays with their StaticArrays.jl versions (using @SArray). Because they are small, of fixed size, and you don’t even mutate them, I didn’t have to make any other adjustments, and the code went from 1.975806 seconds (19.43 M allocations: 2.316 GiB, 12.19% gc time) to 0.074329 seconds (25 allocations: 11.594 KiB).

module TestModule

    using LinearAlgebra
    using StaticArrays

    if !@isdefined σz
        const 𝕚 = im
        const ket0 = @SArray ComplexF64[1, 0]
        const ket1 = @SArray ComplexF64[0, 1]
        const σx = @SArray ComplexF64[0  1; 1  0]
        const σz = @SArray ComplexF64[1  0; 0 -1]
    end

    # Runge Kutta 4th order
    # dy/dt = f(t,y), y(t0) = y0
    function rungekutta4(t0, y0, t, f::Function)
        h = 1e-3
        n = round(Int64, (t-t0)/h)
        h = (t - t0)/n
        h_2, h_6 = h/2.0, h/6.0
        tn, yn = t0, copy(y0)
        for i=1:n
            k1 = f(tn, yn)
            k2 = f(tn + h_2, yn + k1*h_2)
            k3 = f(tn + h_2, yn + k2*h_2)
            k4 = f(tn + h, yn + k3*h)
            yn += h_6*(k1 + 2*k2 + 2*k3 + k4)
            tn += h
        end
        yn
    end

    # Fidelity of quantum states ρ1, ρ2
    function fidelity(ρ1, ρ2)
        f = (tr(√( √ρ1 * ρ2 * √ρ1 )))^2
        real(f)
    end

    # Evolves states ρ during time T
    # dρ/dt = -i[H,ρ] = f(t,ρ)
    function evol(ρ, T)
        H(t) = σz + sin(2π*t)*σx
        f(t,ρ) = -𝕚*(H(t)*ρ - ρ*H(t))
        rungekutta4(0.0, ρ, T, f)
    end

    # Fidelity in T
    function fid(T)
        ρini = @SArray [0.4 0.0; 0.0 0.6] # initial state
        ρtgt = @SArray [0.6 0.0; 0.0 0.4] # target state
        ρevo = evol(ρini, T)      # evolved states
        fidelity(ρtgt, ρevo)      # ρevo ≈? ρtgt
    end

    # using PyPlot
    using Base.Threads

    # Fidelity versus T
    function fid_versus_T()
        T = [0.0:0.01:3.0;]
        F = zeros(length(T))
        @threads for (i,T) in collect(enumerate(T))
            F[i] = fid(T)
        end
        m,i = findmax(F)
        println("max=($(T[i]), $m)")
        # plot(T, F)
        # xlabel("T"); ylabel("fid")
    end
end
1 Like

The parallelization effect is still small, but the use of @SArray was amazing. It is worth rewriting some of my code using @SArray. I will wait a bit to see if someone has more to say about parallelization before I close the topic. Thanks a lot.

Before, parallelization was difficult because the timing was dominated by memory allocation. Now the runtime is so short that we will likely need low-overhead threads, as in Polyester.jl.
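
For example, a minimal sketch of what that could look like for the fid_versus_T loop (assuming Polyester.jl is installed and the fid function from the example above; @batch replaces @threads and reuses a pool of cheap tasks):

using Polyester: @batch

function fid_versus_T_batch()
	T = [0.0:0.01:3.0;]
	F = zeros(length(T))
	@batch for i in eachindex(T)  # low-overhead threading; each iteration writes only F[i]
		F[i] = fid(T[i])          # fid as defined in the earlier example
	end
	m, i = findmax(F)
	println("max=($(T[i]), $m)")
end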

2 Likes

Soon I will run the long-duration test and post the result. Just a reminder: I am trying to optimize runs that take from 10 to 48 hours.

I don’t want to be an inconvenience, but I would like to ask three questions: 1) For this optimization to work, do all the arrays have to be static? 2) In some cases I have to make copies of the StaticArrays and modify them, and then it crashes! What should I do in that case? 3) Do SparseArrays give similar optimizations?

Do SMatrix and MMatrix have the same performance?