Hello everybody! My actual code takes 16 hours to run, so I went to learn about parallelism. I started with @task and the execution time went down to 10 hours. Very good! Since I have a server with 64 cores, I thought about using @threads with julia --threads 20 prog.jl, but execution took even longer, as in the simple example below. What is wrong? I'm not interested in the result of the sum.
# Command Line:
# $ time julia --threads 3 test.jl
using Base.Threads
const N = Int(1e8)
function test1()
# time taken: 789.3 ± 2.8 ms
x = 0
v = [i for i=1:N]
for v in v
x += v
end
end
function test2()
# time taken: 5407.30 ± 0.13 ms
x = 0
v = [i for i=1:N]
@threads for v in v
x += v
end
end
# test1()
# test2()
function my_short_real_code()
T = [range(0.0, 2.0, 20);]
data = zeros(0, 2)
@threads for t in T
params = some_function(t)
opt = optimize(params)
data = vcat(data, [t opt.minimum])
end
data # need to sort by first column
end
I think you will need a more representative example (thanks for trying to boil it down to a minimal example, by the way!). The problem with this example is that
@threads for v in v
x += v
end
This code has a race condition (every thread tries to read and write the same variable at the same time), which makes the result unreliable.
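Just to illustrate, a minimal sketch that makes the increment safe uses a Threads.Atomic accumulator. This fixes correctness but not speed, because all threads still contend on the same counter:
using Base.Threads

function test2_atomic()
    x = Atomic{Int}(0)       # shared counter with atomic updates
    v = [i for i = 1:10^8]
    @threads for vi in v
        atomic_add!(x, vi)   # safe, but every thread serializes on x
    end
    x[]                      # read the final value
end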
Another issue is that the actual operation may not be very costly, so the cost of spawning the threads may be larger than the computation itself (the same applies to the third function, where all the threads write to data).
Also, if the operations in some_function(t) and optimize(params) are limited by your computer's memory, then using more threads will make things worse, because the threads will be competing for the same memory.
Your tests look very different from my_short_real_code. In particular, it is likely that most of the time during the tests is spent on memory allocation. Additionally, neither the tests nor my_short_real_code is naturally parallel as written, since each tries to update a single value, either x or data.
The overall pattern of your code fits well into a map-reduce paradigm. Essentially, the solution to your problem involves separating the "map" step from the "reduce" step: preferably, each thread should be able to work independently of the others, and as the tasks complete, we "reduce" them by aggregating their individual results together. A hand-rolled sketch of the pattern is shown below.
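A minimal sketch of that separation (the function name is illustrative): split the data into chunks, let each task sum its own chunk with no shared state, then combine the partial sums serially.
using Base.Threads

function mapreduce_sum(v; ntasks = nthreads())
    # "map": each task sums its own chunk independently
    chunks = Iterators.partition(v, cld(length(v), ntasks))
    tasks = [Threads.@spawn sum(chunk) for chunk in chunks]
    # "reduce": combine the independent partial sums serially
    return sum(fetch, tasks)
end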
For the tests, we can use an off-the-shelf solution from Folds.jl.
Below, basesize controls the size of the individual partitions of v and thus how many threads are effectively used. The optimum happens to occur somewhere between 8 and 16 threads; threads have overhead, so adding more of them is often not better.
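For context, the benchmarks below assume a setup along these lines (the exact vector is my reconstruction):
using BenchmarkTools
using Folds

v = collect(1:10^8)  # the vector summed in the benchmarks below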
julia> @benchmark sum(v)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  53.697 ms … 72.191 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.472 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.054 ms ±  5.438 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  53.7 ms        Histogram: frequency by time          72 ms <
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> for nthreads in 2 .^ (0:6)
           @info "Number of Threads" nthreads
           display(@benchmark(Folds.sum($v; basesize=length($v) ÷ $nthreads)))
       end
┌ Info: Number of Threads
└   nthreads = 1
BenchmarkTools.Trial: 74 samples with 1 evaluation.
 Range (min … max):  64.401 ms … 80.757 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.238 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   67.911 ms ±  4.422 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  64.4 ms        Histogram: frequency by time        76.8 ms <
 Memory estimate: 16 bytes, allocs estimate: 1.
┌ Info: Number of Threads
└   nthreads = 2
BenchmarkTools.Trial: 137 samples with 1 evaluation.
 Range (min … max):  33.582 ms … 51.062 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.462 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.700 ms ±  4.040 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  33.6 ms        Histogram: frequency by time        48.6 ms <
 Memory estimate: 720 bytes, allocs estimate: 9.
┌ Info: Number of Threads
└   nthreads = 4
BenchmarkTools.Trial: 196 samples with 1 evaluation.
 Range (min … max):  19.430 ms … 89.707 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.721 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.486 ms ±  7.196 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  19.4 ms        Histogram: frequency by time        60.6 ms <
 Memory estimate: 2.08 KiB, allocs estimate: 25.
┌ Info: Number of Threads
└   nthreads = 8
BenchmarkTools.Trial: 260 samples with 1 evaluation.
 Range (min … max):  13.052 ms … 90.392 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.805 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.236 ms ±  8.517 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  13.1 ms     Histogram: log(frequency) by time      50.3 ms <
 Memory estimate: 4.83 KiB, allocs estimate: 57.
┌ Info: Number of Threads
└   nthreads = 16
BenchmarkTools.Trial: 231 samples with 1 evaluation.
 Range (min … max):  11.369 ms … 104.828 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.930 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.624 ms ±  13.880 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  11.4 ms     Histogram: log(frequency) by time      80.9 ms <
 Memory estimate: 10.39 KiB, allocs estimate: 123.
┌ Info: Number of Threads
└   nthreads = 32
BenchmarkTools.Trial: 154 samples with 1 evaluation.
 Range (min … max):  11.710 ms … 124.747 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.366 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.666 ms ±  18.949 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  11.7 ms        Histogram: frequency by time        82.9 ms <
 Memory estimate: 21.55 KiB, allocs estimate: 256.
┌ Info: Number of Threads
└   nthreads = 64
BenchmarkTools.Trial: 96 samples with 1 evaluation.
 Range (min … max):  12.607 ms … 168.497 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.065 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.376 ms ±  34.514 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  12.6 ms        Histogram: frequency by time         160 ms <
 Memory estimate: 43.98 KiB, allocs estimate: 525.
Try running your tests with different numbers of threads yourself to see how well they scale.
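For reference, the thread count is fixed when Julia starts, for example:
$ julia --threads 16 test.jl
and you can check it from within a session with Threads.nthreads().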
For my_short_real_code I would consider reorganizing as follows:
function my_short_real_code()
T = [range(0.0, 2.0, 20);]
minima = zeros(length(T))
@threads for (i, t) in collect(enumerate(T))
params = some_function(t)
opt = optimize(params)
minima[i] = opt.minimum
end
return [T minima] # already sorted
end
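Two details worth noting in this version: the collect is needed because @threads requires an indexable iteration space, which the bare enumerate(T) iterator does not provide, and writing each result into its own slot minima[i] is what removes the data race. Since T is constructed in increasing order, the output needs no sorting afterwards.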
using BenchmarkTools
using ThreadsX
function test1()
N = Int(1e8)
reduce(+, 1:N)
end
function test2()
N = Int(1e8)
ThreadsX.reduce(+, 1:N)
end
@btime(test1())
# result
683.300 μs (0 allocations: 0 bytes)
5000000050000000
@btime(test2())
# result
152.400 μs (2166 allocations: 172.11 KiB)
5000000050000000
I made a different, simpler example. Anyway, shouldn't parallelization work even in this simpler example? From my way of thinking, it seems I still have a lot to learn. I will test your suggestion for my_short_real_code in the real code and post the result.
Threads are not a free lunch. It takes time to start a thread, and it takes time to divide the problem up between the threads. What one has to balance is the overhead of parallelization against its benefit.
In the simple example, note that there is an underlying level of parallelism in the processor's SIMD instructions: modern processors can process multiple numbers at once on a single core, and dividing a problem up too finely can interfere with this.
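As a rough sketch of what that means (illustrative only, not your code): a plain serial loop over indices lets the compiler emit SIMD instructions, so one core already adds several numbers per instruction.
function simd_sum(v)
    s = zero(eltype(v))
    @simd for i in eachindex(v)  # lets the compiler vectorize the loop
        s += v[i]
    end
    s
end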
Frankly, this seems to indicate that the limiting factor may actually be memory allocation. You may not have enough memory to effectively run more than one job at once.
Probably true on my notebook, but in this case I'm using an Ubuntu 20.04.4 LTS server with 64 cores and 252 GB of RAM, running only my Julia program, and I'm still not able to take advantage of parallelization.
In this case it was really easy to optimize: you just need to replace your arrays with StaticArrays.jl versions (using @SArray). Because they are small, fixed in size, and you don't even mutate them, I didn't have to make any other adjustments, and the code went from 1.975806 seconds (19.43 M allocations: 2.316 GiB, 12.19% gc time) to 0.074329 seconds (25 allocations: 11.594 KiB).
module TestModule

using LinearAlgebra
using StaticArrays

if !@isdefined σz
    const 𝑖 = im
    const ket0 = @SArray ComplexF64[1, 0]
    const ket1 = @SArray ComplexF64[0, 1]
    const σx = @SArray ComplexF64[0 1; 1 0]
    const σz = @SArray ComplexF64[1 0; 0 -1]
end

# Runge-Kutta 4th order
# dy/dt = f(t, y), y(t0) = y0
function rungekutta4(t0, y0, t, f::Function)
    h = 1e-3
    n = round(Int64, (t - t0)/h)
    h = (t - t0)/n
    h_2, h_6 = h/2.0, h/6.0
    tn, yn = t0, copy(y0)
    for i = 1:n
        k1 = f(tn, yn)
        k2 = f(tn + h_2, yn + k1*h_2)
        k3 = f(tn + h_2, yn + k2*h_2)
        k4 = f(tn + h, yn + k3*h)
        yn += h_6*(k1 + 2*k2 + 2*k3 + k4)
        tn += h
    end
    yn
end

# Fidelity of quantum states ρ1, ρ2
function fidelity(ρ1, ρ2)
    f = (tr(√(√ρ1 * ρ2 * √ρ1)))^2
    real(f)
end

# Evolves state ρ during time T
# dρ/dt = -i[H, ρ] = f(t, ρ)
function evol(ρ, T)
    H(t) = σz + sin(2π*t)*σx
    f(t, ρ) = -𝑖*(H(t)*ρ - ρ*H(t))
    rungekutta4(0.0, ρ, T, f)
end

# Fidelity at time T
function fid(T)
    ρini = @SArray [0.4 0.0; 0.0 0.6] # initial state
    ρtgt = @SArray [0.6 0.0; 0.0 0.4] # target state
    ρevo = evol(ρini, T)              # evolved state
    fidelity(ρtgt, ρevo)              # ρevo ≈? ρtgt
end

# using PyPlot
using Base.Threads

# Fidelity versus T
function fid_versus_T()
    T = [0.0:0.01:3.0;]
    F = zeros(length(T))
    @threads for (i, t) in collect(enumerate(T))
        F[i] = fid(t)
    end
    m, i = findmax(F)
    println("max = ($(T[i]), $m)")
    # plot(T, F)
    # xlabel("T"); ylabel("fid")
end

end
The parallelization effect is still small, but the use of @SArray was amazing; it is worth rewriting some of my code with @SArray. I will wait a bit to see if someone says more about parallelization before I close the topic. Thanks a lot.
Before, parallelization was difficult because the timing was dominated by memory allocation. Now the runtime is so short that we will likely need low-overhead threads, as provided by Polyester.jl; see the sketch below.
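A minimal sketch using Polyester.jl's @batch macro (assuming the package is installed, and using the fid function from the module above), which has much lower scheduling overhead than @threads:
using Polyester

function fid_versus_T_batch()
    T = [0.0:0.01:3.0;]
    F = zeros(length(T))
    @batch for i in eachindex(T)  # Polyester's low-overhead threaded loop
        F[i] = fid(T[i])
    end
    return T, F
end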
I don't want to be an inconvenience, but I would like to ask three questions: 1) For this optimization to work, do all the arrays have to be static? 2) In some cases I have to make copies of the StaticArrays and modify them, and then it crashes! What should I do in that case? 3) Do SparseArrays provide similar optimizations?