Hello everybody! My actual code takes 16 hours to run, so I went to learn about parallelism. I started with @task and the execution time went down to 10 hours. Very good! Since I have a server with 64 cores, I thought about using @threads with julia --threads 20 prog.jl, but execution took even longer, as in the simple example below. What is wrong? I'm not interested in the result of the sum.
# Command Line:
# $ time julia --threads 3 test.jl
using Base.Threads
const N = Int(1e8)
function test1()
# time taken: 789.3 ± 2.8 ms
x = 0
v = [i for i=1:N]
for v in v
x += v
end
end
function test2()
# time taken: 5407.30 ± 0.13 ms
x = 0
v = [i for i=1:N]
@threads for v in v
x += v
end
end
# test1()
# test2()
function my_short_real_code()
T = [range(0.0, 2.0, 20);]
data = zeros(0, 2)
@threads for t in T
params = some_function(t)
opt = optimize(params)
data = vcat(data, [t opt.minimum])
end
data # need to sort by first column
end
I think you will need a more representative example (thanks for trying to boil it down to a minimal example, by the way!). The problem with this example is that
@threads for v in v
x += v
end
This code has a race condition (every thread tries to read and write the same variable at the same time), which makes the result unreliable.
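Just to illustrate, a minimal sketch that makes the increment safe uses a Threads.Atomic accumulator. This fixes correctness but not speed, because all threads still contend on the same counter:
using Base.Threads

function test2_atomic()
    x = Atomic{Int}(0)       # shared counter with atomic updates
    v = [i for i = 1:10^8]
    @threads for vi in v
        atomic_add!(x, vi)   # safe, but every thread serializes on x
    end
    x[]                      # read the final value
end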
Another issue is that the actual operation may not be very costly, so the cost of spawning the threads may be larger than the computation itself (the same applies to the third function, where all the threads write to data).
Also, if the operations in some_function(t) and optimize(params) are limited by your computer's memory, then using more threads will make things worse, because the threads will be competing for the same memory.
Your tests look very different from my_short_real_code. In particular, it is likely that most of the time during the tests is spent on memory allocation. Additionally, neither the tests nor my_short_real_code is naturally parallel as written, since each tries to update a single value, either x or data.
The overall pattern of your code fits well into a map-reduce paradigm. Essentially, the solution to your problem involves separating the "map" step from the "reduce" step: preferably, each thread should be able to work independently of the others, and as the tasks complete, we "reduce" them by aggregating their individual results together. A hand-rolled sketch of the pattern is shown below.
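A minimal sketch of that separation (the function name is illustrative): split the data into chunks, let each task sum its own chunk with no shared state, then combine the partial sums serially.
using Base.Threads

function mapreduce_sum(v; ntasks = nthreads())
    # "map": each task sums its own chunk independently
    chunks = Iterators.partition(v, cld(length(v), ntasks))
    tasks = [Threads.@spawn sum(chunk) for chunk in chunks]
    # "reduce": combine the independent partial sums serially
    return sum(fetch, tasks)
end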
For the tests, we can use an off-the-shelf solution from Folds.jl.
Below, basesize controls the size of the individual partitions of v and thus how many threads are effectively used. The optimum happens to occur somewhere between 8 and 16 threads; threads have overhead, so adding more of them is often not better.
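For context, the benchmarks below assume a setup along these lines (the exact vector is my reconstruction):
using BenchmarkTools
using Folds

v = collect(1:10^8)  # the vector summed in the benchmarks below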
julia> @benchmark sum(v)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  53.697 ms … 72.191 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.472 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.054 ms ±  5.438 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  53.7 ms        Histogram: frequency by time          72 ms <
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> for nthreads in 2 .^ (0:6)
           @info "Number of Threads" nthreads
           display(@benchmark(Folds.sum($v; basesize=length($v) ÷ $nthreads)))
       end
┌ Info: Number of Threads
└   nthreads = 1
BenchmarkTools.Trial: 74 samples with 1 evaluation.
 Range (min … max):  64.401 ms … 80.757 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.238 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   67.911 ms ±  4.422 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  64.4 ms        Histogram: frequency by time        76.8 ms <
 Memory estimate: 16 bytes, allocs estimate: 1.
┌ Info: Number of Threads
└   nthreads = 2
BenchmarkTools.Trial: 137 samples with 1 evaluation.
 Range (min … max):  33.582 ms … 51.062 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.462 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.700 ms ±  4.040 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  33.6 ms        Histogram: frequency by time        48.6 ms <
 Memory estimate: 720 bytes, allocs estimate: 9.
┌ Info: Number of Threads
└   nthreads = 4
BenchmarkTools.Trial: 196 samples with 1 evaluation.
 Range (min … max):  19.430 ms … 89.707 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.721 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.486 ms ±  7.196 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  19.4 ms        Histogram: frequency by time        60.6 ms <
 Memory estimate: 2.08 KiB, allocs estimate: 25.
┌ Info: Number of Threads
└   nthreads = 8
BenchmarkTools.Trial: 260 samples with 1 evaluation.
 Range (min … max):  13.052 ms … 90.392 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.805 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.236 ms ±  8.517 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  13.1 ms     Histogram: log(frequency) by time      50.3 ms <
 Memory estimate: 4.83 KiB, allocs estimate: 57.
┌ Info: Number of Threads
└   nthreads = 16
BenchmarkTools.Trial: 231 samples with 1 evaluation.
 Range (min … max):  11.369 ms … 104.828 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.930 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.624 ms ±  13.880 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  11.4 ms     Histogram: log(frequency) by time      80.9 ms <
 Memory estimate: 10.39 KiB, allocs estimate: 123.
┌ Info: Number of Threads
└   nthreads = 32
BenchmarkTools.Trial: 154 samples with 1 evaluation.
 Range (min … max):  11.710 ms … 124.747 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.366 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.666 ms ±  18.949 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  11.7 ms        Histogram: frequency by time        82.9 ms <
 Memory estimate: 21.55 KiB, allocs estimate: 256.
┌ Info: Number of Threads
└   nthreads = 64
BenchmarkTools.Trial: 96 samples with 1 evaluation.
 Range (min … max):  12.607 ms … 168.497 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.065 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.376 ms ±  34.514 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
  12.6 ms        Histogram: frequency by time         160 ms <
 Memory estimate: 43.98 KiB, allocs estimate: 525.
Try running your tests with different numbers of threads yourself to see how well they scale.
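For reference, the thread count is fixed when Julia starts, for example:
$ julia --threads 16 test.jl
and you can check it from within a session with Threads.nthreads().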
For my_short_real_code I would consider reorganizing as follows:
function my_short_real_code()
T = [range(0.0, 2.0, 20);]
minima = zeros(length(T))
@threads for (i, t) in collect(enumerate(T))
params = some_function(t)
opt = optimize(params)
minima[i] = opt.minimum
end
return [T minima] # already sorted
end
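Two details worth noting in this version: the collect is needed because @threads requires an indexable iteration space, which the bare enumerate(T) iterator does not provide, and writing each result into its own slot minima[i] is what removes the data race. Since T is constructed in increasing order, the output needs no sorting afterwards.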
using BenchmarkTools
using ThreadsX
function test1()
N = Int(1e8)
reduce(+, 1:N)
end
function test2()
N = Int(1e8)
ThreadsX.reduce(+, 1:N)
end
@btime(test1())
# result
683.300 μs (0 allocations: 0 bytes)
5000000050000000
@btime(test2())
# result
152.400 μs (2166 allocations: 172.11 KiB)
5000000050000000
I made a different, simpler example. Anyway, shouldn't parallelization work even in this simpler example? From my way of thinking, it seems I still have a lot to learn. I will test your suggestion for my_short_real_code in the real code and post the result.
Threads are not a free lunch. It takes time to start a thread, and it takes time to divide the problem up between the threads. What one has to balance is the overhead of parallelization against its benefit.
In the simple example, note that there is an underlying level of parallelism in the processor's SIMD instructions: modern processors can process multiple numbers at once on a single core, and dividing a problem up too finely can interfere with this.
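As a rough sketch of what that means (illustrative only, not your code): a plain serial loop over indices lets the compiler emit SIMD instructions, so one core already adds several numbers per instruction.
function simd_sum(v)
    s = zero(eltype(v))
    @simd for i in eachindex(v)  # lets the compiler vectorize the loop
        s += v[i]
    end
    s
end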
Frankly, this seems to indicate that the limiting factor may actually be memory allocation. You may not have enough memory to effectively run more than one job at once.
Probably true on my notebook, but in this case I'm using an Ubuntu 20.04.4 LTS server with 64 cores and 252 GB of RAM, running only my Julia program, and I'm still not able to take advantage of parallelization.
In this case it was really easy to optimize: you just need to replace your arrays with StaticArrays.jl versions (using @SArray). Because they are small, fixed in size, and you don't even mutate them, I didn't have to make any other adjustments, and the code went from 1.975806 seconds (19.43 M allocations: 2.316 GiB, 12.19% gc time) to 0.074329 seconds (25 allocations: 11.594 KiB).
module TestModule

using LinearAlgebra
using StaticArrays

if !@isdefined σz
    const 𝑖 = im
    const ket0 = @SArray ComplexF64[1, 0]
    const ket1 = @SArray ComplexF64[0, 1]
    const σx = @SArray ComplexF64[0 1; 1 0]
    const σz = @SArray ComplexF64[1 0; 0 -1]
end

# Runge-Kutta 4th order
# dy/dt = f(t, y), y(t0) = y0
function rungekutta4(t0, y0, t, f::Function)
    h = 1e-3
    n = round(Int64, (t - t0)/h)
    h = (t - t0)/n
    h_2, h_6 = h/2.0, h/6.0
    tn, yn = t0, copy(y0)
    for i = 1:n
        k1 = f(tn, yn)
        k2 = f(tn + h_2, yn + k1*h_2)
        k3 = f(tn + h_2, yn + k2*h_2)
        k4 = f(tn + h, yn + k3*h)
        yn += h_6*(k1 + 2*k2 + 2*k3 + k4)
        tn += h
    end
    yn
end

# Fidelity of quantum states ρ1, ρ2
function fidelity(ρ1, ρ2)
    f = (tr(√(√ρ1 * ρ2 * √ρ1)))^2
    real(f)
end

# Evolves state ρ during time T
# dρ/dt = -i[H, ρ] = f(t, ρ)
function evol(ρ, T)
    H(t) = σz + sin(2π*t)*σx
    f(t, ρ) = -𝑖*(H(t)*ρ - ρ*H(t))
    rungekutta4(0.0, ρ, T, f)
end

# Fidelity at time T
function fid(T)
    ρini = @SArray [0.4 0.0; 0.0 0.6] # initial state
    ρtgt = @SArray [0.6 0.0; 0.0 0.4] # target state
    ρevo = evol(ρini, T)              # evolved state
    fidelity(ρtgt, ρevo)              # ρevo ≈? ρtgt
end

# using PyPlot
using Base.Threads

# Fidelity versus T
function fid_versus_T()
    T = [0.0:0.01:3.0;]
    F = zeros(length(T))
    @threads for (i, t) in collect(enumerate(T))
        F[i] = fid(t)
    end
    m, i = findmax(F)
    println("max = ($(T[i]), $m)")
    # plot(T, F)
    # xlabel("T"); ylabel("fid")
end

end
The parallelization effect is still small, but the use of @SArray was amazing; it is worth rewriting some of my code with @SArray. I will wait a bit to see if someone says more about parallelization before I close the topic. Thanks a lot.
Before, parallelization was difficult because the timing was dominated by memory allocation. Now the runtime is so short that we will likely need low-overhead threads, as provided by Polyester.jl; see the sketch below.
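A minimal sketch using Polyester.jl's @batch macro (assuming the package is installed, and using the fid function from the module above), which has much lower scheduling overhead than @threads:
using Polyester

function fid_versus_T_batch()
    T = [0.0:0.01:3.0;]
    F = zeros(length(T))
    @batch for i in eachindex(T)  # Polyester's low-overhead threaded loop
        F[i] = fid(T[i])
    end
    return T, F
end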
I don't want to be an inconvenience, but I would like to ask three questions: 1) For this optimization to work, do all the arrays have to be static? 2) In some cases I have to make copies of the StaticArrays and modify them, and then it crashes! What should I do in that case? 3) Do SparseArrays provide similar optimizations?