Hi,
I’m working on code that needs to do many simple arithmetic operations on slices of large arrays. I’d like to understand how to speed this up, and I’ve explored several approaches:
function sub_single(x1, x2, res)
    @simd for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end

function sub_threaded(x1, x2, res)
    Threads.@threads for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end
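The sub_array benchmarked below isn’t shown above; a minimal broadcast-based sketch of such a function, assuming a fused in-place broadcast, would be:

function sub_array(x1, x2, res)
    # Fused in-place broadcast: no temporary arrays are allocated.
    res .= (x1 .- x2) .^ 2
    return res
end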
using BenchmarkTools

# Set up data arrays
num_ptl = 400_000
num_chunk = 40_000
N_time = 10_000
Δ = 10
# Big base arrays
x1 = randn(num_ptl, N_time);
x2 = randn(num_ptl, N_time);
y1 = zeros(size(x1));
# Transpose arrays with permutedims, which re-arranges the data in memory
x3 = permutedims(x1);
x4 = permutedims(x2);
y3 = zeros(size(x3));
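Julia arrays are column-major, so in x1 a single particle’s time series is strided across memory, while in the transposed x3 it is contiguous. A quick check of the strides (in elements, for the sizes above):

strides(x1)  # (1, 400000): consecutive time steps of one particle are 400_000 elements apart
strides(x3)  # (1, 10000): consecutive time steps of one particle are adjacent in memory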
# We work on random selections
p_ix = rand(1:num_ptl, num_chunk);
# Use views for quick access
x1_w = view(x1, p_ix, 1:N_time-Δ);
x2_w = view(x2, p_ix, 1+Δ:N_time);
y1_w = view(y1, p_ix, 1:N_time-Δ);
# Views on transposed arrays
x3_w = view(x3, 1:N_time-Δ, p_ix);
x4_w = view(x4, 1+Δ:N_time, p_ix);
y3_w = view(y3, 1:N_time-Δ, p_ix);
@benchmark sub_single(x1_w, x2_w, y1_w)
@benchmark sub_threaded(x1_w, x2_w, y1_w)
@benchmark sub_array(x1_w, x2_w, y1_w)
"""
Test if array ordering has an effect
- Yes, about 10x.
"""
@benchmark sub_single(x3_w, x4_w, y3_w)
@benchmark sub_threaded(x3_w, x4_w, y3_w)
@benchmark sub_array(x3_w, x4_w, y3_w)
The benchmarks run about 10x faster when operating on the transposed arrays x3 and x4, since the shifted columns are contiguous in memory. I’m surprised, however, that the threaded version is about 20% slower than the single-threaded loop:
julia> @benchmark sub_single(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.594 s … 1.624 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.599 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.604 s ± 13.505 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █
█▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.59 s Histogram: frequency by time 1.62 s <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sub_threaded(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 1.884 s … 1.948 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.916 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.916 s ± 31.815 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.88 s Histogram: frequency by time 1.95 s <
Memory estimate: 31.19 KiB, allocs estimate: 289.
I’m running this benchmark on a 48-core machine with Threads.nthreads() = 48.
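For reference, a variant that threads over columns of the transposed views, so that each task sweeps through contiguous memory with a @simd inner loop, would look roughly like this (a sketch with a made-up name, sub_threaded_cols; it is not one of the timings shown above):

function sub_threaded_cols(x1, x2, res)
    # One column per iteration: in the transposed layout each column is contiguous.
    Threads.@threads for j in axes(res, 2)
        @simd for i in axes(res, 1)
            @inbounds res[i, j] = (x1[i, j] - x2[i, j]) * (x1[i, j] - x2[i, j])
        end
    end
end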
Does anyone have thoughts on how to further speed up these array operations (besides moving to GPU)?