Speeding up operations on large arrays

Hi,
I’m working on code that needs to perform many simple arithmetic operations on slices of large arrays.
I’d like to understand how to speed up this code, and I have explored multiple approaches:

# Single-threaded kernel: element-wise squared difference, with SIMD
function sub_single(x1, x2, res)
    @simd for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end

# Same kernel, parallelized with Threads.@threads
function sub_threaded(x1, x2, res)
    Threads.@threads for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end
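
For comparison, a broadcast version. (A sketch: the exact sub_array definition benchmarked below isn’t shown in the thread, but presumably it looks like this.)

# Presumed sub_array (not shown in the post): the same operation as a fused broadcast
function sub_array(x1, x2, res)
    res .= (x1 .- x2) .^ 2
end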

using BenchmarkTools  # provides @benchmark below

# Set up data arrays
num_ptl = 400_000
num_chunk = 40_000
N_time = 10_000
Δ = 10

# Big base arrays
x1 = randn(num_ptl, N_time);
x2 = randn(num_ptl, N_time);
y1 = zeros(size(x1));

# Transpose arrays with permutedims, which re-arranges memory
x3 = permutedims(x1);
x4 = permutedims(x2);
y3 = zeros(size(x3));

# We work on random selections
p_ix = rand(1:num_ptl, num_chunk);


# Use views for quick access
x1_w = view(x1, p_ix, 1:N_time-Δ);
x2_w = view(x2, p_ix, 1+Δ:N_time);
y1_w = view(y1, p_ix, 1:N_time-Δ);
# Views on transposed arrays
x3_w = view(x3, 1:N_time-Δ, p_ix);
x4_w = view(x4, 1+Δ:N_time, p_ix);
y3_w = view(y3, 1:N_time-Δ, p_ix);

@benchmark sub_single(x1_w, x2_w, y1_w)
@benchmark sub_threaded(x1_w, x2_w, y1_w)
@benchmark sub_array(x1_w, x2_w, y1_w)

"""
    Test if array ordering has an effect
        - Yes, about 10x.
"""

@benchmark sub_single(x3_w, x4_w, y3_w)
@benchmark sub_threaded(x3_w, x4_w, y3_w)
@benchmark sub_array(x3_w, x4_w, y3_w)

The benchmarks run about 10x faster when operating on the transposed arrays x3 and x4, since the shifted columns are contiguous in memory. I’m surprised, however, that the threaded version is about 20% slower than the single-threaded loop:

julia> @benchmark sub_single(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.594 s …   1.624 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.599 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.604 s ± 13.505 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   █        █                                          █  
  █▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.59 s         Histogram: frequency by time        1.62 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sub_threaded(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.884 s …   1.948 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.916 s ± 31.815 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                           █                           █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.88 s         Histogram: frequency by time        1.95 s <

 Memory estimate: 31.19 KiB, allocs estimate: 289.

I’m running this benchmark on a 48-core machine with Threads.nthreads()=48.
Does anyone have thoughts on how to further speed up these array operations (besides moving to GPU)?

Try LoopVectorization.jl and its @tturbo macro.


The best way to speed it up is to not do this — as much as possible, you want to “fuse” your operations into a single loop over your large array that does lots of work per iteration, rather than many separate simple loops that do little work per iteration. (Also pay attention to the order in which you traverse the array.)

The name of the game with operations on large arrays is to do as much work as possible with each element or small chunk of data before moving on to the next element/chunk. The reason for this is that memory is the slowest part of your computer — and the more memory you access, the slower it gets.
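
For example, if you scan over several shifts Δ, something along these lines (a hypothetical sub_fused!, assuming the transposed layout where time runs down the columns) computes every shift in a single traversal of the data:

# Hypothetical fused kernel: one pass over each column serves all shifts in Δs,
# instead of one full traversal of the arrays per Δ.
function sub_fused!(res, x1, x2, Δs)
    @inbounds for j in axes(x1, 2)            # particles (columns)
        for (k, Δ) in enumerate(Δs)           # reuse this column for every shift
            @simd for i in 1:size(x1, 1) - Δ  # time (contiguous rows)
                d = x1[i, j] - x2[i + Δ, j]
                res[k][i, j] = d * d
            end
        end
    end
end

# Usage sketch: one pre-allocated result matrix per Δ
Δs = 1:10
res = [zeros(N_time, num_chunk) for _ in Δs]
sub_fused!(res, view(x3, :, p_ix), view(x4, :, p_ix), Δs)

This way each column of x1 and x2 is loaded from memory once and used for every Δ while it is still in cache.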


Views don’t actually make for quick access; rather, they save memory. With scrambled indices, sequential access is probably a good deal slower.


Right. But I have lots of large arrays and need to calculate the squared difference between the data points; it’s one big loop over the arrays. So, as you suggest, performance is probably bound by memory bandwidth, not by computational speed.
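
As a back-of-envelope check (assuming Float64 throughout): each view covers about num_chunk × (N_time − Δ) = 40_000 × 9_990 ≈ 4.0 × 10^8 elements, i.e. roughly 3.2 GB per array. One call reads two arrays and writes one, so about 9.6 GB of traffic; at ~1.6 s per call that is only ~6 GB/s of effective bandwidth.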

In production, I scan over a range of Δs. It looks like the best optimization strategy in this case is to pre-allocate memory: the allocations take about 5x longer than performing the actual operations over the selected memory region.

Only the p_ix are scrambled. You are right that accessing the arrays sequentially over this scrambled index kills performance. After permuting the data with permutedims, eachindex walks the selected columns in memory order; that’s why the same benchmarks on x3_w and x4_w run about 10x faster than on x1_w and x2_w.
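
A toy illustration of the access order (hypothetical small sizes):

A = reshape(1:12, 3, 4)         # column-major, so columns are contiguous
w = view(A, 1:2, [3, 1])        # contiguous rows, scrambled column order
[w[i] for i in eachindex(w)]    # 7, 8, 1, 2: memory order within each column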

Allocating memory takes about 8 s for arrays of this size; that’s why I prefer to use views. In production the stepping Δ will change, but I can still reuse the pre-allocated result matrices behind y1_w / y3_w. This saves a lot of time when iterating over Δ.
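
Concretely, the Δ loop is something like this sketch (reusing the names from above; only the views are rebuilt each iteration):

# Pay the ~8 s allocation of y3 once, then re-slice it for every Δ
for Δ in 1:10
    x3_w = view(x3, 1:N_time-Δ, p_ix)
    x4_w = view(x4, 1+Δ:N_time, p_ix)
    y3_w = view(y3, 1:N_time-Δ, p_ix)
    sub_single(x3_w, x4_w, y3_w)
    # ... consume y3_w here, before the next iteration overwrites it
end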

Thanks, I didn’t know about that package. It turns out that LoopVectorization.check_args does not accept these views (the scrambled p_ix index makes them non-strided), so I need to pass a copy of the array:

julia> @benchmark sub_tturbo(x3[1:(N_time - Δ), p_ix], x4[(1+Δ):N_time, p_ix], y3[1:(N_time - Δ), p_ix])
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 10.338 s (0.03% GC) to evaluate,
 with a memory estimate of 8.93 GiB, over 28 allocations.
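
For reference, sub_tturbo is not defined in the thread either; presumably it mirrors sub_single with LoopVectorization’s threaded @tturbo macro:

using LoopVectorization

# Presumed sub_tturbo: the same kernel with the threaded @tturbo macro
function sub_tturbo(x1, x2, res)
    @tturbo for i ∈ eachindex(res)
        res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end

Note that the three slices in the call above each allocate a full copy (9_990 × 40_000 Float64 ≈ 3.0 GiB, which is where the 8.93 GiB estimate comes from), so the copies cost far more than the @tturbo loop saves.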