Speeding up operations on large arrays

Hi,
I’m working on code that needs to perform many simple arithmetic operations on slices of large arrays.
I’d like to understand how to speed up this code, and I have explored multiple approaches:

# Single-threaded kernel: element-wise squared difference, with SIMD
function sub_single(x1, x2, res)
    @simd for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end

# Same kernel, parallelized with Threads.@threads
function sub_threaded(x1, x2, res)
    Threads.@threads for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end
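
For comparison, a broadcast version. (A sketch: the exact sub_array definition benchmarked below isn’t shown in the thread, but presumably it looks like this.)

# Presumed sub_array (not shown in the post): the same operation as a fused broadcast
function sub_array(x1, x2, res)
    res .= (x1 .- x2) .^ 2
end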

using BenchmarkTools  # provides @benchmark below

# Set up data arrays
num_ptl = 400_000
num_chunk = 40_000
N_time = 10_000
Δ = 10

# Big base arrays
x1 = randn(num_ptl, N_time);
x2 = randn(num_ptl, N_time);
y1 = zeros(size(x1));

# Transpose arrays with permutedims, which re-arranges memory
x3 = permutedims(x1);
x4 = permutedims(x2);
y3 = zeros(size(x3));

# We work on random selections
p_ix = rand(1:num_ptl, num_chunk);


# Use views for quick access
x1_w = view(x1, p_ix, 1:N_time-Δ);
x2_w = view(x2, p_ix, 1+Δ:N_time);
y1_w = view(y1, p_ix, 1:N_time-Δ);
# Views on transposed arrays
x3_w = view(x3, 1:N_time-Δ, p_ix);
x4_w = view(x4, 1+Δ:N_time, p_ix);
y3_w = view(y3, 1:N_time-Δ, p_ix);

@benchmark sub_single(x1_w, x2_w, y1_w)
@benchmark sub_threaded(x1_w, x2_w, y1_w)
@benchmark sub_array(x1_w, x2_w, y1_w)

"""
    Test if array ordering has an effect
        - Yes, about 10x.
"""

@benchmark sub_single(x3_w, x4_w, y3_w)
@benchmark sub_threaded(x3_w, x4_w, y3_w)
@benchmark sub_array(x3_w, x4_w, y3_w)

The benchmarks run about 10x faster when operating on the transposed arrays x3 and x4, since the shifted columns are contiguous in memory. I’m surprised, however, that the threaded version is about 20% slower than the single-threaded loop:

julia> @benchmark sub_single(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.594 s …   1.624 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.599 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.604 s ± 13.505 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   █        █                                          █  
  █▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.59 s         Histogram: frequency by time        1.62 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sub_threaded(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.884 s …   1.948 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.916 s ± 31.815 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                           █                           █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.88 s         Histogram: frequency by time        1.95 s <

 Memory estimate: 31.19 KiB, allocs estimate: 289.

I’m running this benchmark on a 48-core machine with Threads.nthreads()=48.
Does anyone have thoughts on how to further speed up these array operations (besides moving to GPU)?

Try LoopVectorization.jl and its @tturbo macro.


The best way to speed it up is to not do this — as much as possible, you want to “fuse” your operations into a single loop over your large array that does lots of work per iteration, rather than many separate simple loops that do little work per iteration. (Also pay attention to the order in which you traverse the array.)

The name of the game with operations on large arrays is to do as much work as possible with each element or small chunk of data before moving on to the next element/chunk. The reason for this is that memory is the slowest part of your computer — and the more memory you access, the slower it gets.
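
For example, if you scan over several shifts Δ, something along these lines (a hypothetical sub_fused!, assuming the transposed layout where time runs down the columns) computes every shift in a single traversal of the data:

# Hypothetical fused kernel: one pass over each column serves all shifts in Δs,
# instead of one full traversal of the arrays per Δ.
function sub_fused!(res, x1, x2, Δs)
    @inbounds for j in axes(x1, 2)            # particles (columns)
        for (k, Δ) in enumerate(Δs)           # reuse this column for every shift
            @simd for i in 1:size(x1, 1) - Δ  # time (contiguous rows)
                d = x1[i, j] - x2[i + Δ, j]
                res[k][i, j] = d * d
            end
        end
    end
end

# Usage sketch: one pre-allocated result matrix per Δ
Δs = 1:10
res = [zeros(N_time, num_chunk) for _ in Δs]
sub_fused!(res, view(x3, :, p_ix), view(x4, :, p_ix), Δs)

This way each column of x1 and x2 is loaded from memory once and used for every Δ while it is still in cache.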


Views don’t actually make for quick access; rather, they save memory. With scrambled indices, sequential access is probably a good deal slower.


Right. But I have lots of large arrays and need to calculate the squared difference between the data points; it’s one big loop over the arrays. So, as you suggest, performance is probably bound by memory bandwidth, not by computational speed.
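
As a back-of-envelope check (assuming Float64 throughout): each view covers about num_chunk × (N_time − Δ) = 40_000 × 9_990 ≈ 4.0 × 10^8 elements, i.e. roughly 3.2 GB per array. One call reads two arrays and writes one, so about 9.6 GB of traffic; at ~1.6 s per call that is only ~6 GB/s of effective bandwidth.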

In production, I scan over a range of Δs. It looks like the best optimization strategy in this case is to pre-allocate memory: the allocations take about 5x longer than performing the actual operations over the selected memory region.

Only the p_ix are scrambled. You are right that accessing the arrays sequentially over this scrambled index kills performance. After permuting the data with permutedims, eachindex walks the selected columns in memory order; that’s why the same benchmarks on x3_w and x4_w run about 10x faster than on x1_w and x2_w.
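
A toy illustration of the access order (hypothetical small sizes):

A = reshape(1:12, 3, 4)         # column-major, so columns are contiguous
w = view(A, 1:2, [3, 1])        # contiguous rows, scrambled column order
[w[i] for i in eachindex(w)]    # 7, 8, 1, 2: memory order within each column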

Allocating memory takes about 8 s for arrays of this size; that’s why I prefer to use views. In production the stepping Δ will change, but I can still reuse the pre-allocated result matrices behind y1_w / y3_w. This saves a lot of time when iterating over Δ.
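
Concretely, the Δ loop is something like this sketch (reusing the names from above; only the views are rebuilt each iteration):

# Pay the ~8 s allocation of y3 once, then re-slice it for every Δ
for Δ in 1:10
    x3_w = view(x3, 1:N_time-Δ, p_ix)
    x4_w = view(x4, 1+Δ:N_time, p_ix)
    y3_w = view(y3, 1:N_time-Δ, p_ix)
    sub_single(x3_w, x4_w, y3_w)
    # ... consume y3_w here, before the next iteration overwrites it
end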

Thanks, I didn’t know about that package. It turns out that LoopVectorization.check_args does not accept these views (the scrambled p_ix index makes them non-strided), so I need to pass a copy of the array:

julia> @benchmark sub_tturbo(x3[1:(N_time - Δ), p_ix], x4[(1+Δ):N_time, p_ix], y3[1:(N_time - Δ), p_ix])
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 10.338 s (0.03% GC) to evaluate,
 with a memory estimate of 8.93 GiB, over 28 allocations.
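
For reference, sub_tturbo is not defined in the thread either; presumably it mirrors sub_single with LoopVectorization’s threaded @tturbo macro:

using LoopVectorization

# Presumed sub_tturbo: the same kernel with the threaded @tturbo macro
function sub_tturbo(x1, x2, res)
    @tturbo for i ∈ eachindex(res)
        res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end

Note that the three slices in the call above each allocate a full copy (9_990 × 40_000 Float64 ≈ 3.0 GiB, which is where the 8.93 GiB estimate comes from), so the copies cost far more than the @tturbo loop saves.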