Hi,

I’m working on code that needs to perform many simple arithmetic operations on slices of large arrays.

Ideally I’d like to understand how to speed up this code, and I have explored multiple approaches:

```
using BenchmarkTools

# Single-threaded, SIMD-vectorized loop
function sub_single(x1, x2, res)
    @simd for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end

# Same kernel, parallelized over all indices
function sub_threaded(x1, x2, res)
    Threads.@threads for i ∈ eachindex(res)
        @inbounds res[i] = (x1[i] - x2[i]) * (x1[i] - x2[i])
    end
end

# Broadcast version used in the benchmarks below
sub_array(x1, x2, res) = res .= (x1 .- x2) .^ 2

# Set up data arrays
num_ptl = 400_000
num_chunk = 40_000
N_time = 10_000
Δ = 10

# Big base arrays
x1 = randn(num_ptl, N_time);
x2 = randn(num_ptl, N_time);
y1 = zeros(size(x1));

# Transposed copies via permutedims, which re-arranges memory
x3 = permutedims(x1);
x4 = permutedims(x2);
y3 = zeros(size(x3));

# We work on random selections
p_ix = rand(1:num_ptl, num_chunk);

# Use views for quick access
x1_w = view(x1, p_ix, 1:N_time-Δ);
x2_w = view(x2, p_ix, 1+Δ:N_time);
y1_w = view(y1, p_ix, 1:N_time-Δ);

# Views on transposed arrays
x3_w = view(x3, 1:N_time-Δ, p_ix);
x4_w = view(x4, 1+Δ:N_time, p_ix);
y3_w = view(y3, 1:N_time-Δ, p_ix);

@benchmark sub_single(x1_w, x2_w, y1_w)
@benchmark sub_threaded(x1_w, x2_w, y1_w)
@benchmark sub_array(x1_w, x2_w, y1_w)

# Test if array ordering has an effect - yes, about 10x:
@benchmark sub_single(x3_w, x4_w, y3_w)
@benchmark sub_threaded(x3_w, x4_w, y3_w)
@benchmark sub_array(x3_w, x4_w, y3_w)
```

The benchmarks run about 10x faster when operating on the transposed arrays `x3` and `x4`, since the shifted columns are contiguous in memory. I’m surprised, however, that the threaded version is about 20% slower than the single-threaded loop:
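To illustrate the contiguity point: Julia arrays are column-major, so a view that slices along the first dimension has unit stride, while a view along the second dimension jumps by a whole column per element. A small check with a toy array (`A` below is just for illustration, not part of the benchmark):

```
A = zeros(100, 100);

# Slicing along the first (column) dimension: elements are adjacent in memory
strides(view(A, 1:50, 2))   # (1,)

# Slicing along the second (row) dimension: elements are 100 Float64s apart
strides(view(A, 2, 1:50))   # (100,)
```

This is why the views into `x3`/`x4`, which select ranges along the first dimension, walk memory sequentially, while the views into `x1`/`x2` do not.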

```
julia> @benchmark sub_single(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.594 s …  1.624 s   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.599 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.604 s ± 13.505 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sub_threaded(x3_w, x4_w, y3_w)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.884 s …  1.948 s   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.916 s ± 31.815 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 31.19 KiB, allocs estimate: 289.
```

I’m running this benchmark on a 48-core machine with `Threads.nthreads() = 48`.

Does anyone have thoughts on how to further speed up these array operations (besides moving to GPU)?
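One variant I’m considering is to thread over the columns of the transposed views instead of over a flat `eachindex`, so that each task works through a contiguous column while the inner loop still uses `@simd`. A sketch (`sub_threaded_cols` is just a name I made up; whether it beats the single-threaded loop presumably depends on whether the kernel is memory-bandwidth bound):

```
# Parallelize over columns; each column of the transposed views is
# contiguous in the parent array, so the inner loop walks memory sequentially.
function sub_threaded_cols(x1, x2, res)
    Threads.@threads for j ∈ axes(res, 2)
        @simd for i ∈ axes(res, 1)
            @inbounds res[i, j] = (x1[i, j] - x2[i, j]) * (x1[i, j] - x2[i, j])
        end
    end
end
```

This would be benchmarked on the same transposed views as above, e.g. `@benchmark sub_threaded_cols(x3_w, x4_w, y3_w)`.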