Reduce memory allocated in array view and in place sum

Hi all, new Julia user here. I want to significantly increase the performance of (and reduce the memory allocated by) a function similar to the MWE below. Basically I need to access the elements of a vector (indexed with a matrix of indices), multiply those elements elementwise by a coefficient matrix, and then add up the columns of the result.

using BenchmarkTools

n = 300
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

@benchmark viewmultsum!(result, coefficients, state, randidxs)
BenchmarkTools.Trial: 5263 samples with 1 evaluation.
 Range (min … max):  581.567 ΞΌs …   9.461 ms  β”Š GC (min … max):  0.00% … 90.38%
 Time  (median):     760.413 ΞΌs               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   941.917 ΞΌs Β± 660.843 ΞΌs  β”Š GC (mean Β± Οƒ):  17.49% Β± 18.97%

  β–ƒβ–‚β–…β–ˆβ–†β–…β–ƒβ–‚β–                                            ▁▁ ▁     ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–„β–ƒβ–„β–β–ƒβ–ƒβ–β–ƒβ–ƒβ–β–β–β–β–ƒβ–β–β–ƒβ–ƒβ–ƒβ–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–† β–ˆ
  582 ΞΌs        Histogram: log(frequency) by time       3.53 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 2.

I’ll need to call a function similar to this several thousand times while integrating a very large system of ODEs (n > 1e4), so it is performance critical. I thought using @views and summing in place with sum! would mean that I don’t have to allocate much memory. I only have 2 allocations in the MWE, but they seem quite large and I don’t know exactly where they come from.

I know accessing the elements of state in a random manner is not ideal for cache performance, but in the real ODE system there is no structured way I can access its elements. In short, the poor cache performance may be unavoidable.

Is there any way I can reduce the allocations in the MWE? Or is the best bet for performance improvement to look at parallelization using hardware like GPUs? Any other performance tips/pointers are much appreciated.

Thanks!

@views only prevents allocations from the indexing; the .* still creates a temporary array that’s passed to sum!. That’s your allocation (there are two, but it’s just one array: one for the array header, the other for the 2.75 MiB (300^2 × 4 × 8 bytes) of data).
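
An easy way to see this for yourself is to materialize the temporary and measure it:

julia> tmp = @views coefficients .* state[randidxs];

julia> sizeof(tmp)  # 300^2 rows × 4 columns × 8 bytes per Float64
2880000

That is 2,880,000 bytes ≈ 2.75 MiB, matching the memory estimate above.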

You’ll probably see better performance if you just write out the for loop(s) yourself.
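
A minimal sketch of such a loop (the replies below develop this much further, with index validation, threading, and SIMD):

function loopsum!(result, coefficients, state, randidxs)
    # Zero the output, then accumulate column by column in memory order.
    fill!(result, zero(eltype(result)))
    for j in axes(coefficients, 2), i in axes(coefficients, 1)
        result[i] += coefficients[i, j] * state[randidxs[i, j]]
    end
    return result
end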

2 Likes

You can use MappedArrays.jl to remove the allocations without having to write the loop yourself:

using MappedArrays

function viewmultsummapped!(result, coefficients, state, randidxs)
    @views sum!(result, mappedarray(*, coefficients, state[randidxs]))
end;

@benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 5860 samples with 1 evaluation.
 Range (min … max):  814.573 ΞΌs …  1.528 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     834.977 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   845.770 ΞΌs Β± 34.106 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒβ–„β–†β–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–…β–…β–„β–„β–„β–ƒβ–ƒβ–β–β–‚β–β–β–    ▁ ▁        ▁  ▁                 β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–†β–†β–„β–„β–…β–‡β–…β–… β–ˆ
  815 ΞΌs        Histogram: log(frequency) by time       967 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

For comparison, the minimum time when I ran your code was 1.115 ms, so I got a speedup of about 27%.

5 Likes

As you suggest, a GPU can help at these sizes. Your original code works as-is; you only need to move the data to the device. (MappedArrays won’t work, however.) Here are some benchmarks on my 2017 laptop with an NVIDIA GeForce GTX 1050. The GPU gives a 2x speedup over the MappedArray version.

using BenchmarkTools
using CUDA
using MappedArrays

n = 300
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

cucoefficients = CuArray(coefficients);
custate = CuArray(state);
curandidxs = CuArray(randidxs);
curesult = CuArray(result);

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

function viewmultsummapped!(result, coefficients, state, randidxs)
    @views sum!(result, mappedarray(*, coefficients, state[randidxs]))
end;
julia> @benchmark viewmultsum!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 5876 samples with 1 evaluation.
 Range (min … max):  662.337 ΞΌs …   2.906 ms  β”Š GC (min … max): 0.00% … 27.90%
 Time  (median):     705.030 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   842.699 ΞΌs Β± 405.666 ΞΌs  β”Š GC (mean Β± Οƒ):  6.97% Β± 11.42%

  β–ˆβ–‡β–†β–…β–„β–ƒβ–                      ▁ ▁  ▁▁▁                      ▁  ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–…β–ƒβ–„β–„β–…β–β–β–ƒβ–β–β–β–ƒβ–β–β–ƒβ–β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–„β–ƒβ–ƒβ–β–„β–„β–β–ƒβ–ƒβ–β–β–β–β–β–β–ˆβ–‡β–ˆβ–ˆ β–ˆ
  662 ΞΌs        Histogram: log(frequency) by time       2.58 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 2.

julia> @benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  438.168 ΞΌs …  1.013 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     456.494 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   472.989 ΞΌs Β± 39.027 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–„β–ˆβ–…β–ƒ                                                        
  β–β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–„β–ƒβ–‚β–‚β–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  438 ΞΌs          Histogram: frequency by time          613 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark viewmultsum!($curesult, $cucoefficients, $custate, $curandidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  219.703 ΞΌs …  14.879 ms  β”Š GC (min … max): 0.00% … 35.75%
 Time  (median):     262.649 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   273.617 ΞΌs Β± 370.255 ΞΌs  β”Š GC (mean Β± Οƒ):  1.31% Β±  0.95%

                              β–β–„β–‡β–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–‚β–β–β–β–β–β–β– ▁              β–‚
  β–„β–β–ƒβ–ƒβ–β–β–…β–„β–„β–ƒβ–…β–†β–…β–…β–…β–‡β–†β–‡β–ˆβ–‡β–‡β–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–‡β–ˆβ–‡β–‡β–†β–‡ β–ˆ
  220 ΞΌs        Histogram: log(frequency) by time        300 ΞΌs <

 Memory estimate: 11.92 KiB, allocs estimate: 213.

(I switched my laptop from battery saver to performance mode for these benchmarks, that’s why the CPU versions are twice as fast as in my previous post.)

1 Like

If you can guarantee that coefficients and state[randidxs] have the same shapes each time, you could make another preallocated buffer like result for the intermediate array.

# bad quick way to do it, should specify shape and type
intermediate = coefficients .* state[randidxs];
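
If you’d rather not compute the product once just to size the buffer, a cleaner preallocation (using Base’s similar, which copies the shape and element type but not the contents) would be:

intermediate = similar(coefficients);  # same axes and eltype, contents uninitialized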

function viewmultsum2!(result, coefficients, state, randidxs, intermediate)
    intermediate .= coefficients .* @view(state[randidxs])
    sum!(result, intermediate)
end;

@benchmark viewmultsum!($result, $coefficients, $state, $randidxs)
@benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
@benchmark viewmultsum2!($result, $coefficients, $state, $randidxs, $intermediate)
Benchmark results:
julia> @benchmark viewmultsum!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 3690 samples with 1 evaluation.
 Range (min … max):  404.610 ΞΌs … 60.644 ms  β”Š GC (min … max):  0.00% …  0.00%
 Time  (median):     485.685 ΞΌs              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):     1.322 ms Β±  5.786 ms  β”Š GC (mean Β± Οƒ):  11.65% Β± 13.25%

  β–ˆβ–ƒβ–‚                                                           
  β–ˆβ–ˆβ–ˆβ–‡β–„β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–† β–ˆ
  405 ΞΌs        Histogram: log(frequency) by time        49 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 2.

julia> @benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 7512 samples with 1 evaluation.
 Range (min … max):  290.310 ΞΌs … 53.160 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     314.765 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   661.857 ΞΌs Β±  4.113 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–†β–ˆβ–‡β–ƒ                                                         
  β–…β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚ β–ƒ
  290 ΞΌs          Histogram: frequency by time          719 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark viewmultsum2!($result, $coefficients, $state, $randidxs, $intermediate)
BenchmarkTools.Trial: 6019 samples with 1 evaluation.
 Range (min … max):  352.540 ΞΌs … 53.508 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     393.821 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   826.631 ΞΌs Β±  4.602 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–β–†β–ˆβ–…β–ƒ                                                        
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–β–‚ β–ƒ
  353 ΞΌs          Histogram: frequency by time          987 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I would hazard a guess that the mapped-array method is more efficient than the intermediate-buffer method because it does fewer array element assignments. Neither allocates, and all three give the same result.

1 Like

I would recommend at least trying a loop implementation for reference.

You can indeed beat MappedArrays by writing out the loop, but only if you use @inbounds (at least I couldn’t do it without @inbounds). Of course, when using @inbounds you should always check the validity of the indices in some other way within the same function.

In your case, you need to be extra careful not to apply @inbounds to the indexing of state, since you cannot prove that an arbitrary element of randidxs is a valid index into state using information available within the function.
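
(If randidxs is fixed, one cheap up-front check along these lines would justify @inbounds on state too, at the cost of a full pass over the indices:

all(in(eachindex(state)), randidxs) || error("out of bounds indices!")

The function below instead simply leaves the state[r] access bounds-checked.)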

function loopmultsum!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    for j in js
        for i in is
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            # The first column initializes result[i]; later columns accumulate via muladd.
            @inbounds result[i] = (j == first(js)) ? c * s : muladd(c, s, result[i])
        end
    end
end;


@benchmark loopmultsum!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  300.811 ΞΌs … 618.805 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     306.791 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   314.838 ΞΌs Β±  22.994 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–β–†β–ˆβ–‡β–…β–„β–ƒβ–‚β–β–‚β–ƒβ–‚β–         ▁▂▁▁                                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–†β–‡β–‡β–‡β–‡β–†β–†β–…β–…β–†β–‡β–‡β–†β–†β–†β–…β–…β–„β–…β–…β–„β–…β–…β–„β–…β–…β–„β–„β–…β–… β–ˆ
  301 ΞΌs        Histogram: log(frequency) by time        418 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This benchmark is from the same machine as those in my previous post. The numbers split the difference between MappedArrays and CUDA.

1 Like

Loops with multi-threading:

julia> function loopmultsum2!(result, coefficients, state, randidxs)
           is, js = axes(coefficients)
           if (is != eachindex(result)) || ((is, js) != axes(randidxs))
               error("mismatched array sizes")
           end
           # This bounds-check takes 200ΞΌs, allows safe @inbounds loop below to be just 80ΞΌs
           # all(in(eachindex(state)), extrema(randidxs)) || error("out of bounds indices!")  # [Edit, comment was wrong before]
           Threads.@threads for i in is
               acc = zero(eltype(result))
               for j in js
                   @inbounds c, r = coefficients[i, j], randidxs[i, j]
                   acc += c * state[r]
               end
               @inbounds result[i] = acc  
            end
       end;

julia> @btime loopmultsum2!($result, $coefficients, $state, $randidxs);
  min 115.916 ΞΌs, mean 201.250 ΞΌs (22 allocations, 2.22 KiB)

# Compare earlier loops, and original:
julia> @btime loopmultsum!($result, $coefficients, $state, $randidxs);
  min 345.042 ΞΌs, mean 386.605 ΞΌs (0 allocations)

julia> @btime viewmultsum!($result, $coefficients, $state, $randidxs);
  min 494.833 ΞΌs, mean 852.841 ΞΌs (3 allocations, 2.75 MiB)

Removing the bounds check in state[r] brings this to min 80 μs, and adding LoopVectorization.@tturbo further reduces it to min 46.500 μs, mean 81.436 μs (0 allocations). If you know what made randidxs, then perhaps these are safe.
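
For reference, a sketch of what the @tturbo variant might look like (assuming LoopVectorization.jl handles the indirect state[randidxs[i, j]] load; the function name is illustrative):

using LoopVectorization

function loopmultsumturbo!(result, coefficients, state, randidxs)
    # @tturbo threads and vectorizes the loop nest and elides all bounds
    # checks, so the caller must guarantee all(in(eachindex(state)), randidxs).
    @tturbo for i in axes(coefficients, 1)
        acc = zero(eltype(result))
        for j in axes(coefficients, 2)
            acc += coefficients[i, j] * state[randidxs[i, j]]
        end
        result[i] = acc
    end
end;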

2 Likes

Thanks for the help, everyone! This is exactly what I needed.

1 Like

Even with multithreading, it’s faster to iterate in memory order, i.e., let the loop over i be the inner loop, even though it requires more reads and writes to result. This is with julia --threads=4:

function loopmultsumthreads!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    for j in js
        Threads.@threads for i in is
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            @inbounds result[i] = (j == first(js)) ? c * s : muladd(c, s, result[i])
        end
    end
end;
julia> @benchmark loopmultsumthreads!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  113.197 ΞΌs …  3.923 ms  β”Š GC (min … max): 0.00% … 89.74%
 Time  (median):     125.924 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   128.471 ΞΌs Β± 53.589 ΞΌs  β”Š GC (mean Β± Οƒ):  0.54% Β±  1.27%
 [...]
 Memory estimate: 9.69 KiB, allocs estimate: 98.

julia> # @mcabbott's version
       @benchmark loopmultsum2!($result, $coefficients, $state, $randidxs) 
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  139.301 ΞΌs … 420.524 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     149.647 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   152.976 ΞΌs Β±  16.754 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%
 [...]
 Memory estimate: 2.34 KiB, allocs estimate: 24.

And to anyone worried about the branch in the inner loop in my versions, don’t worry, the compiler rewrites this and pulls the branch outside the loop exactly the way you’re itching to do by hand.
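
If you want to check what the compiler did, inspecting the single-threaded loopmultsum! from the earlier reply is easiest (Threads.@threads moves the loop body into a closure, which makes the IR harder to read):

using InteractiveUtils  # provides @code_llvm; loaded by default in the REPL
@code_llvm debuginfo=:none loopmultsum!(result, coefficients, state, randidxs)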

(Btw., @mcabbott, your commented-out bounds check should use eachindex(state), not eachindex(result).)

Edit: The difference is not just iteration order, but whether the accumulator is initialized to zero or directly to its first value. Here’s a slight modification of loopmultsum2! that comes closer to loopmultsumthreads!:

function loopmultsum3!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    j1 = first(js)
    jrest = @view(js[(begin + 1):end])
    Threads.@threads for i in is
        @inbounds c, r = coefficients[i, j1], randidxs[i, j1]
        s = state[r]
        acc = c * s
        for j in jrest
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            acc = muladd(c, s, acc)
        end
        @inbounds result[i] = acc
    end
end;
julia> @benchmark loopmultsum3!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  120.678 ΞΌs … 629.706 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     136.230 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   141.047 ΞΌs Β±  21.892 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%
 [...]
 Memory estimate: 2.41 KiB, allocs estimate: 24.

1 Like

I certainly don’t mean to claim mine is the last word. Playing with memory order would be one more avenue, e.g. transposing coefficients may or may not help.
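
For the curious, a sketch of the transposed-layout idea (storing the data as 4 × n^2 so that each i’s coefficients and indices are contiguous; the names are illustrative and I haven’t benchmarked this):

coefficientsT = permutedims(coefficients);
randidxsT = permutedims(randidxs);

function loopmultsumtransposed!(result, coefficientsT, state, randidxsT)
    js, is = axes(coefficientsT)
    Threads.@threads for i in is
        acc = zero(eltype(result))
        for j in js
            # Plain indexing here, so the state access stays bounds-checked.
            acc = muladd(coefficientsT[j, i], state[randidxsT[j, i]], acc)
        end
        result[i] = acc
    end
end;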

However, on my machine, this loopmultsumthreads! is 30% slower. And it gives an error with @tturbo on the outer loop (and no benefit on inner?). Perhaps I’ve been conditioned to write code simple & branch-free enough that LoopVectorization likes it.

(Good point about the bounds check, fixed above now.)

1 Like

That’s interesting. Is it an architecture thing where your machine is less sensitive to memory order and more sensitive to optimal thread utilization? I’m on Intel x86, specifically i7-7700HQ.

Or does the branch thing not get optimized on your end (no idea why that would be)? What if you try doing it by hand like this (makes no difference either way for me):

function loopmultsumthreads2!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    j1 = first(js)
    Threads.@threads for i in is
        @inbounds c, r = coefficients[i, j1], randidxs[i, j1]
        s = state[r]
        @inbounds result[i] = c * s
    end
    for j in @view(js[(begin + 1):end])
        Threads.@threads for i in is
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            @inbounds result[i] = muladd(c, s, result[i])
        end
    end
end;

Yeah, I didn’t try LoopVectorization.

A SIMD version is slightly (10–20%) faster on my machine. It depends on the data having 4 columns, as in the example in the OP (for wider data the extra overhead may make it less competitive):

using SIMD

coefficients_vec = [Vec{4, Float64}(tuple(r...)) for r in eachrow(coefficients)];
randidxs_vec = [Vec{4, Int}(tuple(r...)) for r in eachrow(randidxs)];

function loopmultsumsimd!(result, coef_vec, state, idx_vec)
    # coef_vec[i] and idx_vec[i] hold row i packed into 4-wide SIMD vectors.
    for i in eachindex(result, coef_vec, idx_vec)
        result[i] = sum(vgather(state, idx_vec[i]) * coef_vec[i])
    end
end

with these defined and the initialization from the OP:

using BenchmarkTools

n = 300
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

we get the following:

julia> viewmultsum!(result, coefficients, state, randidxs);

julia> r1 = copy(result);

julia> loopmultsumsimd!(result, coefficients_vec, state, randidxs_vec);

julia> result β‰ˆ r1     # verify correctness
true

julia> @benchmark loopmultsumsimd!($result, $coefficients_vec, $state, $randidxs_vec)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  159.655 ΞΌs … 525.700 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     178.700 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   182.021 ΞΌs Β±  13.280 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

             β–β–ƒβ–„β–ˆβ–ˆβ–‡β–‡β–‡β–ƒ                                           
  β–β–β–β–β–β–β–β–β–‚β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β– β–ƒ
  160 ΞΌs           Histogram: frequency by time          227 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Apart from using the narrow 4-column matrices, which fit nicely into SIMD variables, this solution also uses vgather, which is efficient for the random indexing with randidxs.