Reduce memory allocated in array view and in place sum

Hi all, new Julia user here. I want to significantly increase the performance of (and reduce the memory allocated by) a function similar to the MWE below. Basically I need to access the elements of a vector (indexed with a matrix of indices), multiply those elements elementwise by a coefficient matrix, and then add up the columns of the result.

using BenchmarkTools

n = 300
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

@benchmark viewmultsum!(result, coefficients, state, randidxs)
BenchmarkTools.Trial: 5263 samples with 1 evaluation.
 Range (min … max):  581.567 ΞΌs …   9.461 ms  β”Š GC (min … max):  0.00% … 90.38%
 Time  (median):     760.413 ΞΌs               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   941.917 ΞΌs Β± 660.843 ΞΌs  β”Š GC (mean Β± Οƒ):  17.49% Β± 18.97%

  β–ƒβ–‚β–…β–ˆβ–†β–…β–ƒβ–‚β–                                            ▁▁ ▁     ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–„β–ƒβ–„β–β–ƒβ–ƒβ–β–ƒβ–ƒβ–β–β–β–β–ƒβ–β–β–ƒβ–ƒβ–ƒβ–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–† β–ˆ
  582 ΞΌs        Histogram: log(frequency) by time       3.53 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 2.

I’ll need to call a function similar to this several thousand times while integrating a very large system of ODEs (n > 1e4), so it is performance critical. I thought using @views and summing in place with sum! would mean that I don’t have to allocate much memory. I only have 2 allocations in the MWE, but they seem quite large and I don’t know exactly where they come from.

I know accessing the elements of state in a random manner is not ideal for cache performance, but in the real ODE system there is no structured way I can access its elements. In short, the poor cache performance may be unavoidable.

Is there any way I can reduce the allocations in the MWE? Or is the best bet for performance improvement to look at parallelization using hardware like GPUs? Any other performance tips/pointers are much appreciated.

Thanks!

@views only prevents allocations from the indexing; the .* still creates a temporary array that’s passed to sum!. That’s your allocation (there are two, but it’s just one array: one for the array header, the other for the 2.75 MiB (300^2 × 4 × 8 bytes) of data).
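
An easy way to see this for yourself is to materialize the temporary and measure it:

julia> tmp = @views coefficients .* state[randidxs];

julia> sizeof(tmp)  # 300^2 rows × 4 columns × 8 bytes per Float64
2880000

That is 2,880,000 bytes ≈ 2.75 MiB, matching the memory estimate above.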

You’ll probably see better performance if you just write out the for loop(s) yourself.
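
A minimal sketch of such a loop (the replies below develop this much further, with index validation, threading, and SIMD):

function loopsum!(result, coefficients, state, randidxs)
    # Zero the output, then accumulate column by column in memory order.
    fill!(result, zero(eltype(result)))
    for j in axes(coefficients, 2), i in axes(coefficients, 1)
        result[i] += coefficients[i, j] * state[randidxs[i, j]]
    end
    return result
end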

2 Likes

You can use MappedArrays.jl to remove the allocations without having to write the loop yourself:

using MappedArrays

function viewmultsummapped!(result, coefficients, state, randidxs)
    @views sum!(result, mappedarray(*, coefficients, state[randidxs]))
end;

@benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 5860 samples with 1 evaluation.
 Range (min … max):  814.573 ΞΌs …  1.528 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     834.977 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   845.770 ΞΌs Β± 34.106 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒβ–„β–†β–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–…β–…β–„β–„β–„β–ƒβ–ƒβ–β–β–‚β–β–β–    ▁ ▁        ▁  ▁                 β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–†β–†β–„β–„β–…β–‡β–…β–… β–ˆ
  815 ΞΌs        Histogram: log(frequency) by time       967 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

For comparison, the minimum time when I ran your code was 1.115 ms, so I got a speedup of about 27%.

5 Likes

As you suggest, a GPU can help at these sizes. Your original code works as-is; you only need to move the data to the device. (MappedArrays won’t work, however.) Here are some benchmarks on my 2017 laptop with an NVIDIA GeForce GTX 1050. The GPU gives a 2x speedup over the MappedArray version.

using BenchmarkTools
using CUDA
using MappedArrays

n = 300
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

cucoefficients = CuArray(coefficients);
custate = CuArray(state);
curandidxs = CuArray(randidxs);
curesult = CuArray(result);

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

function viewmultsummapped!(result, coefficients, state, randidxs)
    @views sum!(result, mappedarray(*, coefficients, state[randidxs]))
end;
julia> @benchmark viewmultsum!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 5876 samples with 1 evaluation.
 Range (min … max):  662.337 ΞΌs …   2.906 ms  β”Š GC (min … max): 0.00% … 27.90%
 Time  (median):     705.030 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   842.699 ΞΌs Β± 405.666 ΞΌs  β”Š GC (mean Β± Οƒ):  6.97% Β± 11.42%

  β–ˆβ–‡β–†β–…β–„β–ƒβ–                      ▁ ▁  ▁▁▁                      ▁  ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–…β–ƒβ–„β–„β–…β–β–β–ƒβ–β–β–β–ƒβ–β–β–ƒβ–β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–„β–ƒβ–ƒβ–β–„β–„β–β–ƒβ–ƒβ–β–β–β–β–β–β–ˆβ–‡β–ˆβ–ˆ β–ˆ
  662 ΞΌs        Histogram: log(frequency) by time       2.58 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 2.

julia> @benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  438.168 ΞΌs …  1.013 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     456.494 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   472.989 ΞΌs Β± 39.027 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–„β–ˆβ–…β–ƒ                                                        
  β–β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–„β–ƒβ–‚β–‚β–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  438 ΞΌs          Histogram: frequency by time          613 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark viewmultsum!($curesult, $cucoefficients, $custate, $curandidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  219.703 ΞΌs …  14.879 ms  β”Š GC (min … max): 0.00% … 35.75%
 Time  (median):     262.649 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   273.617 ΞΌs Β± 370.255 ΞΌs  β”Š GC (mean Β± Οƒ):  1.31% Β±  0.95%

                              β–β–„β–‡β–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–‚β–β–β–β–β–β–β– ▁              β–‚
  β–„β–β–ƒβ–ƒβ–β–β–…β–„β–„β–ƒβ–…β–†β–…β–…β–…β–‡β–†β–‡β–ˆβ–‡β–‡β–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–‡β–ˆβ–‡β–‡β–†β–‡ β–ˆ
  220 ΞΌs        Histogram: log(frequency) by time        300 ΞΌs <

 Memory estimate: 11.92 KiB, allocs estimate: 213.

(I switched my laptop from battery saver to performance mode for these benchmarks, that’s why the CPU versions are twice as fast as in my previous post.)

1 Like

If you can guarantee that coefficients and state[randidxs] have the same shapes each time, you could make another preallocated buffer like result for the intermediate array.

# bad quick way to do it, should specify shape and type
intermediate = coefficients .* state[randidxs];
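
If you’d rather not compute the product once just to size the buffer, a cleaner preallocation (using Base’s similar, which copies the shape and element type but not the contents) would be:

intermediate = similar(coefficients);  # same axes and eltype, contents uninitialized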

function viewmultsum2!(result, coefficients, state, randidxs, intermediate)
    intermediate .= coefficients .* @view(state[randidxs])
    sum!(result, intermediate)
end;

@benchmark viewmultsum!($result, $coefficients, $state, $randidxs)
@benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
@benchmark viewmultsum2!($result, $coefficients, $state, $randidxs, $intermediate)
Benchmark results:
julia> @benchmark viewmultsum!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 3690 samples with 1 evaluation.
 Range (min … max):  404.610 ΞΌs … 60.644 ms  β”Š GC (min … max):  0.00% …  0.00%
 Time  (median):     485.685 ΞΌs              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):     1.322 ms Β±  5.786 ms  β”Š GC (mean Β± Οƒ):  11.65% Β± 13.25%

  β–ˆβ–ƒβ–‚                                                           
  β–ˆβ–ˆβ–ˆβ–‡β–„β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–† β–ˆ
  405 ΞΌs        Histogram: log(frequency) by time        49 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 2.

julia> @benchmark viewmultsummapped!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 7512 samples with 1 evaluation.
 Range (min … max):  290.310 ΞΌs … 53.160 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     314.765 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   661.857 ΞΌs Β±  4.113 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–†β–ˆβ–‡β–ƒ                                                         
  β–…β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚ β–ƒ
  290 ΞΌs          Histogram: frequency by time          719 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark viewmultsum2!($result, $coefficients, $state, $randidxs, $intermediate)
BenchmarkTools.Trial: 6019 samples with 1 evaluation.
 Range (min … max):  352.540 ΞΌs … 53.508 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     393.821 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   826.631 ΞΌs Β±  4.602 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–β–†β–ˆβ–…β–ƒ                                                        
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–β–‚ β–ƒ
  353 ΞΌs          Histogram: frequency by time          987 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I would hazard a guess that the mapped-array method is more efficient than the intermediate-buffer method because it does fewer array element assignments. Neither allocates, and all three give the same result.

1 Like

I would recommend at least trying a loop implementation for reference.

You can indeed beat MappedArrays by writing out the loop, but only if you use @inbounds (at least I couldn’t do it without @inbounds). Of course, when using @inbounds you should always check the validity of the indices in some other way within the same function.

In your case, you need to be extra careful not to apply @inbounds to the indexing of state, since you cannot prove that an arbitrary element of randidxs is a valid index into state using information available within the function.
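
(If randidxs is fixed, one cheap up-front check along these lines would justify @inbounds on state too, at the cost of a full pass over the indices:

all(in(eachindex(state)), randidxs) || error("out of bounds indices!")

The function below instead simply leaves the state[r] access bounds-checked.)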

function loopmultsum!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    for j in js
        for i in is
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            # The first column initializes result[i]; later columns accumulate via muladd.
            @inbounds result[i] = (j == first(js)) ? c * s : muladd(c, s, result[i])
        end
    end
end;


@benchmark loopmultsum!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  300.811 ΞΌs … 618.805 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     306.791 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   314.838 ΞΌs Β±  22.994 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–β–†β–ˆβ–‡β–…β–„β–ƒβ–‚β–β–‚β–ƒβ–‚β–         ▁▂▁▁                                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–†β–‡β–‡β–‡β–‡β–†β–†β–…β–…β–†β–‡β–‡β–†β–†β–†β–…β–…β–„β–…β–…β–„β–…β–…β–„β–…β–…β–„β–„β–…β–… β–ˆ
  301 ΞΌs        Histogram: log(frequency) by time        418 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This benchmark is from the same machine as those in my previous post. The numbers split the difference between MappedArrays and CUDA.

1 Like

Loops with multi-threading:

julia> function loopmultsum2!(result, coefficients, state, randidxs)
           is, js = axes(coefficients)
           if (is != eachindex(result)) || ((is, js) != axes(randidxs))
               error("mismatched array sizes")
           end
           # This bounds-check takes 200ΞΌs, allows safe @inbounds loop below to be just 80ΞΌs
           # all(in(eachindex(state)), extrema(randidxs)) || error("out of bounds indices!")  # [Edit, comment was wrong before]
           Threads.@threads for i in is
               acc = zero(eltype(result))
               for j in js
                   @inbounds c, r = coefficients[i, j], randidxs[i, j]
                   acc += c * state[r]
               end
               @inbounds result[i] = acc  
            end
       end;

julia> @btime loopmultsum2!($result, $coefficients, $state, $randidxs);
  min 115.916 ΞΌs, mean 201.250 ΞΌs (22 allocations, 2.22 KiB)

# Compare earlier loops, and original:
julia> @btime loopmultsum!($result, $coefficients, $state, $randidxs);
  min 345.042 ΞΌs, mean 386.605 ΞΌs (0 allocations)

julia> @btime viewmultsum!($result, $coefficients, $state, $randidxs);
  min 494.833 ΞΌs, mean 852.841 ΞΌs (3 allocations, 2.75 MiB)

Removing the bounds check in state[r] brings this to min 80 μs, and adding LoopVectorization.@tturbo further reduces it to min 46.500 μs, mean 81.436 μs (0 allocations). If you know what made randidxs, then perhaps these are safe.
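
For reference, a sketch of what the @tturbo variant might look like (assuming LoopVectorization.jl handles the indirect state[randidxs[i, j]] load; the function name is illustrative):

using LoopVectorization

function loopmultsumturbo!(result, coefficients, state, randidxs)
    # @tturbo threads and vectorizes the loop nest and elides all bounds
    # checks, so the caller must guarantee all(in(eachindex(state)), randidxs).
    @tturbo for i in axes(coefficients, 1)
        acc = zero(eltype(result))
        for j in axes(coefficients, 2)
            acc += coefficients[i, j] * state[randidxs[i, j]]
        end
        result[i] = acc
    end
end;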

2 Likes

Thanks for the help, everyone! This is exactly what I needed.

1 Like

Even with multithreading, it’s faster to iterate in memory order, i.e., let the loop over i be the inner loop, even though it requires more reads and writes to result. This is with julia --threads=4:

function loopmultsumthreads!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    for j in js
        Threads.@threads for i in is
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            @inbounds result[i] = (j == first(js)) ? c * s : muladd(c, s, result[i])
        end
    end
end;
julia> @benchmark loopmultsumthreads!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  113.197 ΞΌs …  3.923 ms  β”Š GC (min … max): 0.00% … 89.74%
 Time  (median):     125.924 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   128.471 ΞΌs Β± 53.589 ΞΌs  β”Š GC (mean Β± Οƒ):  0.54% Β±  1.27%
 [...]
 Memory estimate: 9.69 KiB, allocs estimate: 98.

julia> # @mcabbott's version
       @benchmark loopmultsum2!($result, $coefficients, $state, $randidxs) 
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  139.301 ΞΌs … 420.524 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     149.647 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   152.976 ΞΌs Β±  16.754 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%
 [...]
 Memory estimate: 2.34 KiB, allocs estimate: 24.

And to anyone worried about the branch in the inner loop in my versions, don’t worry, the compiler rewrites this and pulls the branch outside the loop exactly the way you’re itching to do by hand.
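
If you want to check what the compiler did, inspecting the single-threaded loopmultsum! from the earlier reply is easiest (Threads.@threads moves the loop body into a closure, which makes the IR harder to read):

using InteractiveUtils  # provides @code_llvm; loaded by default in the REPL
@code_llvm debuginfo=:none loopmultsum!(result, coefficients, state, randidxs)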

(Btw., @mcabbott, your commented-out bounds check should use eachindex(state), not eachindex(result).)

Edit: The difference is not just iteration order, but whether the accumulator is initialized to zero or directly to its first value. Here’s a slight modification of loopmultsum2! that comes closer to loopmultsumthreads!:

function loopmultsum3!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    j1 = first(js)
    jrest = @view(js[(begin + 1):end])
    Threads.@threads for i in is
        @inbounds c, r = coefficients[i, j1], randidxs[i, j1]
        s = state[r]
        acc = c * s
        for j in jrest
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            acc = muladd(c, s, acc)
        end
        @inbounds result[i] = acc
    end
end;
julia> @benchmark loopmultsum3!($result, $coefficients, $state, $randidxs)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  120.678 ΞΌs … 629.706 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     136.230 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   141.047 ΞΌs Β±  21.892 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%
 [...]
 Memory estimate: 2.41 KiB, allocs estimate: 24.

1 Like

I certainly don’t mean to claim mine is the last word. Playing with memory order would be one more avenue, e.g. transposing coefficients may or may not help.
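
For the curious, a sketch of the transposed-layout idea (storing the data as 4 × n^2 so that each i’s coefficients and indices are contiguous; the names are illustrative and I haven’t benchmarked this):

coefficientsT = permutedims(coefficients);
randidxsT = permutedims(randidxs);

function loopmultsumtransposed!(result, coefficientsT, state, randidxsT)
    js, is = axes(coefficientsT)
    Threads.@threads for i in is
        acc = zero(eltype(result))
        for j in js
            # Plain indexing here, so the state access stays bounds-checked.
            acc = muladd(coefficientsT[j, i], state[randidxsT[j, i]], acc)
        end
        result[i] = acc
    end
end;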

However, on my machine, this loopmultsumthreads! is 30% slower. And it gives an error with @tturbo on the outer loop (and no benefit on inner?). Perhaps I’ve been conditioned to write code simple & branch-free enough that LoopVectorization likes it.

(Good point about the bounds check, fixed above now.)

1 Like

That’s interesting. Is it an architecture thing where your machine is less sensitive to memory order and more sensitive to optimal thread utilization? I’m on Intel x86, specifically i7-7700HQ.

Or does the branch thing not get optimized on your end (no idea why that would be)? What if you try doing it by hand like this (makes no difference either way for me):

function loopmultsumthreads2!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    if (is != eachindex(result)) || ((is, js) != axes(randidxs))
        error("mismatched array sizes")
    end
    j1 = first(js)
    Threads.@threads for i in is
        @inbounds c, r = coefficients[i, j1], randidxs[i, j1]
        s = state[r]
        @inbounds result[i] = c * s
    end
    for j in @view(js[(begin + 1):end])
        Threads.@threads for i in is
            @inbounds c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            @inbounds result[i] = muladd(c, s, result[i])
        end
    end
end;

Yeah, I didn’t try LoopVectorization.

A SIMD version is slightly (10–20%) faster on my machine. It depends on the data having 4 columns, as in the example in the OP (for wider data the extra overhead may make it less competitive):

using SIMD

coefficients_vec = [Vec{4, Float64}(tuple(r...)) for r in eachrow(coefficients)];
randidxs_vec = [Vec{4, Int}(tuple(r...)) for r in eachrow(randidxs)];

function loopmultsumsimd!(result, coef_vec, state, idx_vec)
    # coef_vec[i] and idx_vec[i] hold row i packed into 4-wide SIMD vectors.
    for i in eachindex(result, coef_vec, idx_vec)
        result[i] = sum(vgather(state, idx_vec[i]) * coef_vec[i])
    end
end

with these defined and the initialization from the OP:

using BenchmarkTools

n = 300
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

we get the following:

julia> viewmultsum!(result, coefficients, state, randidxs);

julia> r1 = copy(result);

julia> loopmultsumsimd!(result, coefficients_vec, state, randidxs_vec);

julia> result β‰ˆ r1     # verify correctness
true

julia> @benchmark loopmultsumsimd!($result, $coefficients_vec, $state, $randidxs_vec)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  159.655 ΞΌs … 525.700 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     178.700 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   182.021 ΞΌs Β±  13.280 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

             β–β–ƒβ–„β–ˆβ–ˆβ–‡β–‡β–‡β–ƒ                                           
  β–β–β–β–β–β–β–β–β–‚β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β– β–ƒ
  160 ΞΌs           Histogram: frequency by time          227 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Apart from using the narrow 4-column matrices, which fit nicely into SIMD variables, this solution also uses vgather, which is efficient for the random indexing with randidxs.