help understanding performance with assignment to allocated matrix

A pattern I often find useful is to assign to a slice of a matrix in a loop, forgoing the innermost loop, e.g.:

for idat in 1:ndata
    array[:, idat] = [xval, yval, zval]
end

The vector of values assigned to the slice could be the output of another function, or a vector constructed as above. I mostly use this because it is convenient. However, independent of how the vector being assigned is generated, there are performance penalties that depend on how one chooses to assign the values to the matrix.

Consider the following cases; here the vals vector is generated up front, which I hope isolates the differences between the methods:

function assign_nofuse!(array::Matrix{T}) where {T}
    dim, ndata = size(array)
    vals = [convert(T, i) for i in 1:dim]
    for idat in 1:ndata
        array[:, idat] = vals       # whole-column assignment
    end
end

function assign_fused!(array::Matrix{T}) where {T}
    dim, ndata = size(array)
    vals = [convert(T, i) for i in 1:dim]
    for idat in 1:ndata
        array[:, idat] .= vals      # fused (broadcast) assignment
    end
end

function assign_doubleloop!(array::Matrix{T}) where {T}
    dim, ndata = size(array)
    vals = [convert(T, i) for i in 1:dim]
    for idat in 1:ndata
        for ivar in 1:dim
            array[ivar, idat] = vals[ivar]   # element-by-element assignment
        end
    end
end

and timing:

julia> using BenchmarkTools

julia> arr = zeros(3, 100000)

julia> @benchmark assign_nofuse!(arr)
BenchmarkTools.Trial:
  memory estimate:  3.04 MiB
  allocs estimate:  199490
  --------------
  minimum time:     1.587 ms (0.00% GC)
  median time:      1.749 ms (0.00% GC)
  mean time:        1.998 ms (3.64% GC)
  maximum time:     40.098 ms (0.00% GC)
  --------------
  samples:          2498
  evals/sample:     1

julia> @benchmark assign_fused!(arr)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     760.706 μs (0.00% GC)
  median time:      842.758 μs (0.00% GC)
  mean time:        991.368 μs (0.00% GC)
  maximum time:     3.792 ms (0.00% GC)
  --------------
  samples:          5034
  evals/sample:     1

julia> @benchmark assign_doubleloop!(arr)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     250.529 μs (0.00% GC)
  median time:      258.918 μs (0.00% GC)
  mean time:        311.633 μs (0.00% GC)
  maximum time:     1.551 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Is there something I am missing? This is the first time I have profiled speed differences where I use this pattern, so chances are I’ve been unknowingly incurring a penalty for this style in other languages. Is there a better way to use this specific style, or is it best to avoid it and individually assign values to the slice of the matrix, as in assign_doubleloop!?

Thanks.

You probably wouldn’t see this in other languages because loops in R/MATLAB/Python are so slow that the vectorized approach would be faster even with the temporary array. In C/C++ you wouldn’t even think of doing a vectorized approach and would just write the loop, and you’d notice your C/C++ code is quite fast compared to the vectorized R/MATLAB/Python :smile:. Julia is kind of the only place where you’d benchmark the two together, but given the fundamentals it’s clear why the loop is preferred.

Instead of using arrays of size 3, use StaticArrays.jl

These are essentially N-dimensional numbers, and a Vector{SArray{...}} packs the data just like a matrix, so it is just as memory-efficient; but you won’t have temporaries when constructing static arrays (they are stack-allocated), and they generate very efficient code for small N.
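For example, a minimal sketch of the pattern (the function name and setup here are mine, not from the original post):

using StaticArrays

function assign_static!(data::Vector{SVector{3, T}}) where {T}
    vals = SVector{3, T}(1, 2, 3)  # stack-allocated; no heap temporary per iteration
    for idat in eachindex(data)
        data[idat] = vals          # copies three values, no allocation
    end
end

data = fill(zero(SVector{3, Float64}), 100_000)  # packed in memory like a 3×100000 matrix
assign_static!(data)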


In the benchmarked code there are temporaries in all three versions and they all seem equivalent, so it seems like a missed optimization in the first two. @rmar there are some improvements to slices on 0.7 you can try. Also try playing with @inbounds for the loop?
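For concreteness, here is a sketch of that variant (it is not spelled out in the thread, but corresponds to the doubleloop_inbounds row in the timings further down):

function assign_doubleloop_inbounds!(array::Matrix{T}) where {T}
    dim, ndata = size(array)
    vals = [convert(T, i) for i in 1:dim]
    @inbounds for idat in 1:ndata
        for ivar in 1:dim
            array[ivar, idat] = vals[ivar]  # bounds checks elided
        end
    end
end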

Yes, but it really shouldn’t have them. It should be a length-100000 vector of 3-dimensional SVectors.

The difference between cases 2 (fused) and 3 (double loop) is already known and documented in

That says small loops can have about a 50% performance drop. This is a little higher, but in the same range. I’m surprised the first doesn’t just lower to that, though; or if it does, it doesn’t do so efficiently.

That link says 50% for size-one arrays, so surely something else is going on here? Re StaticArrays: certainly it would be useful for vals, to avoid a temp, but I don’t think converting the array itself to an array of static arrays would help.

It would get rid of the slicing, which would definitely help. StaticArrays also does a lot of really nice things, more than just unrolling a few loops. It also hardcodes basics like QR factorizations in a way that makes things even more optimized, just by dispatching to more specialized algorithms. Generally, if an algorithm is always slicing small arrays out of a high-dimensional array, it’s likely static arrays will be helpful.
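As an aside, on current Julia (1.6 and later) you can even view an existing matrix as a vector of SVectors without copying, via reinterpret(reshape, ...). That API postdates the 0.6/0.7 versions discussed in this thread, so treat this as a sketch:

using StaticArrays

arr = zeros(3, 100_000)
cols = reinterpret(reshape, SVector{3, Float64}, arr)  # 100000-element vector view, no copy
cols[1] = SVector(1.0, 2.0, 3.0)                       # writes through to arr[:, 1]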

For reference, here are the timings I get, in microseconds. 0.7 is today’s nightly. I haven’t been able to make BenchmarkTools work, so these are the best of 100 @time runs. For this particular case, fusion seems to get worse in 0.7…

|---------------------+------+------|
|                     |  0.6 |  0.7 |
|---------------------+------+------|
| nofuse              | 1819 | 1255 |
| fused               |  793 | 1179 |
| doubleloop          |  299 |  343 |
| doubleloop_inbounds |  289 |  230 |
|---------------------+------+------|