Too many allocations when indexing with slices

The following code contains two equivalent implementations of a trivial algorithm: slow_func uses vectorized (slice) indexing and assignment, while fast_func indexes and assigns each array element individually in a loop.

Can anyone explain why slow_func produces 300 allocations, while fast_func only produces 4? This feels like a bug to me.

If it is not a bug, then am I right to say we should never use slice indexing when performance is important?

function fast_func(a, b)
  for i = 1:size(a, 1)
    for j = 1:size(a, 2)
      a[i, j] = b[i, j]
    end
  end
end

function slow_func(a, b)
  for i = 1:size(a, 1)
    a[i, :] = b[i, :]
  end
end

a = zeros(300, 1)
b = zeros(300, 1)
@time slow_func(a, b)
@time slow_func(a, b)

@time fast_func(a, b)
@time fast_func(a, b)

I ran it twice with --track-allocation=user and got the following. It’s a mystery to me, as I have seen this type of thing in my own code. I hope someone can explain it.

        - function fast_func(a, b)
  1967482   for i = 1:size(a, 1)
        0     for j = 1:size(a, 2)
        0       a[i, j] = b[i, j]
        -     end
        -   end
        - end
        - 
        - function slow_func(a, b)
 11307882   for i = 1:size(a, 1)
   115200     a[i, :] = b[i, :]
        -   end
        - end
        - 
        - a = zeros(300, 1)
        - b = zeros(300, 1)
        - @time slow_func(a, b)
        - @time slow_func(a, b)
        - 
        - @time fast_func(a, b)
        - @time fast_func(a, b)

I didn’t know about --track-allocation=user, thank you for that.


Slicing creates a copy. view can be used instead. Allocation of views can often be elided by the compiler.
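
A minimal sketch of the difference (the matrix size and the chosen row are arbitrary; exact numbers depend on your Julia version):

using BenchmarkTools

x = rand(100, 100)
@btime sum($x[1, :])        # slicing copies the row into a fresh Vector, so this allocates
@btime sum(view($x, 1, :))  # the view does not escape sum, so it can usually be elided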

No, depending on what you do with the data, copying it into contiguous memory might be worth it. There is also a difference in semantics, so you cannot always swap one for the other.
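
For example (toy data, just to illustrate the semantics): a slice is an independent copy, while a view aliases the parent array.

x = [1, 2, 3, 4]
s = x[2:3]         # slice: an independent copy
v = view(x, 2:3)   # view: shares memory with x
x[2] = 99
s[1]               # still 2, the copy is unaffected
v[1]               # 99, the view sees the mutation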


Slicing creates a copy. view can be used instead.

Thank you! I will try this.

copying it into contiguous memory might be worth it

I think I understand. Are you basically saying that copies and views are both useful, so we should have both? I agree with that. However, I find it odd that slice notation always creates copies instead of views, even when the compiler could in theory figure out that a view would work in a particular case.

I found the answer to my question in the manual.

If you are doing many operations on the slice, this can be good for performance because it is more efficient to work with a smaller contiguous copy than it would be to index into the original array.
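
A hedged sketch of the kind of case the manual means (the array size, row index, and pass count are arbitrary, and whether the copy actually wins depends on the hardware): making many passes over a non-contiguous row is often faster on a contiguous copy than through a strided view.

using BenchmarkTools

A = rand(1000, 1000)

# many passes over one (non-contiguous) row of a column-major matrix
function many_ops_copy(A)
  r = A[500, :]          # one allocation, but r is contiguous
  s = 0.0
  for _ = 1:100
    s += sum(r)
  end
  return s
end

function many_ops_view(A)
  r = view(A, 500, :)    # no copy, but every pass strides through A
  s = 0.0
  for _ = 1:100
    s += sum(r)
  end
  return s
end

@btime many_ops_copy($A)
@btime many_ops_view($A)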

@kristoffer.carlsson How would you rewrite slow_func to copy without allocating? I tried the following two ways of using views but neither reduces the number of allocations.

Because the operations I’m performing are simple (trivial actually, just copying without doing any other operations), I don’t think I stand to benefit from using slices to copy sub-arrays into contiguous memory. It should be faster for me to copy directly from the source array to the destination array.

function f(a, b)
  for i = 1:size(a, 1)
    a[i, :] = view(b, i, :)
  end
end

function g(a, b)
  for i = 1:size(a, 1)
    @views a[i, :] = b[i, :]
  end
end

a = zeros(300, 1)
b = zeros(300, 1)
@time f(a, b)
@time f(a, b)

@time g(a, b)
@time g(a, b)

julia --track-allocation=user tmp.jl && cat tmp.jl.mem

        - function f(a, b)
 55498125   for i = 1:size(a, 1)
    28800     a[i, :] = view(b, i, :)
        -   end
        - end
        - 
        - function g(a, b)
  1328478   for i = 1:size(a, 1)
    28800     @views a[i, :] = b[i, :]
        -   end
        - end
        - 
        - a = zeros(300, 1)
        - b = zeros(300, 1)
        - @time f(a, b)
        - @time f(a, b)
        - 
        - @time g(a, b)
        - @time g(a, b)
        - 

Correction: it actually reduces allocations substantially (115200 → 28800) but does not eliminate them.

Can anyone explain the allocation on the for statements?

 55498125   for i = 1:size(a, 1)

  1328478   for i = 1:size(a, 1)

julia> using BenchmarkTools, UnsafeArrays

julia> @btime f($a, $b) # f, g, a, and b defined as in opening post
  2.536 μs (300 allocations: 14.06 KiB)

julia> @btime g($a, $b)
  2.560 μs (300 allocations: 14.06 KiB)

julia> function h(a, b)
         for i = 1:size(a, 1)
           a[i, :] = uview(b, i, :)
         end
       end
h (generic function with 1 method)

julia> @btime h($a, $b)
  1.863 μs (0 allocations: 0 bytes)

julia> h3(a, b) = (a .= b)
h3 (generic function with 1 method)

julia> @btime h3($a, $b);
  48.704 ns (0 allocations: 0 bytes)

So the unsafe-array version is still slower than what is possible (compare h3).

Hello.

Well, I cannot resist, although I fear it is not too helpful. I am not sure what the compiler does, but using StaticArrays gives:

julia> A=@MMatrix zeros(300,1) ;
julia> B=@MMatrix zeros(300,1) ;
julia> @btime fast_func($A, $B)
  33.169 ns (0 allocations: 0 bytes)
julia> @btime slow_func($A, $B)
  32.273 ns (0 allocations: 0 bytes)
julia> @btime f($A,$B)
  113.117 ns (0 allocations: 0 bytes)
julia> @btime g($A,$B)
  113.197 ns (0 allocations: 0 bytes)

And that is with a size bigger than the recommended limit of about 100 elements, so I assume the example is too trivial? But yes, I have noticed that writing f77-style code is often not a bad idea in Julia, although it can be detrimental to readability, so it has to be used with discretion.

Best Regards

I think row-by-row copying is slow anyway because of cache misses. It may be that row views cannot be efficiently elided.

julia> using BenchmarkTools

julia> function fcolumn(a, b)
           for i = 1:size(a, 2)
               a[:, i] = view(b, :, i)
           end
       end

julia> @btime fcolumn($a,$b)
  374.663 ns (1 allocation: 48 bytes)

julia> @btime f($a,$b)
  4.002 μs (300 allocations: 14.06 KiB)

Right! But still (please note the changed sizes of A and B):

julia> function r_fast_func(a, b)
         for i = 1:size(a, 1)
           for j = 1:size(a, 2)
             a[i, j] = b[i, j]
           end
         end
       end
julia> function c_fast_func(a, b)
         for j = 1:size(a, 2)
           for i = 1:size(a, 1)
             a[i, j] = b[i, j]
           end
         end
       end
julia> B=zeros(300,1);
julia> A=zeros(300,1);
julia> @btime r_fast_func($A, $B)
  337.768 ns (0 allocations: 0 bytes)
julia> @btime c_fast_func($A, $B)
  172.458 ns (0 allocations: 0 bytes)
julia> @btime fcolumn($A, $B)
  344.550 ns (1 allocation: 48 bytes)
julia> A=zeros(300,300);
julia> B=zeros(300,300);
julia> @btime r_fast_func($A, $B)
  142.011 μs (0 allocations: 0 bytes)
julia> @btime c_fast_func($A, $B)
  50.468 μs (0 allocations: 0 bytes)
julia> @btime fcolumn($A, $B)
  101.129 ÎĽs (300 allocations: 14.06 KiB)

I think 344 ns × 300 ≈ 103 μs, so that is consistent at least.

I see. It looks like the setindex! method for views is somewhat inefficient, and it does not get any better with @inbounds either.
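
For concreteness, this is the shape of the @inbounds variant I mean (a sketch; the function name is made up):

function f_inbounds(a, b)
  for i = 1:size(a, 1)
    # bounds checks are skipped, but the view wrapper for each row is still allocated here
    @inbounds a[i, :] = view(b, i, :)
  end
end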

Thank you everyone for the answers so far. UnsafeArrays, StaticArrays, and iterating in column order are all good options for eliminating allocations.

I still don’t understand why using view results in the extra allocations, and I don’t think it should. We should be able to slice an array (even a complicated, inefficient slice), and assign each of its elements into an existing array, without allocating an intermediate array. Maybe this isn’t a bug with view per se, but it seems like an optimization that the compiler should be able to make.
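
For reference, the formulation I would expect to eventually be allocation-free is broadcasting between views; whether the wrappers are actually elided depends on the Julia version (a sketch with a made-up name, not something I have verified here):

function copy_rows!(a, b)
  for i = 1:size(a, 1)
    # element-wise assignment into a view of the destination row;
    # ideally both view wrappers stay on the stack and nothing is heap-allocated
    @views a[i, :] .= b[i, :]
  end
  return a
end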

That said, I have benchmarked the equivalents of fast_func and slow_func in my application, which processes very large arrays, and although slow_func performs O(N) allocations while fast_func performs O(1) allocations, fast_func isn’t actually faster; they take the same amount of time. So maybe there is a good reason for this behavior that has not been mentioned yet, or for some reason the “optimization” just isn’t necessary.


Wild speculation: the examples above surely fit into the processor’s L2/L3 caches, possibly even L1. If you overflow the cache and have to fetch from main RAM, I would not be surprised by qualitative changes (also depending on the processor’s prefetcher logic).

Well, maybe someone with actual background knowledge will add a comment later, I hope. I also found the situation with views on small matrices a bit confusing.

See performance - Unexpected memory allocation when using array views (julia) - Stack Overflow


Thank you!