Too many allocations when indexing with slices

The following code contains two equivalent implementations of a trivial algorithm: slow_func uses sliced indexing and assignment, while fast_func indexes and assigns each array element individually in a loop.

Can anyone explain why slow_func produces 300 allocations, while fast_func only produces 4? This feels like a bug to me.

If it is not a bug, then am I right to say we should never use slice indexing when performance is important?

function fast_func(a, b)
  for i = 1:size(a, 1)
    for j = 1:size(a, 2)
      a[i, j] = b[i, j]
    end
  end
end

function slow_func(a, b)
  for i = 1:size(a, 1)
    a[i, :] = b[i, :]
  end
end

a = zeros(300, 1)
b = zeros(300, 1)
@time slow_func(a, b)
@time slow_func(a, b)

@time fast_func(a, b)
@time fast_func(a, b)

I ran it twice with --track-allocation=user and got the following. It’s a mystery to me, and I have seen this kind of thing in my own code too. Hope someone can explain it.

        - function fast_func(a, b)
  1967482   for i = 1:size(a, 1)
        0     for j = 1:size(a, 2)
        0       a[i, j] = b[i, j]
        -     end
        -   end
        - end
        - 
        - function slow_func(a, b)
 11307882   for i = 1:size(a, 1)
   115200     a[i, :] = b[i, :]
        -   end
        - end
        - 
        - a = zeros(300, 1)
        - b = zeros(300, 1)
        - @time slow_func(a, b)
        - @time slow_func(a, b)
        - 
        - @time fast_func(a, b)
        - @time fast_func(a, b)

I didn’t know about --track-allocation=user, thank you for that.


Slicing creates a copy. view can be used instead. Allocation of views can often be elided by the compiler.
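A minimal sketch of the difference (variable names here are just for illustration):

```julia
b = zeros(3, 3)

row_copy = b[1, :]        # slicing allocates a new Vector (a copy of the row)
row_view = view(b, 1, :)  # a SubArray sharing b's memory; no data is copied

row_copy[1] = 1.0         # writes to the copy only; b is untouched
row_view[2] = 2.0         # writes through to b

b[1, 1] == 0.0  # true: the copy's write never reached b
b[1, 2] == 2.0  # true: the view's write is visible in b
```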

No, depending on what you do with the data, copying it into contiguous memory might be worth it. There is also a difference in semantics, so you cannot always swap one for the other.
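One hedged illustration of when the copy can pay off: if a strided row (Julia arrays are column-major, so rows are strided) is read many times, making one contiguous copy up front can beat repeatedly indexing through a view. The function names below are made up for the example; actual timings depend on sizes and hardware.

```julia
B = rand(1000, 1000)

# Repeated reductions over one row; the view reads strided memory every pass.
function sum_row_view(B, i, reps)
    v = view(B, i, :)   # no copy, but every pass strides across columns
    s = 0.0
    for _ in 1:reps
        s += sum(v)
    end
    return s
end

function sum_row_copy(B, i, reps)
    r = B[i, :]         # one allocation; all later passes are contiguous
    s = 0.0
    for _ in 1:reps
        s += sum(r)
    end
    return s
end

sum_row_view(B, 1, 100) ≈ sum_row_copy(B, 1, 100)  # same result, different cost
```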


Slicing creates a copy. view can be used instead.

Thank you! I will try this.

copying it into contiguous memory might be worth it

I think I understand. Are you basically saying that copies and views are both useful, so we should have both? I agree with that. However, I find it odd that slice notation always creates copies instead of views, even when the compiler could in theory figure out that a view would work in a given case.

I found the answer to my question in the manual.

If you are doing many operations on the slice, this can be good for performance because it is more efficient to work with a smaller contiguous copy than it would be to index into the original array.

@kristoffer.carlsson How would you rewrite slow_func to copy without allocating? I tried the following two ways of using views but neither reduces the number of allocations.

Because the operations I’m performing are simple (trivial actually, just copying without doing any operations), I don’t think I stand to benefit from using slices to copy sub-arrays into contiguous memory. It should be faster for me to copy directly from source array to destination array.

function f(a, b)
  for i = 1:size(a, 1)
    a[i, :] = view(b, i, :)
  end
end

function g(a, b)
  for i = 1:size(a, 1)
    @views a[i, :] = b[i, :]
  end
end

a = zeros(300, 1)
b = zeros(300, 1)
@time f(a, b)
@time f(a, b)

@time g(a, b)
@time g(a, b)

julia --track-allocation=user tmp.jl && cat tmp.jl.mem

        - function f(a, b)
 55498125   for i = 1:size(a, 1)
    28800     a[i, :] = view(b, i, :)
        -   end
        - end
        - 
        - function g(a, b)
  1328478   for i = 1:size(a, 1)
    28800     @views a[i, :] = b[i, :]
        -   end
        - end
        - 
        - a = zeros(300, 1)
        - b = zeros(300, 1)
        - @time f(a, b)
        - @time f(a, b)
        - 
        - @time g(a, b)
        - @time g(a, b)
        - 

Correction, it actually reduces allocations substantially (115200 -> 28800) but does not eliminate them.

Can anyone explain the allocation on the for statements?

 55498125   for i = 1:size(a, 1)

  1328478   for i = 1:size(a, 1)
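For what it’s worth, one more variant to try, sketched under the assumption of a reasonably recent Julia: broadcast assignment (.=) instead of indexed assignment (=), so no temporary array is materialized for the right-hand side; whether the view object itself still allocates depends on the Julia version.

```julia
# h2 is an illustrative name, not from the posts above.
function h2(a, b)
    for i in 1:size(a, 1)
        # .= broadcasts element-by-element into the destination slice;
        # @views turns b[i, :] into a view instead of a copy.
        @views a[i, :] .= b[i, :]
    end
    return a
end

a = zeros(300, 1)
b = rand(300, 1)
h2(a, b)
a == b  # true
```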
julia> using BenchmarkTools, UnsafeArrays

julia> @btime f($a, $b) # f, g, a, and b defined as in the posts above
  2.536 μs (300 allocations: 14.06 KiB)

julia> @btime g($a, $b)
  2.560 μs (300 allocations: 14.06 KiB)

julia> function h(a, b)
         for i = 1:size(a, 1)
           a[i, :] = uview(b, i, :)
         end
       end
h (generic function with 1 method)

julia> @btime h($a, $b)
  1.863 μs (0 allocations: 0 bytes)

julia> h3(a, b) = (a .= b)
h3 (generic function with 1 method)

julia> @btime h3($a, $b);
  48.704 ns (0 allocations: 0 bytes)

So even with unsafe views, the row-by-row copy is still slower than it could be.
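For whole-array copying like h3 does, copyto! is the other standard in-place option; a minimal sketch:

```julia
a = zeros(300, 1)
b = rand(300, 1)

copyto!(a, b)  # copies b's elements into a in place; no new array is allocated
a == b         # true
```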

Hello.

Well, I cannot resist, although I fear it is not too helpful. I am not sure what the compiler does, but using StaticArrays gives

julia> A=@MMatrix zeros(300,1) ;
julia> B=@MMatrix zeros(300,1) ;
julia> @btime fast_func($A, $B)
  33.169 ns (0 allocations: 0 bytes)
julia> @btime slow_func($A, $B)
  32.273 ns (0 allocations: 0 bytes)
julia> @btime f($A,$B)
  113.117 ns (0 allocations: 0 bytes)
julia> @btime g($A,$B)
  113.197 ns (0 allocations: 0 bytes)

And that with a size bigger than the recommended maximum of about 100 elements, so I assume the example is too trivial? But yes, I have noticed that writing F77-style loops is often not a bad idea in Julia, although it can be detrimental to readability, so it has to be used with discretion.

Best Regards

I think row-by-row copying is slow anyway because of cache misses (Julia arrays are column-major). It may be that row views cannot be efficiently elided.

julia> using BenchmarkTools

julia> function fcolumn(a, b)
           for i = 1:size(a, 2)
               a[:, i] = view(b, :, i)
           end
       end

julia> @btime fcolumn($a,$b)
  374.663 ns (1 allocation: 48 bytes)

julia> @btime f($a,$b)
  4.002 μs (300 allocations: 14.06 KiB)

Right! But still (please note the changed size of A and B):

julia> function r_fast_func(a, b)
         for i = 1:size(a, 1)
           for j = 1:size(a, 2)
             a[i, j] = b[i, j]
           end
         end
       end
julia> function c_fast_func(a, b)
         for j = 1:size(a, 2)
           for i = 1:size(a, 1)
             a[i, j] = b[i, j]
           end
         end
       end
julia> B=zeros(300,1);
julia> A=zeros(300,1);
julia> @btime r_fast_func($A, $B)
  337.768 ns (0 allocations: 0 bytes)
julia> @btime c_fast_func($A, $B)
  172.458 ns (0 allocations: 0 bytes)
julia> @btime fcolumn($A, $B)
  344.550 ns (1 allocation: 48 bytes)
julia> A=zeros(300,300);
julia> B=zeros(300,300);
julia> @btime r_fast_func($A, $B)
  142.011 μs (0 allocations: 0 bytes)
julia> @btime c_fast_func($A, $B)
  50.468 μs (0 allocations: 0 bytes)
julia> @btime fcolumn($A, $B)
  101.129 μs (300 allocations: 14.06 KiB)

I think 344 ns *300=103 us, so that is consistent at least.

I see. It looks like the setindex! method for views is somewhat inefficient, and it does not get any better with @inbounds either.

Thank you everyone for the answers so far. UnsafeArrays, StaticArrays, and iterating in column order are all good options for eliminating allocations.

I still don’t understand why using view results in the extra allocations, and I don’t think it should. We should be able to slice an array (even a complicated, inefficient slice), and assign each of its elements into an existing array, without allocating an intermediate array. Maybe this isn’t a bug with view per se, but it seems like an optimization that the compiler should be able to make.

That said, I have benchmarked the equivalents of fast_func and slow_func in my application, which processes very large arrays, and although slow_func performs O(N) allocations while fast_func performs O(1), fast_func isn’t actually faster; they take the same amount of time. So maybe there is a good reason for this behavior that has not been mentioned yet, or for some reason the “optimization” just isn’t necessary.


Wild speculation: the examples above surely fit into the processor’s L2/L3 caches, possibly even L1. If you overflow the cache and have to fetch from main RAM, I would not be surprised by qualitative changes (also depending on the processor’s prefetcher logic).

Well, maybe someone with actual background knowledge will add a comment later, I hope. I also found the views situation with small matrices a bit confusing.

See https://stackoverflow.com/questions/47590839/unexpected-memory-allocation-when-using-array-views-julia/47607539#47607539


Thank you!