# Too many allocations when indexing with slices

The following code contains two equivalent implementations of a trivial algorithm. `slow_func` uses vectorized indexing and assignment while `fast_func` individually indexes and assigns each array element in a loop.

Can anyone explain why `slow_func` produces 300 allocations, while `fast_func` only produces 4? This feels like a bug to me.

If it is not a bug, then am I right to say we should never use slice indexing when performance is important?

```julia
function fast_func(a, b)
    for i = 1:size(a, 1)
        for j = 1:size(a, 2)
            a[i, j] = b[i, j]
        end
    end
end

function slow_func(a, b)
    for i = 1:size(a, 1)
        a[i, :] = b[i, :]
    end
end

a = zeros(300, 1)
b = zeros(300, 1)
@time slow_func(a, b)
@time slow_func(a, b)

@time fast_func(a, b)
@time fast_func(a, b)
```

I ran it twice with `--track-allocation=user` and got the following. It’s a mystery to me too, as I have seen this kind of thing in my own code. Hope someone can explain it.

```
        - function fast_func(a, b)
  1967482     for i = 1:size(a, 1)
        0         for j = 1:size(a, 2)
        0             a[i, j] = b[i, j]
        -         end
        -     end
        - end
        -
        - function slow_func(a, b)
 11307882     for i = 1:size(a, 1)
   115200         a[i, :] = b[i, :]
        -     end
        - end
        -
        - a = zeros(300, 1)
        - b = zeros(300, 1)
        - @time slow_func(a, b)
        - @time slow_func(a, b)
        -
        - @time fast_func(a, b)
        - @time fast_func(a, b)
```

I didn’t know about `--track-allocation=user`, thank you for that.


Slicing creates a copy. `view` can be used instead. Allocation of `view`s can often be elided by the compiler.

No, depending on what you do with the data, copying it into contiguous memory might be worth it. There is also a difference in semantics, so you cannot always swap one for the other.
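To make the semantic difference concrete (a minimal sketch, not from the thread): a slice produces an independent copy, while a `view` aliases the parent array, so writes through one are visible through the other.

```julia
b = [1.0 2.0; 3.0 4.0]

s = b[1, :]        # slicing allocates an independent copy of row 1
v = view(b, 1, :)  # a view aliases b without copying the data

s[1] = 99.0        # modifies only the copy
v[2] = -1.0        # writes through to b

b[1, 1]  # still 1.0: the slice copy left b untouched
b[1, 2]  # now -1.0: the view mutated b in place
```

This is why the two notations are not interchangeable in general: code that mutates a slice relies on the copy semantics.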


Slicing creates a copy. `view` can be used instead.

Thank you! I will try this.

copying it into continuous memory might be worth it

I think I understand. Are you basically saying that copies and views are both useful, so we should have both? I agree with that. However, I find it odd that slice notation always creates a copy instead of a view, even when the compiler could in theory determine that a view would work in a given case.

I found the answer to my question in the manual.

If you are doing many operations on the slice, this can be good for performance because it is more efficient to work with a smaller contiguous copy than it would be to index into the original array.
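A sketch of the tradeoff the manual describes (the names `row_view`/`row_copy` are illustrative, not from the thread): Julia arrays are column-major, so a row of a matrix is strided in memory. If you will operate on that row many times, paying for one contiguous copy up front can beat repeatedly striding through the original.

```julia
B = rand(1000, 1000)

row_view = view(B, 1, :)  # no allocation, but strided (cache-unfriendly) access
row_copy = B[1, :]        # one allocation, after which access is contiguous

# Both represent the same data; the difference is purely the memory access pattern.
total_v = sum(row_view)
total_c = sum(row_copy)
total_v ≈ total_c  # true
```

For a single pass over the data, as in `slow_func`, the copy buys you nothing, which matches the point below.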

@kristoffer.carlsson How would you rewrite `slow_func` to copy without allocating? I tried the following two ways of using views but neither reduces the number of allocations.

Because the operations I’m performing are simple (trivial actually, just copying without doing any computation), I don’t think I stand to benefit from using slices to copy sub-arrays into contiguous memory. It should be faster for me to copy directly from the source array to the destination array.

```julia
function f(a, b)
    for i = 1:size(a, 1)
        a[i, :] = view(b, i, :)
    end
end

function g(a, b)
    for i = 1:size(a, 1)
        @views a[i, :] = b[i, :]
    end
end

a = zeros(300, 1)
b = zeros(300, 1)
@time f(a, b)
@time f(a, b)

@time g(a, b)
@time g(a, b)
```

`julia --track-allocation=user tmp.jl && cat tmp.jl.mem`

```
        - function f(a, b)
 55498125     for i = 1:size(a, 1)
    28800         a[i, :] = view(b, i, :)
        -     end
        - end
        -
        - function g(a, b)
  1328478     for i = 1:size(a, 1)
    28800         @views a[i, :] = b[i, :]
        -     end
        - end
        -
        - a = zeros(300, 1)
        - b = zeros(300, 1)
        - @time f(a, b)
        - @time f(a, b)
        -
        - @time g(a, b)
        - @time g(a, b)
```

Correction, it actually reduces allocations substantially (115200 -> 28800) but does not eliminate them.

Can anyone explain the allocations reported on the `for` statements?

```
55498125   for i = 1:size(a, 1)
 1328478   for i = 1:size(a, 1)
```
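One thing worth ruling out (an assumption on my part, based on the manual's memory-allocation-analysis docs): `--track-allocation` charges everything that happens while a line runs to that line, including JIT compilation on the first call, which can inflate the counts on the `for` lines. The manual suggests running once to compile, clearing the counters with `Profile.clear_malloc_data()`, and measuring the second run:

```julia
# Run with: julia --track-allocation=user thisfile.jl
using Profile

function f(a, b)
    for i = 1:size(a, 1)
        a[i, :] = view(b, i, :)
    end
end

a = zeros(300, 1)
b = ones(300, 1)

f(a, b)                      # first call: compilation allocates and gets charged to these lines
Profile.clear_malloc_data()  # reset the per-line allocation counters
f(a, b)                      # the .mem file now reflects only this call
```

If the large numbers on the `for` lines shrink after this, they were mostly compilation, not the loop itself.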
```julia
julia> using BenchmarkTools, UnsafeArrays

julia> @btime f($a, $b) # f, g, a, and b defined as in the opening post
  2.536 μs (300 allocations: 14.06 KiB)

julia> @btime g($a, $b)
  2.560 μs (300 allocations: 14.06 KiB)

julia> function h(a, b)
           for i = 1:size(a, 1)
               a[i, :] = uview(b, i, :)
           end
       end
h (generic function with 1 method)

julia> @btime h($a, $b)
  1.863 μs (0 allocations: 0 bytes)

julia> h3(a, b) = (a .= b)
h3 (generic function with 1 method)

julia> @btime h3($a, $b);
  48.704 ns (0 allocations: 0 bytes)
```

So even the unsafe views are still slower than what is possible (compare the broadcast copy `h3`).
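As a side note (the function name `h4` is mine, not from the thread): for a plain whole-array copy, `copyto!` from Base is another zero-allocation option, equivalent in effect to the `a .= b` broadcast above.

```julia
function h4(a, b)
    copyto!(a, b)  # element-wise copy of b into a, no intermediate array
    return a
end

a = zeros(300, 1)
b = rand(300, 1)
h4(a, b)
a == b  # true
```

Like the broadcast version, this sidesteps slicing entirely, which is why it avoids the per-row allocations.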

Hello.

Well, I cannot resist, although I fear it is not too helpful. I am not sure what the compiler does, but using StaticArrays gives

```julia
julia> using StaticArrays

julia> A = @MMatrix zeros(300, 1);

julia> B = @MMatrix zeros(300, 1);

julia> @btime fast_func($A, $B)
  33.169 ns (0 allocations: 0 bytes)

julia> @btime slow_func($A, $B)
  32.273 ns (0 allocations: 0 bytes)

julia> @btime f($A, $B)
  113.117 ns (0 allocations: 0 bytes)

julia> @btime g($A, $B)
  113.197 ns (0 allocations: 0 bytes)
```

And that with a size which is bigger than the recommended maximum of 100 elements, so I assume the example is too trivial? But yes, I have noticed that writing f77-style loops is often not a bad idea in Julia, although it can be detrimental to readability, so it has to be used with discretion.

Best Regards

I think row-by-row copying is slow anyway because of cache misses (Julia arrays are column-major). It may be that row views cannot be efficiently elided.

```julia
julia> using BenchmarkTools

julia> function fcolumn(a, b)
           for i = 1:size(a, 2)
               a[:, i] = view(b, :, i)
           end
       end

julia> @btime fcolumn($a, $b)
  374.663 ns (1 allocation: 48 bytes)

julia> @btime f($a, $b)
  4.002 μs (300 allocations: 14.06 KiB)
```

Right! But still (please note the changed size of `A` and `B`):

```julia
julia> function r_fast_func(a, b)
           for i = 1:size(a, 1)
               for j = 1:size(a, 2)
                   a[i, j] = b[i, j]
               end
           end
       end

julia> function c_fast_func(a, b)
           for j = 1:size(a, 2)
               for i = 1:size(a, 1)
                   a[i, j] = b[i, j]
               end
           end
       end

julia> B = zeros(300, 1);

julia> A = zeros(300, 1);

julia> @btime r_fast_func($A, $B)
  337.768 ns (0 allocations: 0 bytes)

julia> @btime c_fast_func($A, $B)
  172.458 ns (0 allocations: 0 bytes)

julia> @btime fcolumn($A, $B)
  344.550 ns (1 allocation: 48 bytes)

julia> A = zeros(300, 300);

julia> B = zeros(300, 300);

julia> @btime r_fast_func($A, $B)
  142.011 μs (0 allocations: 0 bytes)

julia> @btime c_fast_func($A, $B)
  50.468 μs (0 allocations: 0 bytes)

julia> @btime fcolumn($A, $B)
  101.129 μs (300 allocations: 14.06 KiB)
```

I think 344 ns × 300 ≈ 103 μs, so that is consistent at least.

I see. It looks like the `setindex!` method for views is somewhat inefficient, and it does not get any better with `@inbounds` either.

Thank you everyone for the answers so far. `UnsafeArrays`, `StaticArrays`, and iterating in column order are all good options for eliminating allocations.

I still don’t understand why using `view` results in the extra allocations, and I don’t think it should. We should be able to slice an array (even a complicated, inefficient slice), and assign each of its elements into an existing array, without allocating an intermediate array. Maybe this isn’t a bug with `view` per se, but it seems like an optimization that the compiler should be able to make.

That said, I have benchmarked the equivalents of `fast_func` and `slow_func` in my application, which processes very large arrays, and although `slow_func` performs O(N) allocations while `fast_func` performs O(1), `fast_func` isn’t actually faster; they take the same amount of time. So maybe there is a good reason for this behavior that has not been mentioned yet, or for some reason the “optimization” just isn’t necessary.


Wild speculation: the examples above surely fit into the processor’s L2/L3 caches, possibly even L1. If you overflow the cache and have to fetch from main RAM, I would not be surprised by qualitative changes (also depending on the processor’s prefetcher logic).

Well, maybe someone with actual background knowledge will add a comment later, I hope. I also found the behavior of views with small matrices a bit confusing.


Thank you!