Array views becoming dominant source of memory allocation


#1

Hi All,

I need to process 2D matrix data in short blocks, for example:

function times_one!(x::SubArray{Float64,2})
  @. x= 1.0*x
end

function test(dim_long,dim_short,step)
   x= rand(dim_long,dim_short)
   @time for i = 1:step:dim_long
      times_one!(view(x,i:(i+step-1),:))
   end
end

A few test results:

julia> test(8192,4,2);test(8192,4,64)
         0.000271 seconds (4.10 k allocations: 256.000 KiB)
         0.000030 seconds (128 allocations: 8.000 KiB)

It seems that the memory usage reported by @time is inversely proportional to the step size, and this is the dominant effect on the run-time. Whilst these numbers and functions are of course only an example, this looks to be the dominant performance limitation in my code base.

What are some possible workarounds? I can’t avoid calling my library in small blocks.

Kind Regards,
Peter.


Proposal for a movable SubArray
#2

x = 1.0*x is (a) not type-stable and (b) allocates a whole new array for the result. Try operating in-place with @. x = 1.0*x.
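To illustrate the difference (a minimal sketch; I use 2.0 instead of 1.0 so the mutation is visible, and the function names are mine):

```julia
# Rebinding: 2.0 * x allocates a fresh array and rebinds the *local*
# name x; the caller's array is never touched.
function times_two(x)
    x = 2.0 * x
    return nothing
end

# In-place: @. x = 2.0 * x lowers to x .= 2.0 .* x, which writes into
# the existing storage without allocating a result array.
function times_two!(x)
    @. x = 2.0 * x
    return nothing
end

a = [1.0, 2.0]
times_two(a)    # a is still [1.0, 2.0]
times_two!(a)   # a is now [2.0, 4.0]
```

Only the in-place version mutates through a view, which is why the distinction matters here.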


#3

Let me rearrange your initial code:

function times_one(x::SubArray{Float64,2})
    x = 1.0 * x
end

function test(x, step)
    for i = 1:step:size(x)[1]
        times_one(view(x,i:(i+step-1),:))
    end
end

Benchmark:

julia> using BenchmarkTools

julia> @benchmark test($(rand(8192, 4)), 4)
BenchmarkTools.Trial: 
  memory estimate:  544.00 KiB
  allocs estimate:  4096
  --------------
  minimum time:     122.094 μs (0.00% GC)
  median time:      128.852 μs (0.00% GC)
  mean time:        179.103 μs (24.35% GC)
  maximum time:     2.256 ms (88.58% GC)
  --------------
  samples:          10000
  evals/sample:     1

Try this version (note that rows and columns are swapped, but I don’t know if you can do that in your code):

function times_one!(x::SubArray{Float64,2})
    x .= 1.0 .* x
end

function test(x, step)
    for i = 1:step:size(x)[2]
        times_one!(view(x,:,i:(i+step-1)))
    end
end

Benchmark:

julia> @benchmark test($(rand(4, 8192)), 4)
BenchmarkTools.Trial: 
  memory estimate:  128.00 KiB
  allocs estimate:  2048
  --------------
  minimum time:     73.214 μs (0.00% GC)
  median time:      77.109 μs (0.00% GC)
  mean time:        90.426 μs (11.57% GC)
  maximum time:     2.248 ms (90.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

Edit: I’m not sure the second version gives the same results with just the axes swapped :grimacing: However, the change to times_one! is sensible (the same as given in the answer above).


#4

Hi stevengj/giordano,

The @. was included in the benchmarks I ran; I accidentally deleted it when trying (and failing) to quote the code properly.

I am aware that it is likely better to transpose the matrix and swap the view directions; nevertheless, it does not change my benchmarks for this example.

Apologies.


#5

I confirm that my second version of the code does the same as your code with just rows and columns swapped. I was using different random matrices to compare the results, so of course they didn’t match.


#6

Ultimately, if your inner loop calls lots of cheap operations on little blocks, like times_one!, then for best performance you need to inline the operation and not create a view at all. Just write an explicit loop over the little block.
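For the example in the original post, that might look like this (a sketch; the function name is mine, and I assume size(x, 1) is divisible by step as in the original):

```julia
# Same computation as times_one!(view(x, i:(i+step-1), :)), but written
# as a plain loop over the block's indices: no view object is
# constructed, so nothing allocates.
function test_loop!(x, step)
    for i in 1:step:size(x, 1)
        @inbounds for j in 1:size(x, 2), k in i:(i + step - 1)
            x[k, j] = 1.0 * x[k, j]
        end
    end
    return x
end
```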


#7

I ran into a similar issue when writing https://github.com/rdeits/NNLS.jl and ended up adapting some code by @tkoolen so that I could create unsafe but allocation-free views for use inside inner loops like this. I posted about it over at Marching cubes without extra allocations


#8

My attempt at speeding up such operations by using unsafe views failed at the speed of pointer-derefs for non-bitstypes; see https://discourse.julialang.org/t/speed-and-type-stability-of-unsafe-pointer-to-objref/6478.

The code there gets rid of the spurious allocations for the views, but pays too much on pointer-derefs.

This is somewhat similar to the other approach to unsafe views, but I changed the array type instead of mucking with SubArray. The code by @rdeits and @tkoolen is faster because you don’t pay overhead on unsafe_load / unsafe_store!.


#9

Yeah, I decided not to bother with unsafe views for non-bitstypes (since non-bits array elements ended up making other parts of the code slow enough that the views were not the dominant factor). See https://github.com/rdeits/NNLS.jl/blob/b7314fb9691a9d4ec9897316c8732f9aba94ed47/src/NNLS.jl#L224 (this code looks type-unstable, but the compiler is smart enough to evaluate isbits() at compile-time, so it’s actually inferrable).


#10

In this specific case, if you @inline times_one!, then the compiler is able to see that the view wrapper itself isn’t needed… and as a result there are no allocations at all. That might just be an artifact of how you minimized your example, though.
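Concretely, for the original example (a sketch; whether the view allocation is actually elided depends on the Julia version and optimizer):

```julia
# Marking the kernel @inline lets the compiler see that the SubArray
# wrapper never escapes the loop body, so its allocation can be removed.
@inline function times_one!(x::SubArray{Float64,2})
    @. x = 1.0 * x
end

function test(x, step)
    for i in 1:step:size(x, 1)
        times_one!(view(x, i:(i + step - 1), :))
    end
end
```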


#11

Honestly, I take the fact that views allocate (a tiny amount) as just a small language bug right now and wouldn’t worry too much about working around it. Do a workaround via a loop if you really need it, but I wouldn’t mangle any package code and would just wait for the 1.x release that “fixes” it.


#12

Yeah, I agree with that sentiment in general, but for my particular use case, the Views were the only objects in my entire loop that allocated, so eliminating that allocation at every iteration made a huge performance difference and was critical to getting my Julia code as fast as the Fortran code it was based on.


#13

I am not sure that this is fixable at all without changing the API. If we keep the API, then views must keep their parents alive and cannot be bitstypes, hence they need heap allocations. But sure, stack allocation/escape analysis can make the heap allocations rarer.

My personal minimally invasive library-level solution would be to just make views mutable. Escape analysis should still be able to stack-allocate views in inlined loops, and if you really care about the allocs you could use the usual pattern of passing a pre-allocated tmp_view around (one per thread*recursion-level) and doing view!(myview, array, ranges…) to reuse it, keeping myview = view(array, ranges…) for cases where you don’t care about the alloc.

The optimal maximally invasive solution would probably be to make immutable types behave more like bitstypes: That is, Vector{T} is guaranteed to be contiguous layout and not pointer-type whenever T is immutable, and (known type) immutable is guaranteed to never heap-allocate, regardless of its fields. Problem: very hard on the garbage collector.

To be honest, I don’t entirely understand the point of immutable non-bitstypes existing in the language at all, with today’s compiler. There are bitstypes and non-bitstypes; whether your non-bitstype is mutable appears not to make a lot of difference in the generated code (please correct me if I’m wrong!).


#14

Yes it is.

Yes.

No.


#15

I have faced this problem as well. You can decrease the allocs estimate to one with the following code:

function test2(x, step)
    v = view(x, :, 1:(1+step-1))
    for i = 1:step:size(x)[2]
        v .= view(x, :, i:(i+step-1))
        times_one!(v)
    end
end

Nevertheless, it is slower:

@benchmark test2($(rand(4, 8192)), 4)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  1
  --------------
  minimum time:     140.892 μs (0.00% GC)
  median time:      148.780 μs (0.00% GC)
  mean time:        149.943 μs (0.00% GC)
  maximum time:     331.402 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
@benchmark test($(rand(4, 8192)), 4)
BenchmarkTools.Trial: 
  memory estimate:  128.00 KiB
  allocs estimate:  2048
  --------------
  minimum time:     77.060 μs (0.00% GC)
  median time:      81.193 μs (0.00% GC)
  mean time:        85.780 μs (3.89% GC)
  maximum time:     798.825 μs (83.04% GC)
  --------------
  samples:          10000
  evals/sample:     1

Is there any way to define a view once and update it in place without taking a performance penalty like this?

I’ve faced this situation when implementing an iterative algorithm that loops over a big array many times. If we loop one million times, the code x = view(X,:,i) will create one million views. I’m not sure how to avoid this if I don’t create the view beforehand and update it in place (and if I do, I get a big performance hit).


#16

That’s not actually doing what you intend to do; it’s repeatedly overwriting the same region of the array…

I’ve done something similar in cases where this was in a super-hot loop. Of course, given who I am, I used a ton of the internal/unstable/undocumented indexing machinery to hack together a simple in-place getindex! function. If you have such a function then you can:

row = A[1, :] # create a temporary array of the right shape
for i in indices(A, 1)
    getindex!(row, A, i, :) # effectively row .= A[i, :]
    … = process(row)
end

Of course, if you want to also mutate A then you’d have to assign the row back into it. It still may be a net gain, especially if you gain cache-friendliness by copying the data and processing it in one contiguous chunk.

I’ve not submitted my getindex! for inclusion in base simply because it’s a new feature with a complicated signature… and I’ve been working on breaking changes instead.
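A plain-Julia sketch of such an in-place row copy (this is not the internals-based version described above; the name and signature are just for illustration):

```julia
# Copy row i of A into the preallocated vector dest, in place and
# without allocating anything.
function getindex!(dest::AbstractVector, A::AbstractMatrix, i::Integer, ::Colon)
    @inbounds for j in axes(A, 2)
        dest[j] = A[i, j]
    end
    return dest
end

A = rand(4, 3)
row = A[1, :]            # temporary of the right shape
getindex!(row, A, 2, :)  # row now holds the data of A[2, :]
```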


#17

> Is there any way to define a view once and update it in place without taking a performance penalty like this?
> I’ve faced this situation when implementing an iterative algorithm that loops over a big array many times. If we loop one million times, the code x = view(X,:,i) will create one million views. Not sure how to avoid this if I don’t create the view beforehand and update it in place (and if I do, I get a big performance hit).

I worked around this via creating a sliding vector type.
If there’s interest, I could separate the code out into a package and clean it up.

Example as is:

julia> using StaticArrays, BenchmarkTools

julia> import Base: getindex, view, setindex!

julia> include("...path/to/that/file") 

julia> function times_one!(x)
          @. x= 1.0*x
       end
times_one! (generic function with 1 method)

julia> function times_two!(x)
          @. x= 2.0*x
       end
times_two! (generic function with 1 method)

julia> step = 500;

julia> x = rand(500, 1000); xc = copy(x);

julia> sv = SlidingVec(xc, -step, step);

julia> svf = SlidingVecFun(() -> times_two!(sv), sv);

julia> test1 = svf();

julia> test2 = times_two!(view(x, :, 1));

julia> all(test1 .== test2)
true

julia> all(svf() .== times_two!(view(x, :, 2))) ### The svf object automatically slides after each call.
true

julia> all(svf() .== times_two!(view(x, :, 3)))
true

julia> all(svf() .== times_two!(view(x, :, 4)))
true

julia> function test_view(x, step)
           for i = 1:size(x,2)
               times_one!(view(x,:,i))
           end
       end
test_view (generic function with 1 method)

julia> function test_svf(x, step)
           sv = SlidingVec(x, -step, step)
           svf = SlidingVecFun(() -> times_one!(sv), sv)
           for i ∈ 1:size(x,2)
               svf()
           end
       end
test_svf (generic function with 1 method)

julia> @benchmark test_view($x, 500)
BenchmarkTools.Trial: 
  memory estimate:  46.88 KiB
  allocs estimate:  1000
  --------------
  minimum time:     11.977 μs (0.00% GC)
  median time:      14.722 μs (0.00% GC)
  mean time:        18.651 μs (15.71% GC)
  maximum time:     1.725 ms (93.56% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark test_svf($x, 500)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     8.292 μs (0.00% GC)
  median time:      9.006 μs (0.00% GC)
  mean time:        9.094 μs (0.00% GC)
  maximum time:     20.138 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

EDIT:
Note that because times_two! is mutating, the original parent array mutates.
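For readers without access to that file, here is a minimal, self-contained sketch of the idea (NOT the author’s actual SlidingVec/SlidingVecFun; the type name, fields, and slide! helper are my own illustration): a fixed-length window into a parent vector whose offset is mutated in place, so only one wrapper object is ever allocated.

```julia
# A fixed-length "sliding" window into a parent vector. Because the
# struct is mutable, slide! advances the offset in place instead of
# constructing a new wrapper for every block.
mutable struct SlidingWindow{T} <: AbstractVector{T}
    parent::Vector{T}
    offset::Int   # mutated by slide!
    len::Int
end

Base.size(s::SlidingWindow) = (s.len,)
Base.@propagate_inbounds Base.getindex(s::SlidingWindow, i::Int) = s.parent[s.offset + i]
Base.@propagate_inbounds Base.setindex!(s::SlidingWindow, v, i::Int) = (s.parent[s.offset + i] = v)

slide!(s::SlidingWindow) = (s.offset += s.len; s)

# Walk the columns of a matrix with a single window object.
x = rand(4, 8)
xc = copy(x)
w = SlidingWindow(vec(x), 0, 4)   # vec shares memory with x
for _ in 1:size(x, 2)
    @. w = 2.0 * w                # operate on the current block in place
    slide!(w)
end
```

After the loop, x has been doubled through the window, matching what repeated views of each column would have done.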


#18

@mbauman I still don’t fully understand why it’s not doing what I think it’s doing.

Let us consider the following example

I define an array X2

X = ones(10,5)
x = Array(1:10);
X2 = x.+X
10×5 Array{Float64,2}:
  2.0   2.0   2.0   2.0   2.0
  3.0   3.0   3.0   3.0   3.0
  4.0   4.0   4.0   4.0   4.0
  5.0   5.0   5.0   5.0   5.0
  6.0   6.0   6.0   6.0   6.0
  7.0   7.0   7.0   7.0   7.0
  8.0   8.0   8.0   8.0   8.0
  9.0   9.0   9.0   9.0   9.0
 10.0  10.0  10.0  10.0  10.0
 11.0  11.0  11.0  11.0  11.0

I do some computations using a view of each row

for i in 1:size(X2,1)
    v = view(X2,i,:)
    X2[i,:] += v
end
X2

since I add to each row a view of the same row I double the initial array

10×5 Array{Float64,2}:
  4.0   4.0   4.0   4.0   4.0
  6.0   6.0   6.0   6.0   6.0
  8.0   8.0   8.0   8.0   8.0
 10.0  10.0  10.0  10.0  10.0
 12.0  12.0  12.0  12.0  12.0
 14.0  14.0  14.0  14.0  14.0
 16.0  16.0  16.0  16.0  16.0
 18.0  18.0  18.0  18.0  18.0
 20.0  20.0  20.0  20.0  20.0
 22.0  22.0  22.0  22.0  22.0

I define again the arrays.

X = ones(10,5)
x = Array(1:10);
X2 = x.+X

Now I do the in-place version with a single view

v = view(X2,1,:)
for i in 1:size(X2,1)
    v .= view(X2,i,:)
    X2[i,:] += v
end
X2
10×5 Array{Float64,2}:
 11.0  11.0  11.0  11.0  11.0
  6.0   6.0   6.0   6.0   6.0
  8.0   8.0   8.0   8.0   8.0
 10.0  10.0  10.0  10.0  10.0
 12.0  12.0  12.0  12.0  12.0
 14.0  14.0  14.0  14.0  14.0
 16.0  16.0  16.0  16.0  16.0
 18.0  18.0  18.0  18.0  18.0
 20.0  20.0  20.0  20.0  20.0
 22.0  22.0  22.0  22.0  22.0

This gives an incorrect result for the first row, but the other rows seem correct.

It seems that with an auxiliary row vector I can solve this (though I now allocate a vector, which may be less than lots of views but is not optimal)

X = ones(10,5)
x = Array(1:10);
X2 = x.+X

v = zeros(size(X2, 2))
for i in 1:size(X2,1)
    v .= view(X2,i,:)
    X2[i,:] += v
end

#19

Yes, this causes the first row to be wrong because you’re repeatedly assigning into it. v .= … doesn’t change where the view points, it modifies the data that view is pointing to. That’s your first row.
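A tiny demonstration:

```julia
A = [1.0 2.0; 3.0 4.0]
v = view(A, 1, :)     # v aliases row 1 of A

v .= view(A, 2, :)    # copies row 2's *data* into row 1
A[1, :] == [3.0, 4.0] # true: row 1 was overwritten; v did not "move"

v = view(A, 2, :)     # rebinding the name is what actually re-points v
```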


#20

The code with row .= A[i, :] allocates memory for every i. If your getindex! avoids this issue, it should be in Julia Base!