View or copy: looking for an overview

Paul_Soderlind · June 9, 2022, 12:15pm

I am looking for an overview/recommendations on the classical view or copy issue. Any links?

Background: even in a simple case like

sum(x[ind]) versus sum(view(x,ind)),

the former seems to be faster in many cases (eg. when ind is a BitVector or a StepRange), except when ind is a UnitRange. Replace sum with something more complicated and it becomes even clearer.

heliosdrm · June 9, 2022, 12:23pm

The basic recommendation on this topic in the Julia Manual is here:

https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-views

jishnub · June 9, 2022, 12:23pm

The difference is that the slice will likely be contiguous in memory, but the view may be non-contiguous. Operations involving the former will likely have fewer cache misses. However, this has to be balanced against the time required to create the slice in the first place. In many applications, using a view will be faster, especially if the resulting view is strided or contiguous in memory.

julia> x = rand(1000);

julia> inds = 300:800;

julia> using BenchmarkTools

julia> @btime sum($x[$inds]);
  650.846 ns (1 allocation: 4.06 KiB)

julia> @btime sum(@view $x[$inds]);
  67.950 ns (0 allocations: 0 bytes)

julia> inds = 300:10:800;

julia> @btime sum($x[$inds]);
  137.235 ns (1 allocation: 496 bytes)

julia> @btime sum(@view $x[$inds]);
  60.803 ns (0 allocations: 0 bytes)

Paul_Soderlind · June 9, 2022, 1:05pm

Thanks, but these results change if I instead do

N = length(x)
inds = rand(1:N,cld(N,2))     #random indices
or 
inds = 1:N .> cld(N,2)          #BitVector

So, it would be interesting to read up on this. Maybe someone has tried to disentangle the various factors?

jakobnissen · June 9, 2022, 1:16pm

If the view is known to be contiguous at compile time, then the view ought to be faster than the copy. However, see `iterate` on `SubArray` uses slow default fallback · Issue #43295 · JuliaLang/julia · GitHub .

jishnub · June 9, 2022, 2:05pm

Random indices are terrible for cache locality, so it’s not that surprising that views with random indices would do badly. views should be excellent when the indices are ranges, but maybe not as good when the indices are vectors.

Paul_Soderlind · June 9, 2022, 7:12pm

Thanks for the inputs. I believe the Performance Tips in the manual might need an update. As of now, the pros and cons of view() are not very clear. Also, the often mentioned (at Discourse) recommendation of using view() should perhaps be qualified.

Perhaps the relative merits of copies/views depend on OS/hardware (cache size, latency etc), but this is what I understand from some simple calculations on an old Win10 notebook:

Views can improve performance when the indices are contiguous and supplied as a UnitRange (and perhaps a StepRange). In most other cases, a copy performs better.

An illustration of these claims:

N = 8_000
x = rand(N)

f1(x,inds) = sum(x[inds])
f2(x,inds) = sum(view(x,inds))

println("vector of length $N, 1st is 'x[inds]', 2nd is view(x,inds)")

println("\ninds is UnitRange")
inds = 1:cld(N,2)
@btime f1($x,$inds)
@btime f2($x,$inds)

println("\ninds is StepRange")
inds = 1:10:N
@btime f1($x,$inds)
@btime f2($x,$inds)

println("\ninds is a contiguous vector")
inds = collect(1:cld(N,2))
@btime f1($x,$inds)
@btime f2($x,$inds)

println("\ninds is a vector of random indices")
inds = rand(1:N,cld(N,2))
@btime f1($x,$inds)
@btime f2($x,$inds)

println("\ninds is a BitVector")
inds = 1:N .> cld(N,2)
@btime f1($x,$inds)
@btime f2($x,$inds)

which gives (on my slow Win10 notebook):

vector of length 8000, 1st is 'x[inds]', 2nd is view(x,inds)  
                                                              
inds is UnitRange                                             
  2.200 μs (2 allocations: 31.30 KiB)                         
  350.237 ns (0 allocations: 0 bytes)                         
                                                              
inds is StepRange                                             
  970.833 ns (1 allocation: 6.38 KiB)                         
  958.333 ns (0 allocations: 0 bytes)                         
                                                              
inds is a contiguous vector                                   
  4.586 μs (2 allocations: 31.30 KiB)                         
  5.917 μs (0 allocations: 0 bytes)                           
                                                              
inds is a vector of random indices                            
  4.833 μs (2 allocations: 31.30 KiB)                         
  5.917 μs (0 allocations: 0 bytes)                           
                                                              
inds is a BitVector                                           
  3.900 μs (2 allocations: 31.30 KiB)                         
  7.400 μs (2 allocations: 31.30 KiB)

mcabbott · June 9, 2022, 8:15pm

Beware that comparing variants with differing allocations by looking at the minimum time is a bit misleading. Assuming that ultimately you will do this in a loop, many times, you will eventually have to wait for the garbage collector, and a fairer comparison is to amortise its cost over the individual calls.

With your code above, on my computer, the differences are not huge, but they do reverse the conclusions of most:

inds is StepRange
  min 988.636 ns, mean 1.879 μs (1 allocation, 6.38 KiB)
  min 1.137 μs, mean 1.158 μs (0 allocations)

inds is a contiguous vector
  min 5.743 μs, mean 9.463 μs (2 allocations, 31.30 KiB)
  min 7.375 μs, mean 7.470 μs (0 allocations)

inds is a vector of random indices
  min 5.854 μs, mean 9.115 μs (2 allocations, 31.30 KiB)
  min 7.375 μs, mean 7.551 μs (0 allocations)

inds is a BitVector
  min 8.250 μs, mean 12.155 μs (2 allocations, 31.30 KiB)
  min 12.333 μs, mean 16.671 μs (2 allocations, 31.30 KiB)

(The mean is of course noisier than the minimum, as things unrelated to the benchmark may occasionally interrupt the work. But still more predictive of final runtime.)

(Edit: In this case, comparing the median also shows a slowdown, maybe half the effect seen by the mean. So it’s not just the garbage collector, as that’s typically < 1% of runs.)

Note also that this is specifically about sum(view(...)). If the next function is something else (such as matrix multiplication) it may care much more about getting a Vector (or a sufficiently simple view).

Paul_Soderlind · June 9, 2022, 8:29pm

Thanks. Very interesting.

SInce the garbage collector is involved, I guess the length of the vector will also matter. So there is also that factor to consider (in case anyone would like to do a writeup.)

Indeed, this is specific to sum. It’s probably a generous case for view.

Topic		Replies	Views
When should you use views? Performance	9	2948	June 10, 2019
Counter-intuitive performance difference Performance question	7	329	March 30, 2025
Making views the default indexing behavior in 2.0? Internals & Design speculative	11	1723	October 21, 2019
Preallocating the result of getindex(...) General Usage question	8	1389	July 13, 2017
Performance of views vs. explicit index referencing Performance	2	501	July 7, 2020

View or copy: looking for an overview

Related topics