Performance differences of contiguous vs non-contiguous column indexing?

A quick test shows a 15% decrease in mean run time for the non-contiguous case, which seems huge. Why would this happen?

using Random,BenchmarkTools,LinearAlgebra
rng = MersenneTwister(1234);

# data
a = rand(rng,1000,10000)
ind = [mod(i,2)+1 for i in 1:100]
ind = [findall(x->x==i,ind) for i in 1:2]
a_shuffle = [a[:,ind[1]] a[:,ind[2]]]

function redparts(x,i1,i2)
    dot(sum(x[:,i1],dims=2),sum(x[:,i2],dims=2))
end

non-contiguous indices

julia> @benchmark redparts(a,ind[1],ind[2])
BenchmarkTools.Trial: 
  memory estimate:  797.45 KiB
  allocs estimate:  15
  --------------
  minimum time:     58.690 μs (0.00% GC)
  median time:      64.291 μs (0.00% GC)
  mean time:        86.662 μs (22.37% GC)
  maximum time:     2.861 ms (96.42% GC)
  --------------
  samples:          10000
  evals/sample:     1

contiguous indices

julia> @benchmark redparts(a_shuffle,1:50,51:100)
BenchmarkTools.Trial: 
  memory estimate:  797.45 KiB
  allocs estimate:  15
  --------------
  minimum time:     56.980 μs (0.00% GC)
  median time:      62.941 μs (0.00% GC)
  mean time:        101.624 μs (22.53% GC)
  maximum time:     3.916 ms (97.64% GC)
  --------------
  samples:          10000
  evals/sample:     1

That is expected, because the access to memory is not contiguous. It can get much worse:

julia> x = rand(100_000);

julia> y = @view x[1:10_000];

julia> @btime sum($y)
  944.154 ns (0 allocations: 0 bytes)
5030.276331565986

julia> y = @view x[rand(1:100_000,10_000)];

julia> @btime sum($y)
  12.243 μs (0 allocations: 0 bytes)
4984.892231863414

julia>
4 Likes

There are two things that cause this:

  • Memory latency (it’s expensive to fetch an arbitrary element from memory, but roughly free to get the next one)
  • Indirection (not only do you need to fetch the element from the array in question, you also need to fetch the index itself and figure out where to go. When you use a range like 1:50 Julia can compute the locations directly instead of looking it up).

Note, though, that most of your time is spent copying the data into contiguous arrays — indexing like x[:,i1] makes a copy. This is a great place for views… and you can very clearly see the difference between latency (not much here — since it’s already column-major) and indirection (using 2:2:100 instead of ind[1] is not only simpler, it’s significantly faster).

julia> @benchmark redparts($a,$ind[1],$ind[2])
BenchmarkTools.Trial:
  memory estimate:  797.44 KiB
  allocs estimate:  14
  --------------
  minimum time:     101.507 μs (0.00% GC)
  median time:      421.714 μs (0.00% GC)
  mean time:        439.045 μs (12.93% GC)
  maximum time:     4.833 ms (90.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark redparts($a_shuffle,1:50,51:100)
BenchmarkTools.Trial:
  memory estimate:  797.44 KiB
  allocs estimate:  14
  --------------
  minimum time:     105.827 μs (0.00% GC)
  median time:      426.267 μs (0.00% GC)
  mean time:        444.088 μs (13.03% GC)
  maximum time:     3.946 ms (88.15% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function redparts_view(x,i1,i2)
           @views dot(sum(x[:,i1],dims=2),sum(x[:,i2],dims=2))
       end
redparts_view (generic function with 1 method)

julia> @benchmark redparts_view($a,$ind[1],$ind[2])
BenchmarkTools.Trial:
  memory estimate:  16.03 KiB
  allocs estimate:  10
  --------------
  minimum time:     58.979 μs (0.00% GC)
  median time:      70.947 μs (0.00% GC)
  mean time:        72.124 μs (1.62% GC)
  maximum time:     5.367 ms (98.58% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark redparts_view($a_shuffle, 1:50, 51:100)
BenchmarkTools.Trial:
  memory estimate:  16.03 KiB
  allocs estimate:  10
  --------------
  minimum time:     15.081 μs (0.00% GC)
  median time:      17.178 μs (0.00% GC)
  mean time:        19.884 μs (4.64% GC)
  maximum time:     3.478 ms (99.07% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark redparts_view($a, 2:2:100, 1:2:100)
BenchmarkTools.Trial:
  memory estimate:  16.03 KiB
  allocs estimate:  10
  --------------
  minimum time:     16.211 μs (0.00% GC)
  median time:      17.288 μs (0.00% GC)
  mean time:        20.286 μs (3.94% GC)
  maximum time:     3.295 ms (98.99% GC)
  --------------
  samples:          10000
  evals/sample:     1
2 Likes