Performance differences of contiguous vs non-contiguous column indexing?

mkarikom · January 12, 2021, 6:25pm

A quick test shows a 15% decrease in mean run time for the non-contiguous case, which seems huge. Why would this happen?

using Random,BenchmarkTools,LinearAlgebra
rng = MersenneTwister(1234);

# data
a = rand(rng,1000,10000)
ind = [mod(i,2)+1 for i in 1:100]
ind = [findall(x->x==i,ind) for i in 1:2]
a_shuffle = [a[:,ind[1]] a[:,ind[2]]]

function redparts(x,i1,i2)
    dot(sum(x[:,i1],dims=2),sum(x[:,i2],dims=2))
end

non-contiguous indices

julia> @benchmark redparts(a,ind[1],ind[2])
BenchmarkTools.Trial: 
  memory estimate:  797.45 KiB
  allocs estimate:  15
  --------------
  minimum time:     58.690 μs (0.00% GC)
  median time:      64.291 μs (0.00% GC)
  mean time:        86.662 μs (22.37% GC)
  maximum time:     2.861 ms (96.42% GC)
  --------------
  samples:          10000
  evals/sample:     1

contiguous indices

julia> @benchmark redparts(a_shuffle,1:50,51:100)
BenchmarkTools.Trial: 
  memory estimate:  797.45 KiB
  allocs estimate:  15
  --------------
  minimum time:     56.980 μs (0.00% GC)
  median time:      62.941 μs (0.00% GC)
  mean time:        101.624 μs (22.53% GC)
  maximum time:     3.916 ms (97.64% GC)
  --------------
  samples:          10000
  evals/sample:     1

lmiq · January 12, 2021, 8:01pm

That is expected, because the access to memory is not contiguous. It can get much worse:

julia> x = rand(100_000);

julia> y = @view x[1:10_000];

julia> @btime sum($y)
  944.154 ns (0 allocations: 0 bytes)
5030.276331565986

julia> y = @view x[rand(1:100_000,10_000)];

julia> @btime sum($y)
  12.243 μs (0 allocations: 0 bytes)
4984.892231863414

julia>

mbauman · January 12, 2021, 9:23pm

There are two things that cause this:

Memory latency (it’s expensive to fetch an arbitrary element from memory, but roughly free to get the next one)
Indirection (not only do you need to fetch the element from the array in question, you also need to fetch the index itself and figure out where to go. When you use a range like 1:50 Julia can compute the locations directly instead of looking it up).

Note, though, that most of your time is spent copying the data into contiguous arrays — indexing like x[:,i1] makes a copy. This is a great place for views… and you can very clearly see the difference between latency (not much here — since it’s already column-major) and indirection (using 2:2:100 instead of ind[1] is not only simpler, it’s significantly faster).

julia> @benchmark redparts($a,$ind[1],$ind[2])
BenchmarkTools.Trial:
  memory estimate:  797.44 KiB
  allocs estimate:  14
  --------------
  minimum time:     101.507 μs (0.00% GC)
  median time:      421.714 μs (0.00% GC)
  mean time:        439.045 μs (12.93% GC)
  maximum time:     4.833 ms (90.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark redparts($a_shuffle,1:50,51:100)
BenchmarkTools.Trial:
  memory estimate:  797.44 KiB
  allocs estimate:  14
  --------------
  minimum time:     105.827 μs (0.00% GC)
  median time:      426.267 μs (0.00% GC)
  mean time:        444.088 μs (13.03% GC)
  maximum time:     3.946 ms (88.15% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function redparts_view(x,i1,i2)
           @views dot(sum(x[:,i1],dims=2),sum(x[:,i2],dims=2))
       end
redparts_view (generic function with 1 method)

julia> @benchmark redparts_view($a,$ind[1],$ind[2])
BenchmarkTools.Trial:
  memory estimate:  16.03 KiB
  allocs estimate:  10
  --------------
  minimum time:     58.979 μs (0.00% GC)
  median time:      70.947 μs (0.00% GC)
  mean time:        72.124 μs (1.62% GC)
  maximum time:     5.367 ms (98.58% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark redparts_view($a_shuffle, 1:50, 51:100)
BenchmarkTools.Trial:
  memory estimate:  16.03 KiB
  allocs estimate:  10
  --------------
  minimum time:     15.081 μs (0.00% GC)
  median time:      17.178 μs (0.00% GC)
  mean time:        19.884 μs (4.64% GC)
  maximum time:     3.478 ms (99.07% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark redparts_view($a, 2:2:100, 1:2:100)
BenchmarkTools.Trial:
  memory estimate:  16.03 KiB
  allocs estimate:  10
  --------------
  minimum time:     16.211 μs (0.00% GC)
  median time:      17.288 μs (0.00% GC)
  mean time:        20.286 μs (3.94% GC)
  maximum time:     3.295 ms (98.99% GC)
  --------------
  samples:          10000
  evals/sample:     1

Topic		Replies	Views
Contiguous setindex involving an adjoint is slower than non-contiguous for a Matrix General Usage question , indexing	5	387	May 25, 2021
View or copy: looking for an overview Performance views	8	577	June 9, 2022
Counter-intuitive performance difference Performance question	7	328	March 30, 2025
Performance difference when accessing square matrix rows-first or cols-first Performance	14	1730	April 13, 2021
Why the performance varies greatly after @view? Performance performance	2	498	October 27, 2021

Performance differences of contiguous vs non-contiguous column indexing?

non-contiguous indices

contiguous indices

Related topics