# Accessing array elements: splatting vs. writing them out explicitly

Dear Julia experts. Another newbie question.
I tried the following (equivalent?) ways of accessing (and operating on) array elements

``````
latt[x..., d]  vs  latt[x[1], x[2], x[3], d]
``````

and found a big difference in performance. I would like to understand why this is so, and to pick up some general tips for speeding things up. Thank you for your help!

``````
function test1()

N = 250
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

links = Iterators.product(1:N, 1:N, 1:N, 1:3)
latt1 = ones(Int64, (N, N, N, 3))

obs = 0.0
for d = 1:3, i3 = 1:N, i2 = 1:N, i1 = 1:N
x = [i1, i2, i3]
mvup!(x, d)

# slow
# obs += latt1[i1, i2, i3, d] * latt1[x..., d]
# fast
obs += latt1[i1, i2, i3, d] * latt1[x[1], x[2], x[3], d]
end
end

function test2()

N = 250
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

links = Iterators.product(1:N, 1:N, 1:N, 1:3)
latt = ones(Int64, (N, N, N, 3))

obs = 0.0
for link in links
x = [link[1], link[2], link[3]]
d = link[4]
mvup!(x, d)

# slow
# obs += latt[link...] * latt[x..., d]
# fast
obs += latt[link...] * latt[x[1], x[2], x[3], d]
end
end

@time test1()
@time test2()

#=
result:
slow:
12.540900 seconds (187.50 M allocations: 5.937 GiB, 7.50% gc time, 0.06% compilation time)
14.356129 seconds (187.50 M allocations: 5.937 GiB, 7.54% gc time)
fast:
3.826308 seconds (46.88 M allocations: 3.842 GiB, 17.36% gc time)
3.617770 seconds (46.88 M allocations: 3.842 GiB, 18.24% gc time)
=#
``````

My guess is that splatting is slow because your vector `x` does not necessarily have a fixed length at runtime, and the compiler can't optimize around that.

In general, you want to minimize "context-dependent instructions," if that's the right term. While you might always intend `x` to have length 3, the code `latt[x..., d]` leaves open the possibility that `x` has a different length at runtime, whereas `latt[x[1], x[2], x[3], d]` gives a specific instruction about which data to access that doesn't depend on the length of `x` at runtime.
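A quick way to see this (my own illustration, not code from the thread) is to compare splatting a `Tuple`, whose length is part of its type, with splatting a `Vector`, whose length is only known at runtime:

```julia
# Illustration: the compiler knows a Tuple's length from its type (NTuple{2,Int}),
# but a Vector{Int}'s length only at runtime, so the vector splat has to go
# through dynamic dispatch and allocates.
A = rand(4, 4, 4)

f_tuple(A, x)  = A[x..., 1]   # x is a Tuple  -> arity of getindex known at compile time
f_vector(A, x) = A[x..., 1]   # x is a Vector -> arity only known at runtime

f_tuple(A, (2, 3)); f_vector(A, [2, 3])   # warm up (compile both methods)

@show @allocated f_tuple(A, (2, 3))    # expect 0 bytes
@show @allocated f_vector(A, [2, 3])   # nonzero: the Vector itself, plus the dynamic splat
```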


You should probably use `CartesianIndex` for this. Cartesian indices are fixed-length, and can be used directly in place of some or all of the indices without the need for splatting.
More importantly, they are also immutable, which means they require no allocation, and may also open up more compiler optimizations.
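A tiny check of the no-allocation claim (my own snippet, not from the thread):

```julia
# A CartesianIndex{3} can stand in for the first three indices of a 4-D array,
# and indexing through it allocates nothing, since it is an immutable struct.
A = rand(3, 3, 3, 3)
ci = CartesianIndex(1, 2, 3)

g(A, ci, d) = A[ci, d]        # ci fills the first three index slots
g(A, ci, 1)                   # warm up (compile)

@show @allocated g(A, ci, 1)  # expect 0 bytes
```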
Some benchmarks (`CartesianIndex` used in `test3` below):

``````
function test1(latt1)
N = size(latt1,1)
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

obs = 0.0
for d = 1:3, i3 = 1:N, i2 = 1:N, i1 = 1:N
x = [i1, i2, i3]
mvup!(x, d)

obs += latt1[i1, i2, i3, d] * latt1[x[1], x[2], x[3], d]
end
obs
end

function test2(latt1)
N = size(latt1,1)
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

links = Iterators.product(1:N, 1:N, 1:N, 1:3)

obs = 0.0
for link in links
x = [link[1], link[2], link[3]]
d = link[4]
mvup!(x, d)

obs += latt1[link...] * latt1[x[1], x[2], x[3], d]
end
obs
end

function test3(latt1)
N = size(latt1,1)
mvup(x::CartesianIndex, d) = CartesianIndex(ntuple(i -> i==d ? mod1(x[i]+1,N) : x[i], length(x)))

obs = 0.0
for d = 1:3, x in CartesianIndex(1, 1, 1):CartesianIndex(N,N,N)
obs += latt1[x, d] * latt1[mvup(x, d), d]
end
obs
end
``````
``````
julia> @benchmark test1(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.444 s …   3.508 s  ┊ GC (min … max): 13.96% … 18.72%
 Time  (median):     3.476 s              ┊ GC (median):    16.36%
 Time  (mean ± σ):   3.476 s ± 45.332 ms  ┊ GC (mean ± σ):  16.36% ±  3.37%

  3.44 s         Histogram: frequency by time        3.51 s <

 Memory estimate: 4.89 GiB, allocs estimate: 46875000.

julia> @benchmark test2(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.363 s …   3.387 s  ┊ GC (min … max): 19.70% … 14.64%
 Time  (median):     3.375 s              ┊ GC (median):    17.16%
 Time  (mean ± σ):   3.375 s ± 17.231 ms  ┊ GC (mean ± σ):  17.16% ±  3.58%

  3.36 s         Histogram: frequency by time        3.39 s <

 Memory estimate: 4.89 GiB, allocs estimate: 46875000.

julia> @benchmark test3(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  389.657 ms … 416.891 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     397.419 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   400.080 ms ±   9.444 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  390 ms           Histogram: frequency by time          417 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

A much more flexible method to implement periodic boundary conditions (which I guess is what you are doing) is to use "ghost cells". See e.g. Finite difference Laplacian with five-point stencil - #2 by stevengj
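To sketch the idea (my own 1-D illustration; names like `fill_ghosts!` are made up, not from the linked post): pad the array with one extra cell on each side, copy the opposite boundary into those ghost cells once per sweep, and then the inner loop can use plain `i+1` / `i-1` with no `mod1` and no branch.

```julia
# Periodic boundaries via ghost cells, 1-D sketch (illustrative names).
N = 250
padded = zeros(N + 2)              # interior lives in 2:N+1; cells 1 and N+2 are ghosts
padded[2:N+1] .= rand(N)

function fill_ghosts!(p)
    p[1]   = p[end-1]              # left ghost  <- last interior cell
    p[end] = p[2]                  # right ghost <- first interior cell
    return p
end

fill_ghosts!(padded)

# A sweep over the interior now needs neither mod1 nor a branch:
obs = 0.0
for i in 2:N+1
    obs += padded[i] * padded[i+1] # i+1 is always in bounds thanks to the ghost cell
end
```

The cost is one boundary copy per sweep, in exchange for a branch-free inner loop.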

This is allocating a new array on every loop iteration, which will be slow. I second the suggestion to use `CartesianIndices` here; that's what they are for. (You could also use StaticArrays.jl.) See Base.Cartesian · The Julia Language


Thanks to jeffreyesun, yha, and stevengj for the helpful answers. This thing is now insanely fast!

One thing I have learned here is that tuples (which can come from elements of an iterator, or from a `CartesianIndex`) are fast! And building a new tuple is better than mutating an array when the result is used to index into an array. Thanks yha for the tip on "changing" an immutable!

Also, there is no problem with tuple + splatting. (Now it seems it's a matter of style?)

``````
function test3(latt1)
N = size(latt1,1)
mvup(x::CartesianIndex, d) = CartesianIndex(ntuple(i -> i==d ? mod1(x[i]+1,N) : x[i], length(x)))

obs = 0.0
for d = 1:3, x in CartesianIndex(1, 1, 1):CartesianIndex(N,N,N)
obs += latt1[x, d] * latt1[mvup(x, d), d]
end
obs/N^2
end

function test3b(latt1)
N = size(latt1,1)
mvup(x::CartesianIndex, d) = CartesianIndex(ntuple(i -> i==d ? mod1(x[i]+1,N) : x[i], length(x)))

function mvup2(x::CartesianIndex, d)
function _f(i1::Int)
if i1 != d
return x[i1]
elseif i1 == d && x[i1] != N
return x[i1] + 1
else
return 1
end
end
return CartesianIndex(ntuple(_f, 3))
end

obs = 0.0
for d = 1:3, x in CartesianIndex(1, 1, 1):CartesianIndex(N,N,N)
obs += latt1[x, d] * latt1[mvup2(x, d), d]
end
obs/N^2
end

function test4(latt1)
N = size(latt1,1)
links = Iterators.product(1:N, 1:N, 1:N, 1:3)

function mvup(x, d)
function _f(i1::Int)
if i1 != d
return x[i1]
elseif i1 == d && x[i1] != N
return x[i1] + 1
else
return 1
end
end
return ntuple(_f, 3)
end

obs = 0.0
for link in links
x = link[1:3]
d = link[4]
obs += latt1[link...] * latt1[mvup(x, d)..., d]
end
obs/N^2
end
``````

and I get

``````
julia> @benchmark test3(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  705.676 ms … 766.002 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     732.916 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   735.877 ms ±  22.589 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  706 ms           Histogram: frequency by time          766 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test3b(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  259.175 ms … 427.390 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     322.829 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   321.440 ms ±  52.464 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  259 ms           Histogram: frequency by time          427 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test4(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  356.814 ms … 423.293 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     363.248 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   374.044 ms ±  22.829 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  357 ms           Histogram: frequency by time          423 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

One small thing I found is that writing out the mvup / mvdown branches explicitly (stupidly) seems to be slightly faster than using `mod1`. I will look at the "ghost cells" approach stevengj suggested.

Thank you all again for the amazing comments. This is one of the few times I see my Julia code beating my Fortran code :D.