# Accessing array elements: splatting vs. writing them out explicitly

Dear Julia experts. Another newbie question.
I tried the following (equivalent?) ways of accessing (and operating on) array elements

``````
latt[x..., d]  vs  latt[x[1], x[2], x[3], d]
``````

and found a big difference in performance. I would like to understand why this is so, and to pick up some general tips for speeding things up. Thank you for your help!

``````
function test1()

N = 250
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

links = Iterators.product(1:N, 1:N, 1:N, 1:3)
latt1 = ones(Int64, (N, N, N, 3))

obs = 0.0
for d = 1:3, i3 = 1:N, i2 = 1:N, i1 = 1:N
x = [i1, i2, i3]
mvup!(x, d)

# slow
# obs += latt1[i1, i2, i3, d] * latt1[x..., d]
# fast
obs += latt1[i1, i2, i3, d] * latt1[x[1], x[2], x[3], d]
end
end

function test2()

N = 250
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

links = Iterators.product(1:N, 1:N, 1:N, 1:3)
latt = ones(Int64, (N, N, N, 3))

obs = 0.0
for link in links
x = [link[1], link[2], link[3]]
d = link[4]
mvup!(x, d)

# slow
# obs += latt[link...] * latt[x..., d]
# fast
obs += latt[link...] * latt[x[1], x[2], x[3], d]
end
end

@time test1()
@time test2()

#=
result:
slow:
12.540900 seconds (187.50 M allocations: 5.937 GiB, 7.50% gc time, 0.06% compilation time)
14.356129 seconds (187.50 M allocations: 5.937 GiB, 7.54% gc time)
fast:
3.826308 seconds (46.88 M allocations: 3.842 GiB, 17.36% gc time)
3.617770 seconds (46.88 M allocations: 3.842 GiB, 18.24% gc time)
=#
``````

My guess is that splatting is slow because your vector `x` does not necessarily have a fixed length at runtime, and the compiler can't optimize around that.

In general, you want to minimize "context-dependent instructions," if that's the right term. While you might always intend `x` to have length 3, the code `latt[x..., d]` leaves open the possibility that `x` has a different length at runtime, whereas `latt[x[1], x[2], x[3], d]` gives a specific instruction about which data to access that doesn't depend on the length of `x` at runtime.
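A quick way to see this (my own illustration, not code from the thread) is to compare splatting a `Tuple`, whose length is part of its type, with splatting a `Vector`, whose length is only known at runtime:

```julia
# Illustration: the compiler knows a Tuple's length from its type (NTuple{2,Int}),
# but a Vector{Int}'s length only at runtime, so the vector splat has to go
# through dynamic dispatch and allocates.
A = rand(4, 4, 4)

f_tuple(A, x)  = A[x..., 1]   # x is a Tuple  -> arity of getindex known at compile time
f_vector(A, x) = A[x..., 1]   # x is a Vector -> arity only known at runtime

f_tuple(A, (2, 3)); f_vector(A, [2, 3])   # warm up (compile both methods)

@show @allocated f_tuple(A, (2, 3))    # expect 0 bytes
@show @allocated f_vector(A, [2, 3])   # nonzero: the Vector itself, plus the dynamic splat
```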


You should probably use `CartesianIndex` for this. Cartesian indices are fixed-length, and can be used directly in place of some or all of the indices without the need for splatting.
More importantly, they are also immutable, which means they require no allocation, and may also open up more compiler optimizations.
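A tiny check of the no-allocation claim (my own snippet, not from the thread):

```julia
# A CartesianIndex{3} can stand in for the first three indices of a 4-D array,
# and indexing through it allocates nothing, since it is an immutable struct.
A = rand(3, 3, 3, 3)
ci = CartesianIndex(1, 2, 3)

g(A, ci, d) = A[ci, d]        # ci fills the first three index slots
g(A, ci, 1)                   # warm up (compile)

@show @allocated g(A, ci, 1)  # expect 0 bytes
```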
Some benchmarks (`CartesianIndex` used in `test3` below):

``````
function test1(latt1)
N = size(latt1,1)
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

obs = 0.0
for d = 1:3, i3 = 1:N, i2 = 1:N, i1 = 1:N
x = [i1, i2, i3]
mvup!(x, d)

obs += latt1[i1, i2, i3, d] * latt1[x[1], x[2], x[3], d]
end
obs
end

function test2(latt1)
N = size(latt1,1)
mvup!(xvec, d) = xvec[d] == N ? xvec[d] = 1 : xvec[d] += 1

links = Iterators.product(1:N, 1:N, 1:N, 1:3)

obs = 0.0
for link in links
x = [link[1], link[2], link[3]]
d = link[4]
mvup!(x, d)

obs += latt1[link...] * latt1[x[1], x[2], x[3], d]
end
obs
end

function test3(latt1)
N = size(latt1,1)
mvup(x::CartesianIndex, d) = CartesianIndex(ntuple(i -> i==d ? mod1(x[i]+1,N) : x[i], length(x)))

obs = 0.0
for d = 1:3, x in CartesianIndex(1, 1, 1):CartesianIndex(N,N,N)
obs += latt1[x, d] * latt1[mvup(x, d), d]
end
obs
end
``````
``````
julia> @benchmark test1(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.444 s …   3.508 s  ┊ GC (min … max): 13.96% … 18.72%
 Time  (median):     3.476 s              ┊ GC (median):    16.36%
 Time  (mean ± σ):   3.476 s ± 45.332 ms  ┊ GC (mean ± σ):  16.36% ±  3.37%

  3.44 s         Histogram: frequency by time        3.51 s <

 Memory estimate: 4.89 GiB, allocs estimate: 46875000.

julia> @benchmark test2(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.363 s …   3.387 s  ┊ GC (min … max): 19.70% … 14.64%
 Time  (median):     3.375 s              ┊ GC (median):    17.16%
 Time  (mean ± σ):   3.375 s ± 17.231 ms  ┊ GC (mean ± σ):  17.16% ±  3.58%

  3.36 s         Histogram: frequency by time        3.39 s <

 Memory estimate: 4.89 GiB, allocs estimate: 46875000.

julia> @benchmark test3(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  389.657 ms … 416.891 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     397.419 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   400.080 ms ±   9.444 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  390 ms           Histogram: frequency by time          417 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

A much more flexible method to implement periodic boundary conditions (which I guess is what you are doing) is to use "ghost cells". See e.g. Finite difference Laplacian with five-point stencil - #2 by stevengj
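To sketch the idea (my own 1-D illustration; names like `fill_ghosts!` are made up, not from the linked post): pad the array with one extra cell on each side, copy the opposite boundary into those ghost cells once per sweep, and then the inner loop can use plain `i+1` / `i-1` with no `mod1` and no branch.

```julia
# Periodic boundaries via ghost cells, 1-D sketch (illustrative names).
N = 250
padded = zeros(N + 2)              # interior lives in 2:N+1; cells 1 and N+2 are ghosts
padded[2:N+1] .= rand(N)

function fill_ghosts!(p)
    p[1]   = p[end-1]              # left ghost  <- last interior cell
    p[end] = p[2]                  # right ghost <- first interior cell
    return p
end

fill_ghosts!(padded)

# A sweep over the interior now needs neither mod1 nor a branch:
obs = 0.0
for i in 2:N+1
    obs += padded[i] * padded[i+1] # i+1 is always in bounds thanks to the ghost cell
end
```

The cost is one boundary copy per sweep, in exchange for a branch-free inner loop.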

This is allocating a new array on every loop iteration, which will be slow. I second the suggestion to use `CartesianIndices` here; that's what they are for. (You could also use StaticArrays.jl.) See Base.Cartesian · The Julia Language


Thanks to jeffreyesun, yha, and stevengj for the helpful answers. This thing is now insanely fast!

One thing I have learned here is that tuples (which can come from elements of an iterator, or from a `CartesianIndex`) are fast! And building a new tuple is better than mutating an array when the result is used to index into an array. Thanks yha for the tip on "changing" an immutable!

Also, there is no problem with tuple + splatting. (Now it seems it's a matter of style?)

``````
function test3(latt1)
N = size(latt1,1)
mvup(x::CartesianIndex, d) = CartesianIndex(ntuple(i -> i==d ? mod1(x[i]+1,N) : x[i], length(x)))

obs = 0.0
for d = 1:3, x in CartesianIndex(1, 1, 1):CartesianIndex(N,N,N)
obs += latt1[x, d] * latt1[mvup(x, d), d]
end
obs/N^2
end

function test3b(latt1)
N = size(latt1,1)
mvup(x::CartesianIndex, d) = CartesianIndex(ntuple(i -> i==d ? mod1(x[i]+1,N) : x[i], length(x)))

function mvup2(x::CartesianIndex, d)
function _f(i1::Int)
if i1 != d
return x[i1]
elseif i1 == d && x[i1] != N
return x[i1] + 1
else
return 1
end
end
return CartesianIndex(ntuple(_f, 3))
end

obs = 0.0
for d = 1:3, x in CartesianIndex(1, 1, 1):CartesianIndex(N,N,N)
obs += latt1[x, d] * latt1[mvup2(x, d), d]
end
obs/N^2
end

function test4(latt1)
N = size(latt1,1)
links = Iterators.product(1:N, 1:N, 1:N, 1:3)

function mvup(x, d)
function _f(i1::Int)
if i1 != d
return x[i1]
elseif i1 == d && x[i1] != N
return x[i1] + 1
else
return 1
end
end
return ntuple(_f, 3)
end

obs = 0.0
for link in links
x = link[1:3]
d = link[4]
obs += latt1[link...] * latt1[mvup(x, d)..., d]
end
obs/N^2
end
``````

and I get

``````
julia> @benchmark test3(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  705.676 ms … 766.002 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     732.916 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   735.877 ms ±  22.589 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  706 ms           Histogram: frequency by time          766 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test3b(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  259.175 ms … 427.390 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     322.829 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   321.440 ms ±  52.464 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  259 ms           Histogram: frequency by time          427 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test4(latt1) setup=(latt1=rand(250,250,250,3))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  356.814 ms … 423.293 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     363.248 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   374.044 ms ±  22.829 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  357 ms           Histogram: frequency by time          423 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

One small thing I found is that writing out the mvup / mvdown branches explicitly (stupidly) seems to be slightly faster than using `mod1`. I will look at the "ghost cells" approach stevengj suggested.

Thank you all again for the amazing comments. This is one of the few times I see my Julia code beating my Fortran code :D.