How to select elements of a three-dimensional array fast

I have large three-dimensional arrays A that have different values, but from which I always wish to select the same set of entries. So A is a different matrix each time, but it has the same size and I need to collect the same entries: see the example code below. The first dimension never plays a role in which elements are selected. I want this to go fast.

The experiment below compares several methods. On my ageing AMD machine at home, Tullio wins hands down, whereas on my newer Intel machine at work I get the following times:

base:   1.629 ms (0 allocations: 0 bytes)
folds:   783.790 μs (91 allocations: 8.94 KiB)
mask:   4.942 ms (0 allocations: 0 bytes)
slices:   1.240 ms (0 allocations: 0 bytes)
loopy:   439.534 μs (82 allocations: 8.44 KiB)
loopier:   653.814 μs (3082 allocations: 7.70 MiB)
tullio:   1.062 ms (41 allocations: 2.23 KiB)

Three questions:

  1. What better alternatives exist?
  2. Why is mask so slow?
  3. What explains the discrepancy between home and office machines? [true for any number of cores]

Thanks!!!

using BenchmarkTools, LoopVectorization, Tullio, Base.Threads, Folds, Referenceables


function f( ::Val{:base}, R, A, b, X  )
    for i ∈ axes( A, 2 )
        for r ∈ axes( A,1 )
            R[r,i] = A[r,i,b[i]]
        end
    end
    return nothing
end

function f( ::Val{:loopy}, R, A, b, X  )
    @threads for i ∈ axes( A, 2 )
        for r ∈ axes( A,1 )
            R[r,i] = A[r,i,b[i]]
        end
    end
    return nothing
end

function f( ::Val{:loopier}, R, A, b, X  )
    @threads for i ∈ axes( A, 2 )
        R[:,i] .= A[:,i,b[i]]
    end
    return nothing
end

function f( ::Val{:tullio}, R, A, b, X )
    @tullio fastmath=false R[r,i] = A[r,i,b[i]]
    return nothing
end



function f( ::Val{:mask}, R, A, b, X )
    # size( A[X] ) )
 @views     R[:] = A[X]
    return nothing
end




function f( ::Val{:slices}, R, A, b, X )
    i = 0
    for (x,y) ∈ zip( eachslice( A; dims=2 ), eachslice( R; dims = 2 ) )
        i += 1
        bᵢ = b[i]
        for r ∈ axes(x,1)
            y[r] = x[r,bᵢ]
        end
    end
    return nothing
end


function f( ::Val{:folds}, R, A, b, X )
    Folds.foreach( referenceable( vec(R) ), X ) do y, x
        y[] = A[x]
    end
end

const d₁ = 1_000
const d₂ = 1_000
const d₃ = 100





for meth ∈ [ :base, :folds, :mask, :slices, :loopy, :loopier, :tullio,  ]
    print( "$(meth): ")
    @btime f( Val($meth), R, A, b, X ) setup=( R=zeros( d₁, d₂); A = randn( d₁,d₂,d₃); b = rand(1:d₃,d₂); X = findall([ (b[i]==j) for r ∈ 1:d₁, i ∈ 1:d₂, j∈ 1:d₃]) )
end
''''

The :base one spends a lot of time on bounds-checking. The version below is twice as fast for me, but only safe if you are sure that b is all in the expected range.

julia> @btime f( Val(:base), R, A, b, nothing) setup=( R=zeros( d₁, d₂); A = randn( d₁,d₂,d₃); b = rand(1:d₃,d₂);)
  1.952 ms (0 allocations: 0 bytes)

julia> function f( ::Val{:base2}, R, A, b, X  )
           axes(A)[1:2] == axes(R) || error("bad size")
           r1 = first(axes(R,1))
           for i ∈ axes( A, 2 )
               b_i = b[i]
               # checkbounds(A,r1,i,b_i)  # including this gives 1.508 ms
               for r ∈ axes( A,1 )
                   @inbounds R[r,i] = A[r,i,b_i]
               end
           end
           return nothing
       end
f (generic function with 2 methods)

julia> @btime f( Val(:base2), R, A, b, nothing) setup=( R=zeros( d₁, d₂); A = randn( d₁,d₂,d₃); b = rand(1:d₃,d₂);)
  980.334 μs (0 allocations: 0 bytes)
2 Likes

The :mask is probably slow because it doesn’t know that it’s taking a bunch of sequential entries (X is just a Vector{CartesianIndex{3}}), so it’s needing to recompute the indices for every access and also it can’t vectorize well because of the possible random access.

These are benchmarking slightly slower than slices (no idea why), though they themselves are basically identical. These do show nice improvements over :base. I suspect that :base is concerned that b[i] might change mid-call so is frequently re-loading it.

note: the poster above me also suggested hoisting, but I’m posting this anyway for the :base_view version which I think is quite readable.

function f( ::Val{:base_hoist}, R, A, b, X  )
    for i ∈ axes( A, 2 )
        bi = b[i]
        for r ∈ axes( A,1 )
            R[r,i] = A[r,i,bi]
        end
    end
    return nothing
end

function f( ::Val{:base_view}, R, A, b, X  )
    for i ∈ axes( A, 2 )
        @views copy!(R[:, i], A[:, i, b[i]])
    end
    return nothing
end

I’m not going to talk about the multithreading ones because that’s a different comparison. Note that a function like this is memory-bound anyway (except at small sizes).

1 Like

One option that may be faster is through repeated copyto!s based on the linear index offsets for each chunk — and then possibly threading over the chunks. copyto! for Arrays can hit optimized memmove instructions that can exploit wider bit-widths than the element size.

1 Like

Thanks a lot all for your comments!