Fast type indexing

tldr: I’m trying to find the most performant way to index into Tuple.parameters for a struct similar to SArray{S<:Tuple}, where I want f(::SArray{S}, i::Int) where S = S.parameters[i]. All my current solutions require a lot of boilerplate and don’t scale well to more dimensions.

Potential Solutions

  1. Stick with Tuple.parameters[i].
  2. Use Val to map each Tuple element. This doesn’t allow much flexibility for iteration, because even compile-time-known sequences of iterations have to go through dynamic dispatch to create each Val(i). It also seems expensive to compile: the first call takes a while on my computer.
  3. Create a jump table of if/else statements mapping each Tuple element. This is performant, but I find it harder to metaprogram for multiple dimensions than the Val solution: I have to edit quote blocks, get into the gritty details of the expressions, and write out each level of dimensionality by hand (e.g., a separate function for the 1-, 2-, 3-, … dimensional cases). It seems like there should be a macro to help with this, but I still need to compose the function head with the contents of the tuple (see the sketch after this list).
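
For reference, here’s the kind of thing I mean for option 3, written as a @generated function so the if/else chain is built from the tuple automatically (just a sketch: genfunc is an illustrative name, and it assumes the TmpType MWE defined below):

# Build the jump table from T.parameters at compile time, for any tuple length.
# The values are spliced in as literals, so no ::Int annotation is needed.
@generated function genfunc(::TmpType{T}, i::Int) where T
    ps = T.parameters                  # e.g. svec(1, 2, 3) for Tuple{1,2,3}
    ex = :(throw(BoundsError()))       # fallback for an out-of-range i
    for n in length(ps):-1:1           # assemble the if/else chain inside-out
        ex = :(i == $n ? $(ps[n]) : $ex)
    end
    return ex
end

This removes the hand-written per-dimension methods, but I’m not sure it’s the best of the three options.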

Progress of Current Investigation

When I had my new friend Traceur take a look at things, I found warnings (*gasp*).

julia> struct TmpType{T} end

julia> t = TmpType{Tuple{1,2,3}}()
TmpType{Tuple{1,2,3}}()

julia> @inline dumbfunc(t::TmpType{T}, i::Int) where T = T.parameters[i]
dumbfunc (generic function with 1 method)

julia> @trace dumbfunc(t, 2)
┌ Warning: unsafe_pointer_to_objref returns Any
└ @ pointer.jl:128
┌ Warning: getindex returns Any
└ @ essentials.jl:549
┌ Warning: dumbfunc returns Any
└ @ REPL[9]:1
2

julia> @benchmark dumbfunc(t, 2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     37.154 ns (0.00% GC)
  median time:      37.900 ns (0.00% GC)
  mean time:        38.525 ns (0.00% GC)
  maximum time:     170.946 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

Alternatively, using Val…

julia> dumbfunc2(t::TmpType{T}, ::Val{I}) where {T,I} = T.parameters[I]
dumbfunc2 (generic function with 1 method)

julia> i = Val(2)
Val{2}()

julia> @trace dumbfunc2(t, i)
┌ Warning: unsafe_pointer_to_objref returns Any
└ @ pointer.jl:128
┌ Warning: getindex returns Any
└ @ essentials.jl:549
2

julia> @benchmark dumbfunc2(t, i)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.140 ns (0.00% GC)
  median time:      38.990 ns (0.00% GC)
  mean time:        41.482 ns (0.00% GC)
  maximum time:     140.588 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

So that seems a little better, but then I had a crazy idea to handle the unsafe_pointer_to_objref warning.

# Define one method per index, so each value is returned as a spliced literal
for i in 1:3
    eval(:(dumbfunc3(::TmpType{Tuple{I1,I2,I3}}, ::Val{$i}) where {I1,I2,I3} = $(Symbol("I",i))::Int))
end

julia> @trace dumbfunc3(t, i)
2

julia> @benchmark dumbfunc3(t, i)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.130 ns (0.00% GC)
  median time:      18.368 ns (0.00% GC)
  mean time:        18.408 ns (0.00% GC)
  maximum time:     53.923 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

This finally gets us real speed, because no indexing has to occur at run time.
But it is SUPER restrictive: all of that performance is lost once I try iterating
through each value.

@inline function itr_with_index(x::TmpType)
    for idx in 1:3
        dumbfunc(t, idx)
    end
end

@inline function itr_with_val(x::TmpType)
    for idx in 1:3
        dumbfunc3(t, Val(idx))
    end
end

julia> @benchmark itr_with_index(t)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     127.024 ns (0.00% GC)
  median time:      128.688 ns (0.00% GC)
  mean time:        130.207 ns (0.00% GC)
  maximum time:     325.132 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     891

julia> @benchmark itr_with_val(t)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.782 μs (0.00% GC)
  median time:      7.868 μs (0.00% GC)
  mean time:        7.972 μs (0.00% GC)
  maximum time:     41.136 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

So maybe I should use if/else control flow.

Base.@pure function dumbfunc4(::TmpType{Tuple{I1,I2,I3}}, i::Int) where {I1,I2,I3}
    if i == 1
        I1::Int
    elseif i == 2
        I2::Int
    elseif i == 3
        I3::Int
    end
end

@inline function itr_with_control_flow(x::TmpType)
    for idx in 1:3
        dumbfunc4(t, idx)
    end
end

julia> @benchmark itr_with_index(t)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     126.215 ns (0.00% GC)
  median time:      127.884 ns (0.00% GC)
  mean time:        134.846 ns (0.00% GC)
  maximum time:     285.840 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     896

julia> @benchmark itr_with_control_flow(t)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     70.126 ns (0.00% GC)
  median time:      70.339 ns (0.00% GC)
  mean time:        71.062 ns (0.00% GC)
  maximum time:     205.784 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     975

So control flow is the best option for iteration.

Single-value indexing is comparable:

julia> @benchmark dumbfunc3(t, i)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.689 ns (0.00% GC)
  median time:      19.742 ns (0.00% GC)
  mean time:        20.376 ns (0.00% GC)
  maximum time:     87.241 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark dumbfunc4(t, 2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.716 ns (0.00% GC)
  median time:      20.045 ns (0.00% GC)
  mean time:        20.127 ns (0.00% GC)
  maximum time:     71.109 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

Now let’s see how this scales.

function big_itr_with_index(x::TmpType)
    for i in 1:1_000_000
        itr_with_index(x)
    end
end

function big_itr_with_control_flow(x::TmpType)
    for i in 1:1_000_000
        itr_with_control_flow(x)
    end
end


julia> @benchmark big_itr_with_index(t)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     106.994 ms (0.00% GC)
  median time:      107.456 ms (0.00% GC)
  mean time:        107.868 ms (0.00% GC)
  maximum time:     114.997 ms (0.00% GC)
  --------------
  samples:          47
  evals/sample:     1

julia> @benchmark big_itr_with_control_flow(t)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     54.505 ms (0.00% GC)
  median time:      55.100 ms (0.00% GC)
  mean time:        55.156 ms (0.00% GC)
  maximum time:     57.452 ms (0.00% GC)
  --------------
  samples:          91
  evals/sample:     1

So this is roughly twice as fast.

Can you explain what this is for? It’s a bit unclear to me what’s wrong with the naive solution: why is 35 ns too slow?

If you are willing to write f(type, Val(i)), then it should be possible to make a generated function which will compile down to nothing. But that won’t help if i is a run-time integer.
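
For instance, a minimal sketch using the TmpType MWE from above (valfunc is an illustrative name):

@generated function valfunc(::TmpType{T}, ::Val{I}) where {T,I}
    return :($(T.parameters[I]))  # the answer is spliced in as a compile-time constant
end

julia> valfunc(t, Val(2))
2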


Typically I’d agree that 34 ns would be inconsequential, but my use case is a lot bigger than the MWE I created: I’m creating a custom layer for a neural network.

My actual type is a StaticRange{B,E,S,L,T} (begin/start, end/stop, step, length, type), and I’m applying a sliding window to a 240 x 240 x 240 x 1 x 28 array (I have ~133 of these arrays). At any given time I have 2 sets of StaticRanges, so I’m trying to create a function that looks something like this:

function applywindow!(f::Function, a::AbstractArray,
                      inds::StaticIndices{Tuple{<:StaticRange,<:StaticRange,<:StaticRange,<:StaticRange},S,T,N},
                      window::StaticIndices{Tuple{<:StaticRange,<:StaticRange,<:StaticRange,<:StaticRange},S,T,N}) where {S,T,N}
    # apply f over all indices in the window
end

I’ve already figured out how to get my StaticRange to index like the ranges in Base, do linear indexing, etc., but writing out all the values takes forever, and I’d like to be able to generalize this to other layers without doing it all by hand every time.

Note: I know the ideal solution would be to use GPUs, but that currently is not an option.

It sounds like S, L, T are fixed, but B, E change with iterations of some process? If so, that sounds awful. Putting the size of some window into the type might be a good idea (e.g., for unrolling something), but putting its position there sounds like it will cause endless compilation.

But that aside, if these 5 type parameters have different meanings, why do you want to extract them according to an integer i? Why can’t f(x::StaticRange{B,E,S,L,T}) where {B,E,S,L,T} extract them?
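
That is, something like this sketch (unpack_params is an illustrative name, assuming your StaticRange type):

# Dispatch hands you all five parameters at once, with no run-time indexing
unpack_params(::StaticRange{B,E,S,L,T}) where {B,E,S,L,T} = (B, E, S, L, T)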


I didn’t think of taking those parts out of the window variable. I was planning on getting the relative position of each index in the window and just adding my stride values along each dimension to them. I guess I just didn’t want to create another type, but it might make more sense.

I just fixed my previous post. My StaticIndices are actually a subtype of StaticArray, and the StaticRange types sit inside a Tuple, as in the MWE. I can definitely extract them as you suggested, but then I have the same problem as in the original example.

FYI you should probably use fieldtype instead of .parameters[i].

julia> t = typeof((1,2.0,3))
Tuple{Int64,Float64,Int64}

julia> fieldtype(t, 2)
Float64

It gained a few percent over your dumbfunc in my trial.
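
The drop-in version I have in mind is something like this sketch (dumbfunc_ft is an illustrative name):

# Same interface as dumbfunc, but goes through fieldtype instead of .parameters;
# for a Tuple type this returns the corresponding parameter
@inline dumbfunc_ft(::TmpType{T}, i::Int) where T = fieldtype(T, i)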

Your loop iteration code had a bug:

@inline function itr_with_index(x::TmpType)
    for idx in 1:3
        dumbfunc(t, idx)
    end
end

The argument is x, but the loop body calls dumbfunc on the global t. With the correction,

julia> @inline function itr_with_index(t::TmpType)
           for idx in 1:3
               dumbfunc(t, idx)
           end
       end
itr_with_index (generic function with 1 method)

julia> @benchmark itr_with_index($t)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.025 ns (0.00% GC)
  median time:      35.128 ns (0.00% GC)
  mean time:        35.656 ns (0.00% GC)
  maximum time:     124.727 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> using Unrolled

julia> @unroll function itr_with_unroll(t::TmpType, r=@fixed_range(1:3))
           @unroll for idx in r
               dumbfunc(t, idx)
           end
       end
itr_with_unroll_unrolled_expansion_ (generic function with 1 method)

julia> @benchmark itr_with_unroll($t)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.020 ns (0.00% GC)
  median time:      0.033 ns (0.00% GC)
  mean time:        0.033 ns (0.00% GC)
  maximum time:     6.584 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

With Unrolled.jl, Julia performed the whole computation at compile time. Try it in a real scenario and see how it performs.

BTW, my experience is that this kind of manipulation is fine by itself, but it doesn’t scale. Don’t build parametric types of parametric types of …


Thank you! I’ll check back in as I do some more thorough testing.

Yes, it takes a lot of care to do this appropriately without crossing a line.