Ntuple aggressive specialisation and boxed values

Hi,

Further to my questions on the General slack today I decided to do an experiment, pitting an NTuple based kmer type, against the existing BioSequences Mer types which are based on primitive types.

Valentin advised me to use ntuple to generate “tail” in the code, and to use ntuple in a way that it aggressively specialised.

Anyway I did and the results are here: https://gist.github.com/BenJWard/4e06c5c4f4648c594fdb1a886cf5042d

When I benchmarked I found v. bad performance:

julia> @benchmark DNAKmer{63,2}($dnaseq)
BenchmarkTools.Trial: 
  memory estimate:  1.36 KiB
  allocs estimate:  87
  --------------
  minimum time:     24.255 μs (0.00% GC)
  median time:      24.460 μs (0.00% GC)
  mean time:        25.328 μs (0.00% GC)
  maximum time:     307.687 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Vs the primitive type:

julia> @benchmark BigDNAMer{63}($dnaseq)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     126.889 ns (0.00% GC)
  median time:      126.991 ns (0.00% GC)
  mean time:        130.191 ns (0.00% GC)
  maximum time:     303.371 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     883

But looking at the code warntype of the function I could see the variable idx was boxed, and the tuple “tail” had an element type of any.

I wasnt sure why the boxin occured, but changing idx to a Ref fixed the issue, and it beats the primitive type performance!

julia> @benchmark DNAKmer{63,2}($dnaseq)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     91.087 ns (0.00% GC)
  median time:      91.181 ns (0.00% GC)
  mean time:        92.795 ns (0.00% GC)
  maximum time:     243.353 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     952

On Julia 1.5

Why did the boxing occur, and why was it Ref fixed it - in both cases the idx is just an int.

2 Likes
@inline shiftright(x::BigDNAMer{K}) where {K} = BigDNAMer{K}(reinterpret(UInt128, x) >> 2)

@inline function shiftright(x::Kmer{A,K,N}) where {A,K,N}
    head = @inbounds x.data[1] >> 2
    tail = ntuple(Val{N - 1}()) do i
        Base.@_inline_meta
        j = i + 1
        @inbounds begin
            return (x.data[j] >> 2) | ((x.data[i] & UInt64(3)) << 62)
        end
    end
    return Kmer{A,K,N}((head, tail...))
end

@benchmark shiftright($m)
@benchmark shiftright($oldm)

I tried to implement a good rightshift of all the nucleotides stored in a ntuple based Kmer.
Performance is almost as good a the UInt128 based kmer oldm. But not quite. Any way I could get it even quicker?

Turns out you can get faster:

    return _shiftright(zero(UInt64), x.data...)
end

@inline function _shiftright(carry::UInt64, head::UInt64, tail...)
    Base.@_inline_meta
    return ((head >> 2) | carry, _shiftright((head & UInt64(3)) << 62, tail...)...)
end

@inline _shiftright(carry::UInt64) = ()

The Base.@_inline_meta should be redundant here, since it is equivalent to the @inline already in front of the function definition.

1 Like

So I tried to introduce a similar shiftleft function:

@inline function shiftleft(x::Kmer{A,K,N}) where {A,K,N}
    _, newbits = _shiftleft(x.data...)
    return Kmer{A,K,N}(_cliphead(64N - 2K, newbits...))
end

@inline function _cliphead(by::Integer, head::UInt64, tail...)
    return (head & (typemax(UInt64) >> by), tail...)
end

@inline function _shiftleft(head::UInt64, tail...)
    carry, newtail = _shiftleft(tail...)
    return head >> 62, ((head << 2) | carry, newtail...)
end

@inline _shiftleft(head::UInt64) = (head & 0xC000000000000000) >> 62, head << 2

Which does similar to shiftright, but with clipping of the tuple’s head element at the end.

Benchmarking this I get allocations and slower times:

julia> @benchmark shiftleft($m)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  6
  --------------
  minimum time:     227.321 ns (0.00% GC)
  median time:      229.827 ns (0.00% GC)
  mean time:        241.572 ns (1.41% GC)
  maximum time:     3.747 μs (93.68% GC)
  --------------
  samples:          10000
  evals/sample:     442
julia> @code_warntype shiftleft(m)
Variables
  #self#::Core.Compiler.Const(shiftleft, false)
  x::Kmer{DNAAlphabet{2},63,2}
  @_3::Int64
  newbits::Tuple{UInt64,UInt64}

Body::Kmer{DNAAlphabet{2},63,2}
1 ─       nothing
│   %2  = Base.getproperty(x, :data)::Tuple{UInt64,UInt64}
│   %3  = Core._apply_iterate(Base.iterate, Main._shiftleft, %2)::Tuple{UInt64,Tuple{UInt64,UInt64}}
│   %4  = Base.indexed_iterate(%3, 1)::Core.Compiler.PartialStruct(Tuple{UInt64,Int64}, Any[UInt64, Core.Compiler.Const(2, false)])
│         Core.getfield(%4, 1)
│         (@_3 = Core.getfield(%4, 2))
│   %7  = Base.indexed_iterate(%3, 2, @_3::Core.Compiler.Const(2, false))::Core.Compiler.PartialStruct(Tuple{Tuple{UInt64,UInt64},Int64}, Any[Tuple{UInt64,UInt64}, Core.Compiler.Const(3, false)])
│         (newbits = Core.getfield(%7, 1))
│   %9  = Core.apply_type(Main.Kmer, $(Expr(:static_parameter, 1)), $(Expr(:static_parameter, 2)), $(Expr(:static_parameter, 3)))::Core.Compiler.Const(Kmer{DNAAlphabet{2},63,2}, false)
│   %10 = (%9)(newbits)::Kmer{DNAAlphabet{2},63,2}
└──       return %10