Attempting to customise broadcasting for BioSequences

Hello! I’m currently trying to customise and improve how Broadcasting is done with BioSequences.jl - types. Specifically LongSequence at this moment, for a v3 release. And I could do with some help and advice doing it properly.

So, BioSequence types are not strictly arrays in that they don’t inherit from any abstract array type. But they implement and satisfy indexing, iteration etc. As if they were vectors - they have one dimension, they have the IndexStyle of Base.IndexLinear, they have length, size will return (length(x),) and axes will return (OneTo(length(x)), )

We want to customise broadcasting such that - in some cases, the output type is a LongSequence - but tbh most of the time it’s probably a vector/array.

So I want to start with a simple example: a .== DNA_G, where a is a LongSequence.

julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]

julia> a = dna"ATCGATCGATCGATCGATCG"
20nt DNA Sequence:
ATCGATCGATCGATCGATCG

julia> a .== DNA_G
20-element BitVector:
 0
 0
 0
 1
 0
 0
 0
 1
 0
 0
 0
 1
 0
 0
 0
 1
 0
 0
 0
 1

julia> Base.broadcastable(a)
20-element Vector{DNA}:
 DNA_A
 DNA_T
 DNA_C
 DNA_G
 DNA_A
 DNA_T
 DNA_C
 DNA_G
 DNA_A
 DNA_T
 DNA_C
 DNA_G
 DNA_A
 DNA_T
 DNA_C
 DNA_G
 DNA_A
 DNA_T
 DNA_C
 DNA_G

So for the simple example of a .== DNA_G, we expect an array (well a bitarray) out, and we get it. BUT we can see broadcasting spends the time and memory to build a vector of DNA (as per Base.broadcastable), rather than simply index into a itself.

So reading the customising broadcasting docs, I can see I can do something about this by implementing Base.broadcastable, so I do: Base.broadcastable(x::LongSequence) = x.

At this point trying again, I get an error:

julia> a .== DNA_G
ERROR: MethodError: no method matching getindex(::LongSequence{DNAAlphabet{4}}, ::CartesianIndex{1})
Closest candidates are:
  getindex(::LongSequence, ::UnitRange{var"#s30"} where var"#s30"<:Integer) at /Users/bward/repos/github/BioJulia/BioSequences/src/longsequences/indexing.jl:39
  getindex(::BioSequence, ::Colon) at /Users/bward/repos/github/BioJulia/BioSequences/src/biosequence/indexing.jl:66
  getindex(::BioSequence, ::AbstractVector{Bool}) at /Users/bward/repos/github/BioJulia/BioSequences/src/biosequence/indexing.jl:43

Ok, this is a bit weird to me, because my sequence type implements linear indexing, not cartesian, and satisfies the traits and methods for linear indexing. Indeed if I check the broadcast style:

julia> Base.BroadcastStyle(typeof(a))
Base.Broadcast.DefaultArrayStyle{1}()

So it know’s LongSequence is a 1 dimensional thing.

Ok so out of interest I do this: Base.getindex(x::LongSequence, i::Base.CartesianIndex{1}) = x[first(i.I)]

Such that trying to index with a 1d cartesian index just passes straight to linear indexing, and try again…

It looks like it worked…

Before

julia> using BenchmarkTools

julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]

julia> a = dna"ATCGATCGATCGATCGATCG"
20nt DNA Sequence:
ATCGATCGATCGATCGATCG

julia> @benchmark a .== DNA_G
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
 Range (min … max):  610.503 ns …  11.402 μs  ┊ GC (min … max): 0.00% … 94.11%
 Time  (median):     620.303 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   636.760 ns ± 335.831 ns  ┊ GC (mean ± σ):  1.81% ±  3.24%

   ▁▆▆█▃                                                         
  ▃█████▅▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  611 ns           Histogram: frequency by time          753 ns <

 Memory estimate: 336 bytes, allocs estimate: 6.

After

julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]

julia> a = dna"ATCGATCGATCGATCGATCG"
20nt DNA Sequence:
ATCGATCGATCGATCGATCG

julia> using BenchmarkTools

julia> @benchmark a .== DNA_G
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
 Range (min … max):  554.548 ns …   9.028 μs  ┊ GC (min … max): 0.00% … 93.01%
 Time  (median):     565.371 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   579.382 ns ± 237.237 ns  ┊ GC (mean ± σ):  1.21% ±  2.78%

  ▃▇██▇▆▅▄▃▃▂▁▁▁▁                                               ▂
  ████████████████████▇▇██▆▇▆▆▆▇▇▅▄▅▇▆▆▆▆▅▅▃▆▇▇▇▆▅▄▅▄▄▃▄▃▃▅▁▅▁▅ █
  555 ns        Histogram: log(frequency) by time        738 ns <

 Memory estimate: 224 bytes, allocs estimate: 5.

So with broadcasting using the sequence itself, there’s -1 allocation occurring and it’s going slightly faster.

So at this point, I should ask - was what I did the right thing to do, or have I borked something that kinda works but not really? It still feels weird to me to be indexing a linear thing with the cartesian.

I suppose to move onto cases like a .= DNA_A or say complement.(a) which should return a sequence and not an array, I’ll be going more into the machinery and implementing a style, and possibly implementing the copy and copyto!, methods, and figuring out precedence for styles. Has anyone implemented broadcasting for a type that behaves like an array but isn’t exactly <: AbstractArray{1}, like BioSequences, that might provide me a good example?

Many thanks,
Ben.

2 Likes