Hello! I’m currently trying to customise and improve how Broadcasting is done with BioSequences.jl - types. Specifically LongSequence
at this moment, for a v3 release. And I could do with some help and advice doing it properly.
So, BioSequence types are not strictly arrays in that they don’t inherit from any abstract array type. But they implement and satisfy indexing, iteration etc. As if they were vectors - they have one dimension, they have the IndexStyle
of Base.IndexLinear
, they have length
, size
will return (length(x),)
and axes
will return (OneTo(length(x)), )
We want to customise broadcasting such that - in some cases, the output type is a LongSequence - but tbh most of the time it’s probably a vector/array.
So I want to start with a simple example: a .== DNA_G
, where a is a LongSequence
.
julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]
julia> a = dna"ATCGATCGATCGATCGATCG"
20nt DNA Sequence:
ATCGATCGATCGATCGATCG
julia> a .== DNA_G
20-element BitVector:
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
julia> Base.broadcastable(a)
20-element Vector{DNA}:
DNA_A
DNA_T
DNA_C
DNA_G
DNA_A
DNA_T
DNA_C
DNA_G
DNA_A
DNA_T
DNA_C
DNA_G
DNA_A
DNA_T
DNA_C
DNA_G
DNA_A
DNA_T
DNA_C
DNA_G
So for the simple example of a .== DNA_G
, we expect an array (well a bitarray) out, and we get it. BUT we can see broadcasting spends the time and memory to build a vector of DNA
(as per Base.broadcastable
), rather than simply index into a
itself.
So reading the customising broadcasting docs, I can see I can do something about this by implementing Base.broadcastable
, so I do: Base.broadcastable(x::LongSequence) = x
.
At this point trying again, I get an error:
julia> a .== DNA_G
ERROR: MethodError: no method matching getindex(::LongSequence{DNAAlphabet{4}}, ::CartesianIndex{1})
Closest candidates are:
getindex(::LongSequence, ::UnitRange{var"#s30"} where var"#s30"<:Integer) at /Users/bward/repos/github/BioJulia/BioSequences/src/longsequences/indexing.jl:39
getindex(::BioSequence, ::Colon) at /Users/bward/repos/github/BioJulia/BioSequences/src/biosequence/indexing.jl:66
getindex(::BioSequence, ::AbstractVector{Bool}) at /Users/bward/repos/github/BioJulia/BioSequences/src/biosequence/indexing.jl:43
Ok, this is a bit weird to me, because my sequence type implements linear indexing, not cartesian, and satisfies the traits and methods for linear indexing. Indeed if I check the broadcast style:
julia> Base.BroadcastStyle(typeof(a))
Base.Broadcast.DefaultArrayStyle{1}()
So it know’s LongSequence is a 1 dimensional thing.
Ok so out of interest I do this: Base.getindex(x::LongSequence, i::Base.CartesianIndex{1}) = x[first(i.I)]
Such that trying to index with a 1d cartesian index just passes straight to linear indexing, and try again…
It looks like it worked…
Before
julia> using BenchmarkTools
julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]
julia> a = dna"ATCGATCGATCGATCGATCG"
20nt DNA Sequence:
ATCGATCGATCGATCGATCG
julia> @benchmark a .== DNA_G
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
Range (min … max): 610.503 ns … 11.402 μs ┊ GC (min … max): 0.00% … 94.11%
Time (median): 620.303 ns ┊ GC (median): 0.00%
Time (mean ± σ): 636.760 ns ± 335.831 ns ┊ GC (mean ± σ): 1.81% ± 3.24%
▁▆▆█▃
▃█████▅▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
611 ns Histogram: frequency by time 753 ns <
Memory estimate: 336 bytes, allocs estimate: 6.
After
julia> using BioSequences
[ Info: Precompiling BioSequences [7e6ae17a-c86d-528c-b3b9-7f778a29fe59]
julia> a = dna"ATCGATCGATCGATCGATCG"
20nt DNA Sequence:
ATCGATCGATCGATCGATCG
julia> using BenchmarkTools
julia> @benchmark a .== DNA_G
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
Range (min … max): 554.548 ns … 9.028 μs ┊ GC (min … max): 0.00% … 93.01%
Time (median): 565.371 ns ┊ GC (median): 0.00%
Time (mean ± σ): 579.382 ns ± 237.237 ns ┊ GC (mean ± σ): 1.21% ± 2.78%
▃▇██▇▆▅▄▃▃▂▁▁▁▁ ▂
████████████████████▇▇██▆▇▆▆▆▇▇▅▄▅▇▆▆▆▆▅▅▃▆▇▇▇▆▅▄▅▄▄▃▄▃▃▅▁▅▁▅ █
555 ns Histogram: log(frequency) by time 738 ns <
Memory estimate: 224 bytes, allocs estimate: 5.
So with broadcasting using the sequence itself, there’s -1 allocation occurring and it’s going slightly faster.
So at this point, I should ask - was what I did the right thing to do, or have I borked something that kinda works but not really? It still feels weird to me to be indexing a linear thing with the cartesian.
I suppose to move onto cases like a .= DNA_A
or say complement.(a)
which should return a sequence and not an array, I’ll be going more into the machinery and implementing a style, and possibly implementing the copy
and copyto!
, methods, and figuring out precedence for styles. Has anyone implemented broadcasting for a type that behaves like an array but isn’t exactly <: AbstractArray{1}
, like BioSequences, that might provide me a good example?
Many thanks,
Ben.