I have a PR to BandedMatrices with the following SnoopPrecompile instructions:

@precompile_setup begin
    vs = ([1.0], Float32[1.0], ComplexF32[1.0], ComplexF64[1.0])
    Bs = Any[BandedMatrix(0 => v) for v in vs]
    @precompile_all_calls begin
        for B in Bs, op in (+, -, *)
            op(B, B)
        end
        for (v, B) in zip(vs, Bs)
            B * v
        end
    end
end

The precompilation fully removes the TTFX overhead for the +, - and * operations, but not for B * v. I wonder why it doesn't work for the matrix-vector product, while it does for the matrix-matrix one? Both sets of calls are ultimately handled by ArrayLayouts, and would presumably descend to BLAS calls.

Have you checked for invalidations? See the script here (the plotting is very nice but optional; the key info is in staletrees, which you can just display).

julia> using SnoopCompileCore

julia> invalidations = @snoopr using BandedMatrices;

julia> tinf = @snoopi_deep begin
           B = BandedMatrix(0 => zeros(10));
           v = rand(size(B,2));
           B * v
       end;

julia> using SnoopCompile

julia> trees = invalidation_trees(invalidations);

julia> staletrees = precompile_blockers(trees, tinf)
SnoopCompile.StaleTree[]

The most offensive method leading to invalidation seems to be

julia> trees[end]
inserting all(f::Function, x::FillArrays.AbstractFill) @ FillArrays ~/.julia/packages/FillArrays/o1UXZ/src/FillArrays.jl:590 invalidated:
   backedges: 1: superseding all(f::Function, a::AbstractArray; dims) @ Base reducedim.jl:1007 with MethodInstance for all(::TOML.Internals.Printer.var"#1#2", ::AbstractVector) (256 children)

Is there a way to resolve this in any case?
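For intuition, here is a self-contained toy sketch (with a made-up `Ones` type, not the actual FillArrays code) of the mechanism behind this report: inserting a method that is more specific than the one a caller was compiled against invalidates that caller's compiled code.

```julia
# Toy illustration of method-insertion invalidation (hypothetical `Ones`
# type; not FillArrays itself).
struct Ones <: AbstractVector{Float64}
    n::Int
end
Base.size(o::Ones) = (o.n,)
Base.getindex(o::Ones, i::Int) = 1.0

check(a::AbstractVector) = all(x -> x > 0, a)
check(Ones(3))   # compiles `check` against Base's all(::Function, ::AbstractArray)

# Inserting a more specific method afterwards invalidates that compiled
# code, because the `all` call inside `check` now dispatches here:
Base.all(f::Function, o::Ones) = f(1.0)
check(Ones(3))   # forces recompilation through the new method
```

Here the FillArrays `all(f::Function, x::AbstractFill)` method plays the role of `Base.all(f, ::Ones)`, and the TOML method instance is the compiled caller.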

What does ProfileView.view(flamegraph(tinf)) show? You can left-click on the base of each flame to see the entry point for type-inference. You can also see the list with tinf.children.

Ah no, my bad, I was on the master branch. On the PR branch including SnoopPrecompile, staletrees isn’t empty.

julia> using SnoopCompileCore

julia> invalidations = @snoopr using BandedMatrices;
[ Info: Precompiling BandedMatrices [aae01518-5342-5314-be14-df237901396f]

julia> tinf = @snoopi_deep begin
           B = BandedMatrix(0 => zeros(10));
           v = rand(size(B,2));
           B * v
       end;

julia> using SnoopCompile

julia> trees = invalidation_trees(invalidations);

julia> staletrees = precompile_blockers(trees, tinf)
3-element Vector{SnoopCompile.StaleTree}:
 inserting eltype(::Type{T}) where T<:StableIndexedCursor @ AbstractTrees ~/.julia/packages/AbstractTrees/x9S7q/src/cursors.jl:332 invalidated:
   mt_backedges: 1: MethodInstance for eltype(::AbstractVector{Float64}) at depth 1 with 17 children blocked InferenceTimingNode: 0.000295/0.411388 on *(::BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, ::Vector{Float64}) with 1 direct children
 inserting eltype(::Type{T}) where T<:StableCursor @ AbstractTrees ~/.julia/packages/AbstractTrees/x9S7q/src/cursors.jl:295 invalidated:
   mt_backedges: 1: MethodInstance for eltype(::AbstractVector{Float64}) at depth 1 with 17 children blocked InferenceTimingNode: 0.000295/0.411388 on *(::BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, ::Vector{Float64}) with 1 direct children
 inserting eltype(::Type{<:AbstractTrees.TreeIterator{T}}) where T @ AbstractTrees ~/.julia/packages/AbstractTrees/x9S7q/src/iteration.jl:68 invalidated:
   mt_backedges: 1: MethodInstance for eltype(::AbstractVector{Float64}) at depth 1 with 17 children blocked InferenceTimingNode: 0.000295/0.411388 on *(::BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, ::Vector{Float64}) with 1 direct children

It looks like the * method that gets used for a BandedMatrix has some inference failures. If those are fixable, fixing them should resolve the invalidations. And of course you might get a performance boost as well.
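If it helps, SnoopCompile can also point at the specific call sites where inference had to restart. This is a sketch continuing the session above (`tinf` is the result of the `@snoopi_deep` run):

```julia
using SnoopCompile

# Call sites where inference was triggered from runtime dispatch,
# i.e. where the caller could not infer the callee's argument types:
itrigs = inference_triggers(tinf)

# Group the triggers by the method doing the calling:
mtrigs = accumulate_by_source(Method, itrigs)

# To walk the call chain of one trigger interactively:
# using Cthulhu
# ascend(itrigs[1])
```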

Thanks, that helps! The issue might be that _banded_muladd! is recursive, which leads to an inference failure:

function _banded_muladd!(α::T, A, x::AbstractVector, β, y) where T
    m, n = size(A)
    (length(y) ≠ m || length(x) ≠ n) && throw(DimensionMismatch("*"))
    l, u = bandwidths(A)
    if -l > u # no bands
        _fill_lmul!(β, y)
    elseif l < 0
        _banded_muladd!(α, view(A, :, 1-l:n), view(x, 1-l:n), β, y)
    elseif u < 0
        y[1:-u] .= zero(T)
        _banded_muladd!(α, view(A, 1-u:m, :), x, β, view(y, 1-u:m))
        y
    else
        _banded_gbmv!('N', α, A, x, β, y)
    end
end

Good guess, as recursive calls that change some of the argument types present real challenges for inference. If in the outer call you can predict which branch the inner call will take (i.e., the output of bandwidths is predictable for the view), then presumably this is cleanly fixable. If not, a good bandaid might be to make the self-calls Base.invokelatest(_banded_muladd!, α, ...).
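If the trimmed views are indeed guaranteed to have nonnegative bandwidths (an assumption, not verified here; e.g. in the l < 0 case the first branch has already excluded -l > u, so u + l ≥ 0 for the column-trimmed view), then the recursive calls always land in the BLAS branch, and one untested sketch is to call it directly and break the recursion:

```julia
# Untested sketch: replace the self-calls with direct calls to the BLAS
# branch, assuming the trimmed views always have nonnegative bandwidths.
function _banded_muladd!(α::T, A, x::AbstractVector, β, y) where T
    m, n = size(A)
    (length(y) ≠ m || length(x) ≠ n) && throw(DimensionMismatch("*"))
    l, u = bandwidths(A)
    if -l > u                  # no bands at all
        _fill_lmul!(β, y)
    elseif l < 0               # leading zero columns: trim, then straight to BLAS
        _banded_gbmv!('N', α, view(A, :, 1-l:n), view(x, 1-l:n), β, y)
    elseif u < 0               # leading zero rows: zero the head of y, then BLAS on the tail
        y[1:-u] .= zero(T)
        _banded_gbmv!('N', α, view(A, 1-u:m, :), x, β, view(y, 1-u:m))
        y
    else
        _banded_gbmv!('N', α, A, x, β, y)
    end
end
```

This keeps the call graph non-recursive, so each call site has a single concrete signature for inference to work with.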