Why isn't SnoopPrecompile able to fully compile the calls in this case?

I have a PR to BandedMatrices with the following SnoopPrecompile instructions:

@precompile_setup begin
	# cover the common element types: Float64, Float32, ComplexF32, ComplexF64
	vs = ([1.0], Float32[1.0], ComplexF32[1.0], ComplexF64[1.0])
	Bs = Any[BandedMatrix(0 => v) for v in vs]
	@precompile_all_calls begin
		for B in Bs, op in (+, -, *)
			op(B, B)   # matrix-matrix arithmetic
		end
		for (v, B) in zip(vs, Bs)
			B * v      # matrix-vector product
		end
	end
end

Precompilation fully removes the TTFX overhead for the +, - and * operations, but not for the B * v one. I wonder why it doesn’t work for the matrix-vector product while it does for the matrix-matrix one? Both sets of calls are ultimately handled by ArrayLayouts, and would presumably descend to BLAS calls.


Have you checked for invalidations? See the script here (the plotting is very nice but optional; the key info is in staletrees, which you can just display).

staletrees seems empty in this case

julia> using SnoopCompileCore

julia> invalidations = @snoopr using BandedMatrices;

julia> tinf = @snoopi_deep begin
           B = BandedMatrix(0 => zeros(10));
           v = rand(size(B,2));
           B * v
       end;

julia> using SnoopCompile

julia> trees = invalidation_trees(invalidations);

julia> staletrees = precompile_blockers(trees, tinf)
SnoopCompile.StaleTree[]

The method insertion causing the most invalidations seems to be

julia> trees[end]
inserting all(f::Function, x::FillArrays.AbstractFill) @ FillArrays ~/.julia/packages/FillArrays/o1UXZ/src/FillArrays.jl:590 invalidated:
   backedges: 1: superseding all(f::Function, a::AbstractArray; dims) @ Base reducedim.jl:1007 with MethodInstance for all(::TOML.Internals.Printer.var"#1#2", ::AbstractVector) (256 children)

and I wonder whether there’s a way to resolve this in any case?

If staletrees is empty then (barring bugs in SnoopCompile) presumably the invalidations are not the explanation for the failure to use cached code for *. The all invalidation looks like it only affects TOML parsing, and will likely be fixed by “use invokelatest to prevent invalidations in TOML by KristofferC · Pull Request #48083 · JuliaLang/julia · GitHub”.

What does ProfileView.view(flamegraph(tinf)) show? You can left-click on the base of each flame to see the entry point for type-inference. You can also see the list with tinf.children.
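In case it’s useful, the workflow is roughly as follows (a minimal sketch, assuming ProfileView is installed alongside SnoopCompile and reusing the tinf captured above):

julia> using SnoopCompile, ProfileView

julia> fg = flamegraph(tinf);   # convert the inference-timing tree into a flamegraph

julia> ProfileView.view(fg)     # left-click the base of a flame to see its inference entry point

julia> tinf.children            # or inspect the list of entry points textually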


Ah no, my bad, I was on the master branch. On the PR branch including SnoopPrecompile, staletrees isn’t empty.

julia> using SnoopCompileCore

julia> invalidations = @snoopr using BandedMatrices;
[ Info: Precompiling BandedMatrices [aae01518-5342-5314-be14-df237901396f]

julia> tinf = @snoopi_deep begin
           B = BandedMatrix(0 => zeros(10));
           v = rand(size(B,2));
           B * v
       end;

julia> using SnoopCompile

julia> trees = invalidation_trees(invalidations);

julia> staletrees = precompile_blockers(trees, tinf)
3-element Vector{SnoopCompile.StaleTree}:
 inserting eltype(::Type{T}) where T<:StableIndexedCursor @ AbstractTrees ~/.julia/packages/AbstractTrees/x9S7q/src/cursors.jl:332 invalidated:
   mt_backedges: 1: MethodInstance for eltype(::AbstractVector{Float64}) at depth 1 with 17 children blocked InferenceTimingNode: 0.000295/0.411388 on *(::BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, ::Vector{Float64}) with 1 direct children

 inserting eltype(::Type{T}) where T<:StableCursor @ AbstractTrees ~/.julia/packages/AbstractTrees/x9S7q/src/cursors.jl:295 invalidated:
   mt_backedges: 1: MethodInstance for eltype(::AbstractVector{Float64}) at depth 1 with 17 children blocked InferenceTimingNode: 0.000295/0.411388 on *(::BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, ::Vector{Float64}) with 1 direct children

 inserting eltype(::Type{<:AbstractTrees.TreeIterator{T}}) where T @ AbstractTrees ~/.julia/packages/AbstractTrees/x9S7q/src/iteration.jl:68 invalidated:
   mt_backedges: 1: MethodInstance for eltype(::AbstractVector{Float64}) at depth 1 with 17 children blocked InferenceTimingNode: 0.000295/0.411388 on *(::BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, ::Vector{Float64}) with 1 direct children

It looks like the * method that gets used for a BandedMatrix has some inference failures. If those are fixable, it should resolve the invalidations. And of course you might get a performance boost as well.
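One way to hunt for them is Cthulhu’s @descend (a minimal sketch, assuming Cthulhu is installed):

julia> using Cthulhu, BandedMatrices

julia> B = BandedMatrix(0 => zeros(10)); v = rand(size(B, 2));

julia> @descend B * v   # entries displayed as ::Any mark the inference failures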


Indeed, Cthulhu seems to hint at something

• %407 = invoke _banded_muladd!(::Float64,::SubArray{…},::Array{…},::Float64,::SubArray{…})::Any

but frustratingly I hit `UndefVarError`: `mi` not defined · Issue #333 · JuliaDebug/Cthulhu.jl · GitHub, so I guess it’ll take a while to get to the bottom of this. Thanks a lot for looking into this!

Can you just change line 476 to additional_descend(get_mi(curs))?

For a good PR the hard part will be capturing a test, but this seems likely to fix it.

Thanks, that helps! The issue might be that _banded_muladd! is recursive, which leads to the inference failure:

function _banded_muladd!(α::T, A, x::AbstractVector, β, y) where T
    m, n = size(A)
    (length(y) ≠ m || length(x) ≠ n) && throw(DimensionMismatch("*"))
    l, u = bandwidths(A)
    if -l > u # no bands at all: y is just scaled by β
        _fill_lmul!(β, y)
    elseif l < 0 # negative lower bandwidth: trim leading columns and recurse
        _banded_muladd!(α, view(A, :, 1-l:n), view(x, 1-l:n), β, y)
    elseif u < 0 # negative upper bandwidth: trim leading rows and recurse
        y[1:-u] .= zero(T)
        _banded_muladd!(α, view(A, 1-u:m, :), x, β, view(y, 1-u:m))
        y
    else # bandwidths are nonnegative: call the banded gbmv! kernel
        _banded_gbmv!('N', α, A, x, β, y)
    end
end

Good guess, as recursive calls that change some of the argument types present real challenges for inference. If in the outer call you can predict which branch the inner call will take (i.e., the output of bandwidths is predictable from the view), then presumably this is cleanly fixable. If not, a good band-aid might be to make the self-calls Base.invokelatest(_banded_muladd!, α, ...).
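For illustration, a sketch of the “predictable branch” version: after trimming, each view has nonnegative bandwidths ((0, u + l) in the l < 0 case, (l + u, 0) in the u < 0 case), so the recursion always lands in the _banded_gbmv! branch and the self-calls can be replaced with direct kernel calls. (This is only a sketch, not necessarily what an actual fix would look like.)

function _banded_muladd!(α::T, A, x::AbstractVector, β, y) where T
    m, n = size(A)
    (length(y) ≠ m || length(x) ≠ n) && throw(DimensionMismatch("*"))
    l, u = bandwidths(A)
    if -l > u # no bands at all: y is just scaled by β
        _fill_lmul!(β, y)
    elseif l < 0 # trimmed view has bandwidths (0, u + l), both nonnegative
        _banded_gbmv!('N', α, view(A, :, 1-l:n), view(x, 1-l:n), β, y)
    elseif u < 0 # trimmed view has bandwidths (l + u, 0), both nonnegative
        y[1:-u] .= zero(T)
        _banded_gbmv!('N', α, view(A, 1-u:m, :), x, β, view(y, 1-u:m))
        y
    else
        _banded_gbmv!('N', α, A, x, β, y)
    end
end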

This does seem cleanly fixable, and I’ve created a PR. This fix resolves the precompilation issue completely.
