Piracy hunting

I made a script to detect piracy, and I’d like feedback. Eventually I hope to put the functionality in Aqua.jl. There is already an old thread with a similar script here: Pirate Hunter. However, the script to longer works for me, as it both have quite many false positives, and it also fails to detect some pretty blatant cases of piracy, so I rewrote it. Thanks to @oxinabox for the original implementation

How it works

Load a package MyPkg, then do hunt(MyPkg)

  • First, it traverses all objects reachable from the given module which are of type Union{Function, Type}, and calls methods on it, filtering away all methods originating in Core or Base, as these are assumed to not be piracy
  • It filters to keep methods that 1) originate from MyPkg, and 2) are pirates
  • Pirate methods is defined as a method defined in package X where neither the function, nor any of its arguments are from package X. The definition can get a little tricky with parametric types, varargs and unions, but I think my definition makes sense.

Current problems

  • Its suuuper slow, since it needs to traverse and examinate every reachable object. It takes about 2 minutes. Not sure how to improve it, but feedback is very welcome Edit: Thanks to comment from @kristoffer.carlsson, it’s now reasonably fast!
  • I’m not 100% sure my definition of piracy is correct for edge cases. I tried on a few of my own packages, but I’d like feedback
const Callable = Union{Function, Type}
const DEFAULT_PKGS = (Base.PkgId(Base), Base.PkgId(Core))

function all_methods(
    mod::Module,
    done_modules::Base.IdSet{Module},     # cached to prevent inf loops
    done_callables::Base.IdSet{Callable}, # cached to prevent inf loops
    result::Vector{Method},
    filter_default::Bool
)::Vector{Method}
    push!(done_modules, mod)
    for name in names(mod; all=true, imported=true)
        # names can list undefined symbols which cannot be eval'd
        isdefined(mod, name) || continue
        
        # Skip closures
        first(String(name)) == '#' && continue
        val = Core.eval(mod, name)
        
        if val isa Module
            if !in(val, done_modules)
                all_methods(val, done_modules, done_callables, result, filter_default)
            end
        elseif val isa Callable
            if !in(val, done_callables)
                for method in methods(val)
                    # Default filtering removes all methods defined in DEFAULT_PKGs,
                    # since these may pirate each other.
                    if !(filter_default && in(Base.PkgId(method.module), DEFAULT_PKGS))
                        push!(result, method)
                    end
                end
            end
            push!(done_callables, val)
        end
    end
    result
end

function all_methods(mod::Module; filter_default::Bool=true)
    all_methods(mod, Base.IdSet{Module}(), Base.IdSet{Callable}(), Method[], filter_default)
end

##################################
# Generic fallback
is_foreign(@nospecialize(x), pkg::Base.PkgId) = is_foreign(typeof(x), pkg)
is_foreign(mod::Module, pkg::Base.PkgId) = Base.PkgId(mod) != pkg

function is_foreign(@nospecialize(T::DataType), pkg::Base.PkgId)
    params = T.parameters
    # For Type{Foo}, we consider it to originate from the same as Foo
    if Base.typename(T).wrapper === Type
        return is_foreign(only(params), pkg)
    else
        # Both the type itself and all of its parameters must be foreign
        return is_foreign(T.name.module, pkg) && all(params) do param
            is_foreign(param, pkg)
        end
    end
end

function is_foreign(@nospecialize(U::UnionAll), pkg::Base.PkgId)
    # We do not consider extending Set{T} to be piracy, if T is not foreign.
    # Extending it goes against Julia style, but it's not piracy IIUC.
    is_foreign(U.body, pkg) && is_foreign(U.var, pkg)
end

is_foreign(@nospecialize(T::TypeVar), pkg::Base.PkgId) = is_foreign(T.ub, pkg)
is_foreign(@nospecialize(T::Core.TypeofVararg), pkg::Base.PkgId) = is_foreign(T.T, pkg)

function is_foreign(@nospecialize(U::Union), pkg::Base.PkgId)
    # Even if Foo is local, overloading f(::Union{Foo, Int}) with foreign f 
    # is piracy.
    any(Base.uniontypes(U)) do T
        is_foreign(T, pkg)
    end
end

function is_pirate(meth::Method)
    method_pkg = Base.PkgId(meth.module)

    signature = meth.sig
    while signature isa UnionAll
        signature = signature.body
    end

    all(signature.parameters) do parameter
        is_foreign(parameter, method_pkg)
    end
end

#######################################
hunt(;from::Module=Main) = filter(is_pirate, all_methods(from))
hunt(mod::Module; from::Module=Main) = hunt(Base.PkgId(mod); from=from)
hunt(pkg::Base.PkgId; from::Module=Main) = filter(all_methods(from)) do method
    is_pirate(method) &&
    Base.PkgId(method.module) === pkg
end
20 Likes

Use a Base.IdSet over a Set and perf should be better.

julia> @time hunt() # default
 89.321905 seconds (326.64 M allocations: 15.402 GiB, 4.56% gc time, 98.73% compilation time)

julia> @time hunt(); # `IdSet
  0.546979 seconds (1.10 M allocations: 60.358 MiB, 0.80% gc time)
18 Likes

That’s fantastic, thanks! Real easy ~150x speedup :smiley:

3 Likes

Wow nice!

Eventually I hope to put the functionality in Aqua.jl

This would be amazing. I am a big fan of Aqua!

1 Like

Doesn’t it say 98.73% compilation time for the first call?
So the speedup might be slower?

1.27% of 89 seconds is more than one second as far as I can tell, so the sped up function looks faster also at runtime?

1 Like

Yeah agree, I wanted to point out that it’s not 150x :smiley:

1 Like

Way to do something about global warming. Us Pastafarians salute you.

You will likely run this once. So what matters is end to end time.

5 Likes

Just to clarify: the second call didn’t include any compilation time, does it?
But it should have, if it’s called the first time?

I don’t remember. You can run them again to get a more accurate response if you want.

This may be a stupid suggestion, but couldn’t you do this analysis on the code rather than on the loaded module? So you parse the code, collect all imported functions and check that they are never given methods without locally defined types in the signature. Seems like that would be faster.

You were right!

But why is the compilation time of Set vs Base.IdSet so bad?

julia> include("/tmp/piracy_IdSet.jl")
hunt (generic function with 3 methods)

julia> @time hunt()
  0.485699 seconds (1.43 M allocations: 77.007 MiB, 1.98% gc time, 15.50% compilation time)
[1] adjoint(B::Union{BitMatrix, BitVector}) in LinearAlgebra at /home/fxw/.julia/juliaup/julia-1.8.2+0.x64
[...]

julia> @time hunt()
 0.394691 seconds (1.32 M allocations: 71.240 MiB, 1.70% gc time)
[1] adjoint(B::Union{BitMatrix, BitVector}) in LinearAlgebra at /home/fxw/.julia/juliaup/julia-1.8.2+0.x64
[...]
...


# new shell
julia> include("/tmp/piracy_set.jl")
hunt (generic function with 3 methods)

julia> @time hunt()
 73.374539 seconds (455.11 M allocations: 21.454 GiB, 5.86% gc time, 98.89% compilation time)
[1] adjoint(B::Union{BitMatrix, BitVector}) in LinearAlgebra at /home/fxw/.julia/juliaup/julia-1.8.2+0.x64
[...]

julia> @time hunt()
  0.424547 seconds (1.34 M allocations: 71.242 MiB, 3.20% gc time)
[1] adjoint(B::Union{BitMatrix, BitVector}) in LinearAlgebra at /home/fxw/.julia/juliaup/julia-1.8.2+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/bitarray.jl:237
[...]

It might be specializing the push! method on every function (which are all different types) or something along those lines.

1 Like

It might be specializing the push! method on every function (which are all different types) or something along those lines.

That’s it. The normal script compiles 380 specializations (according to SnoopCompile) - without Base.IdSet, it compiles nearly 191,000.
The script is also inherently type unstable, since the value of val cannot possibly be known at compile time. So even after all these methods are compiled, it’s slow to find the correct matching specialization out of the thousands compiled.

That’s a general pattern in Julia: If your code is very type unstable (with lots of possible types), it’s more efficient to not specialize your code than it is to specialize.

2 Likes