Help improve my threaded map function?

I tried writing a map function for array arguments that is parallelized with @threads. Very roughly imitating how Base.map seems to infer the element type of its result, I came up with:

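using Base.Threads  # provides @threads

function tmap(f,args::AbstractArray...)
    # materialize the argument tuples so they can be indexed from each thread
    cargs = collect(zip(args...))
    n = length(cargs)
    # infer the result element type from the types of the first argument tuple
    T = Core.Inference.return_type(f, Tuple{map(typeof,cargs[1])...})
    ans = Vector{T}(n)
    @threads for i=1:n
        ans[i] = f(cargs[i]...)
    end
    ans
end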

It works, but I was surprised to see that if I replace the type inference line above with just T=Any, I get a ~20% speed improvement. Here’s what I benchmarked, which just multiplies a bunch of matrices together:

using BenchmarkTools  # provides @benchmark

m = [randn(128,128) for i=1:10]
@benchmark tmap(*,$m,$m)

I get 100 ms for the version with the inferred T, but 80 ms for the T=Any version.
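
For concreteness, the T=Any variant being compared can be sketched roughly as below. This is only an illustration: the name tmap_any is hypothetical, and it is written in Julia 1.x syntax (`Vector{Any}(undef, n)`), whereas the code above uses the older `Vector{T}(n)` constructor form. The small 8×8 matrices are just a quick sanity check, not the benchmark input.

```julia
using Base.Threads  # provides @threads

# Hypothetical name: same structure as tmap, but the element type is fixed to Any
# instead of being inferred.
function tmap_any(f, args::AbstractArray...)
    cargs = collect(zip(args...))
    n = length(cargs)
    out = Vector{Any}(undef, n)   # Julia 1.x constructor syntax (assumption)
    @threads for i = 1:n
        out[i] = f(cargs[i]...)
    end
    out
end

# Small sanity check that the result matches an ordinary map
m_small = [randn(8, 8) for i = 1:4]
r = tmap_any(*, m_small, m_small)
```

The only difference from tmap is the element type of the output vector, so any timing gap between the two comes from how results are stored in a `Vector{Any}` versus a concretely typed vector, or from the inference call itself.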

The overhead of the call to Core.Inference.return_type itself appears negligible. Any ideas what’s going on?