Extra allocations in @threads loop in v1.2

Consider the following MWE of threaded element-wise computation:

using Base.Threads
using LinearAlgebra
using BenchmarkTools

const N = 100000
a = rand(N)
b = rand(N)

function foo(a, b)
    @threads for i ∈ eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
end

@btime foo($a, $b)

I tested it with the official binaries of both Julia 1.1.1 and Julia 1.2.0:

~/codes » /home/opt/julia-1.1.1/bin/julia test.jl
  75.924 μs (1 allocation: 32 bytes)

~/codes » /home/opt/julia-1.2.0/bin/julia test.jl
  114.931 μs (133 allocations: 13.53 KiB)

The machine is:

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 20

There are significantly more allocations and worse performance in 1.2 with this code. Is there anything I should change for the new version (the changelog suggests not)? Or could someone reproduce or confirm this?
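To help narrow this down, here is a minimal sketch that compares the threaded loop against a plain serial loop using only `@allocated` from Base, so the difference can be attributed to `@threads` itself rather than to the loop body. The names `foo_threaded!`, `foo_serial!` and `measure` are hypothetical helpers introduced for this illustration, and the byte counts it prints will depend on the Julia version and on `JULIA_NUM_THREADS`.

```julia
using Base.Threads

# Same loop body as foo above, with @threads.
function foo_threaded!(a, b)
    @threads for i in eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
    return nothing
end

# Identical loop body without threading, as a baseline.
function foo_serial!(a, b)
    for i in eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
    return nothing
end

# Hypothetical helper: bytes allocated by a single call of f(a, b).
measure(f, a, b) = @allocated f(a, b)

x = rand(100_000)
y = rand(100_000)

# Warm up so that compilation is not counted in the measurement.
measure(foo_serial!, copy(x), y)
measure(foo_threaded!, copy(x), y)

xs = copy(x)
xt = copy(x)
bytes_serial = measure(foo_serial!, xs, y)
bytes_threaded = measure(foo_threaded!, xt, y)

println("threads:  ", nthreads())
println("serial:   ", bytes_serial, " bytes")
println("threaded: ", bytes_threaded, " bytes")
```

If the serial version reports no allocations while the threaded one does, the extra allocations come from the threading machinery and not from the element-wise computation.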

Thanks!

This may be related to this issue.

Just guessing, but it could have something to do with the use of invokelatest here:

https://github.com/JuliaLang/julia/blob/c6da87ff4bc7a855e217856757ad3413cf6d1f79/base/threadingconstructs.jl#L71

which was introduced in https://github.com/JuliaLang/julia/pull/30838.
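A rough way to probe that guess is to call the same function once directly and once through `Base.invokelatest`, and compare the allocations. `Base.invokelatest` dispatches dynamically and its return type cannot be inferred, so it can add a small per-call cost. This is only a sketch: `body!`, `call_direct!`, `call_latest!` and `measure` are hypothetical names, and it does not reproduce what `@threads` does internally, so whether it accounts for the numbers reported above is an assumption.

```julia
# Hypothetical stand-in for the loop body of foo, without threading.
function body!(a, b)
    for i in eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
    return nothing
end

# Direct call: resolved statically by the compiler.
call_direct!(a, b) = body!(a, b)

# Call through invokelatest: dynamic dispatch in the latest world age.
call_latest!(a, b) = Base.invokelatest(body!, a, b)

# Hypothetical helper: bytes allocated by a single call of f(a, b).
measure(f, a, b) = @allocated f(a, b)

a = rand(1000)
b = rand(1000)

# Warm up so that compilation is not counted in the measurement.
measure(call_direct!, copy(a), b)
measure(call_latest!, copy(a), b)

a1 = copy(a)
a2 = copy(a)
alloc_direct = measure(call_direct!, a1, b)
alloc_latest = measure(call_latest!, a2, b)

println("direct call:  ", alloc_direct, " bytes")
println("invokelatest: ", alloc_latest, " bytes")
```

Both paths compute the same result; only the call mechanism differs, so any gap between the two numbers is the overhead of the indirect call on that Julia version.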