Why does other computation affect the time consumed by Threads.@threads?

While testing my code, I found that Threads.@threads takes longer in the actual environment than in the test environment. I suspect the reason is that other calculations in the actual environment affect the performance of Threads.@threads, so I wrote a demo to show this:

using TimerOutputs
using LinearAlgebra
using Polyester

@show Threads.nthreads()
# Keep BLAS single-threaded so that only Julia threads provide parallelism
LinearAlgebra.BLAS.set_num_threads(1)

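# Serial workload: 1000 multiplications into the same output matrix C
function testmul0(A,B,C)
    for _ in 1:1000
        mul!(C,A,B)
    end
end
# Serial reference: 100 multiplications, one per output matrix in C
function testmul(A,B,C)
    for i in 1:100
        mul!(C[i],A,B)
    end
end
# Threaded version of testmul: the 100 multiplications are split across threads
function testmul_thread(A,B,C)
    Threads.@threads for i in 1:100
        mul!(C[i],A,B)
    end
end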

# Grouped: 20 serial calls first, then 20 threaded calls back to back
function testmuls(A,B,C)
    for i in 1:20
        @timeit "testmul0" testmul0(A,B,C[1])
    end
    for i in 1:20
        @timeit "testmul_th" testmul_thread(A,B,C)
    end
end

# Interleaved: every threaded call directly follows a serial call
function testmuls2(A,B,C)
    for i in 1:20
        # @timeit "testmul" testmul(A,B,C)
        @timeit "testmul0" testmul0(A,B,C[1])
        @timeit "testmul_th" testmul_thread(A,B,C)
    end
end

A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:100]
testmuls(A,B,C);    # first time run
testmuls2(A,B,C);

function bar()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:100]
    reset_timer!()
    testmuls(A,B,C)
    show(TimerOutputs.get_defaulttimer());
    reset_timer!()
    testmuls2(A,B,C)
    show(TimerOutputs.get_defaulttimer());
end
bar()

Result (the first table is from testmuls, the second from testmuls2):

Threads.nthreads() = 5
 ───────────────────────────────────────────────────────────────────────
                               Time                    Allocations      
                      ───────────────────────   ────────────────────────
   Tot / % measured:       6.24s / 100.0%           55.4KiB /  96.7%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul0         20    6.10s   97.9%   305ms     0.00B    0.0%    0.00B
 testmul_th       20    131ms    2.1%  6.57ms   53.6KiB  100.0%  2.68KiB
 ───────────────────────────────────────────────────────────────────────

 ───────────────────────────────────────────────────────────────────────
                               Time                    Allocations      
                      ───────────────────────   ────────────────────────
   Tot / % measured:       6.32s / 100.0%           55.8KiB /  96.8%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul0         20    6.10s   96.6%   305ms     0.00B    0.0%    0.00B
 testmul_th       20    218ms    3.4%  10.9ms   54.0KiB  100.0%  2.70KiB
 ───────────────────────────────────────────────────────────────────────

testmul0 is just a time-consuming serial calculation.
In testmuls, I call testmul0 and testmul_thread (timer label testmul_th) repeatedly in two separate for loops, and testmul_thread takes only 6.57 ms per call on average. But when I call testmul0 and testmul_thread alternately inside a single for loop (testmuls2), the average time of testmul_thread increases to 10.9 ms.
So why does the testmul0 calculation affect the time taken by testmul_thread?
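
To rule out an effect of the averaging in TimerOutputs, the individual call times could also be recorded directly with @elapsed. The following is only a minimal sketch of that idea and not part of the measurement above: time_threaded, its reps and interleave keywords, and the n keyword of testmul0 are names made up for illustration, and the matrix sizes are simply copied from the demo. It prints the minimum and median per-call time for the grouped and the interleaved pattern, so it should show whether every threaded call gets slower or only some of them.

```julia
using LinearAlgebra
using Statistics

LinearAlgebra.BLAS.set_num_threads(1)

# Same threaded kernel as in the demo.
function testmul_thread(A, B, C)
    Threads.@threads for i in eachindex(C)
        mul!(C[i], A, B)
    end
end

# Same serial workload as in the demo; `n` is a hypothetical knob.
function testmul0(A, B, C; n = 1000)
    for _ in 1:n
        mul!(C, A, B)
    end
end

# Time `reps` calls of the threaded kernel. If `interleave` is true, the
# serial workload runs before each timed call (the testmuls2 pattern).
# Returns the individual timings in seconds.
function time_threaded(A, B, C; reps = 20, interleave = false)
    times = Float64[]
    for _ in 1:reps
        if interleave
            testmul0(A, B, C[1])
        end
        t = @elapsed testmul_thread(A, B, C)
        push!(times, t)
    end
    return times
end

A = rand(100, 100)
B = rand(100, 100)
C = [rand(100, 100) for _ in 1:100]

# Warm-up run so that compilation is not included in the timings.
time_threaded(A, B, C; reps = 1, interleave = true)

back_to_back = time_threaded(A, B, C)
interleaved  = time_threaded(A, B, C; interleave = true)

println("back-to-back: min = ", minimum(back_to_back), " s, median = ", median(back_to_back), " s")
println("interleaved:  min = ", minimum(interleaved), " s, median = ", median(interleaved), " s")
```

Changing n in the call to testmul0 inside time_threaded would also show whether the slowdown depends on how long the serial work runs between two threaded calls.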

Thank you very much.