Allocations on threaded prefix algorithm

I’m going through the Parallel Computing course at juliaacademy.com. They give this example of a prefix code being rewritten in parallel. However, the video does not go through it, and when I run it on my own, I get a lot of memory allocations in the benchmark for prefix_threads! that I don’t know what it’s from.

Can anyone tell me why it’s doing so many allocations?

function prefix_serial!(y, ⊕)
    for i=2:length(y)
        y[i] = y[i-1] ⊕ y[i]
    end
    y
end

function prefix!(y, ⊕)
    l=length(y)
    k=ceil(Int, log2(l))
    @inbounds for j=1:k, i=2^j:2^j:min(l, 2^k)              #"reduce"
        y[i] = y[i-2^(j-1)] ⊕ y[i]
    end
    @inbounds for j=(k-1):-1:1, i=3*2^(j-1):2^j:min(l, 2^k) #"expand"
        y[i] = y[i-2^(j-1)] ⊕ y[i]
    end
    y
end

using .Threads
function prefix_threads!(y, ⊕)
    l=length(y)
    k=ceil(Int, log2(l))
    for j=1:k
        @threads for i=2^j:2^j:min(l, 2^k)       #"reduce"
            @inbounds y[i] = y[i-2^(j-1)] ⊕ y[i]
        end
    end
    for j=(k-1):-1:1
        @threads for i=3*2^(j-1):2^j:min(l, 2^k) #"expand"
            @inbounds y[i] = y[i-2^(j-1)] ⊕ y[i]
        end
    end
    return y
end

A = rand(500_000);

using BenchmarkTools
@btime prefix_serial!($(copy(A)), +);
@btime prefix!($(copy(A)), +);
@btime prefix_threads!($(copy(A)), +);

1.129 ms (0 allocations: 0 bytes)
2.130 ms (0 allocations: 0 bytes)
1.297 ms (1160 allocations: 126.78 KiB)