The allocations come from only two places: the original array that gets created, and the fact that `fetch` always infers as `Any`, which also explains why they grow with the number of threads. The threads are limited by memory bandwidth. Julia only reports heap allocations, not stack usage, and `UInt256` values take up quite a lot of stack, which explains why threads sometimes start waiting and adding more of them stops improving performance. You can try running the code with a faster processor than mine (my machine has 8 GB of RAM); you should get much better results.
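To illustrate the `fetch` point, here is a minimal sketch (not the code from this thread): `work`, `unasserted`, `asserted`, and `threaded_sum` are made-up names. It asks the compiler what it infers for a bare `fetch` on a `Task`, and then shows the usual workaround of asserting the result type at the call site.

```julia
# Hypothetical stand-in for the real per-task work.
work(x::Int) = 2x

# A bare fetch: the compiler cannot know what the task will return.
unasserted(t::Task) = fetch(t)

# Same call with a type assertion, so downstream code is type-stable.
asserted(t::Task) = fetch(t)::Int

# Ask inference what each wrapper returns.
println(Base.return_types(unasserted, (Task,)))
println(Base.return_types(asserted, (Task,)))

# Typical pattern: spawn one task per element, assert on fetch.
function threaded_sum(xs)
    tasks = [Threads.@spawn work(x) for x in xs]
    total = 0
    for t in tasks
        total += fetch(t)::Int
    end
    return total
end

println(threaded_sum(1:10))
```

The assertion only helps the code after the `fetch`; the task itself still hands its result back as an untyped value, so I would not expect it to remove every allocation.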
Bumper may get rid of the `Any` inference, but it won't make my RAM any bigger, since the serial version of the code (without the macro) doesn't allocate at all.
Also, I think sometimes you actually want a `BigInt`, because you can use it like a `Ref` and write code as if it were a one-element array. That is hard to do, but it may work even better than my version with BitIntegers. The heap is not always the enemy: used well, it can reduce stack pressure a lot.
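A rough sketch of what I mean by treating a `BigInt` like a one-element array. Big assumption here: it uses the in-place helpers in `Base.GMP.MPZ` (`set_si!`, `mul!`, `add!`), which are Julia internals and not public API, so they may change between versions. The function names and the sum-of-squares workload are made up for illustration.

```julia
# Internal module, not public API (assumption: available in your Julia version).
const MPZ = Base.GMP.MPZ

# Naive version: every `+=` and `^` creates a fresh BigInt on the heap.
function sum_squares_naive(n)
    acc = BigInt(0)
    for i in 1:n
        acc += BigInt(i)^2
    end
    return acc
end

# In-place version: `acc` and `tmp` are allocated once by the caller
# and then mutated, like writing into a one-element array.
function sum_squares_inplace!(acc::BigInt, tmp::BigInt, n)
    MPZ.set_si!(acc, 0)
    for i in 1:n
        MPZ.set_si!(tmp, i)   # tmp = i
        MPZ.mul!(tmp, tmp)    # tmp = tmp * tmp
        MPZ.add!(acc, tmp)    # acc = acc + tmp
    end
    return acc
end

acc, tmp = BigInt(), BigInt()

# Warm up so compilation is not counted.
sum_squares_naive(10)
sum_squares_inplace!(acc, tmp, 10)

naive_bytes = @allocated sum_squares_naive(1_000)
inplace_bytes = @allocated sum_squares_inplace!(acc, tmp, 1_000)
println((naive_bytes, inplace_bytes))
```

The data lives on the heap in both versions, but the in-place one reuses the same two buffers instead of creating new ones every iteration, and nothing large has to sit on the stack.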