What @threads effectively does is wrap the loop body in an anonymous function (a closure) that captures the variables it uses from the surrounding scope. You should be able to see that if you run @macroexpand on the loop. Captured variables are tricky for the compiler to optimize, particularly when they are reassigned somewhere in the enclosing scope, and that often leads to additional allocations.
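To make the closure visible, here is a minimal sketch of the @macroexpand check. The array name out is a placeholder of mine (it does not even need to exist, since the macro is only expanded, not run), and the internal names in the output differ between Julia versions.

```julia
using Base.Threads

# Expand the macro without executing the loop.
# `out` is a hypothetical name; expansion never evaluates it.
expanded = @macroexpand @threads for i in 1:4
    out[i] = i^2
end

# The printed expression should show the loop body wrapped in a
# generated local function, which is the closure doing the capturing.
println(expanded)
```

Reading the full expansion is a bit noisy, but looking for the function definition that contains your loop body is usually enough.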
One way around that may be to wrap the captured variable in a Ref and access it via r[] in the loop body to get the original object back. This explicit, concretely typed boxing is much easier on the compiler than the box it would otherwise create by itself.
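As a rough sketch of the Ref workaround: the function names scale_boxed and scale_ref and the variable factor are made up for illustration, and I am assuming the problematic variable is one that gets reassigned before the loop, which is the usual trigger for the compiler's own boxing.

```julia
using Base.Threads

# Hypothetical example: `factor` is assigned twice and then captured by the
# closure that @threads generates, so the compiler typically boxes it.
function scale_boxed(xs)
    factor = 1.0
    factor = 2.0
    out = similar(xs)
    @threads for i in eachindex(xs)
        out[i] = factor * xs[i]
    end
    return out
end

# Same computation, but the closure captures `r`, a Ref{Float64} that is
# assigned exactly once; `r[]` reads the value with a known concrete type.
function scale_ref(xs)
    factor = 1.0
    factor = 2.0
    r = Ref(factor)
    out = similar(xs)
    @threads for i in eachindex(xs)
        out[i] = r[] * xs[i]
    end
    return out
end
```

Both versions return the same result; comparing them with @allocated or @code_warntype on your own code should show whether the boxing was really the source of the allocations.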
A different approach would be to ask whether the overhead of inter-thread communication alone is too high for the parallelization to be worthwhile. That requires a more careful analysis of your original code, though.
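A quick way to get a first impression is to time a serial and a threaded version of the same loop. The sketch below uses a stand-in workload (sin over a small array) and made-up function names, so it only illustrates the method; your actual loop body and problem size are what matter.

```julia
using Base.Threads

# Hypothetical stand-in workload: elementwise sin, serial version.
function fill_serial!(out, xs)
    for i in eachindex(xs, out)
        out[i] = sin(xs[i])
    end
    return out
end

# The same loop with @threads.
function fill_threaded!(out, xs)
    @threads for i in eachindex(xs, out)
        out[i] = sin(xs[i])
    end
    return out
end

xs = rand(1_000)          # deliberately small, so overhead can dominate
out = similar(xs)

# Run each once first so compilation time is not measured.
fill_serial!(out, xs)
fill_threaded!(out, xs)

t_serial = @elapsed fill_serial!(out, xs)
t_threaded = @elapsed fill_threaded!(out, xs)

println("threads available: ", nthreads())
println("serial:   ", t_serial, " s")
println("threaded: ", t_threaded, " s")
```

A single @elapsed measurement is noisy, so for anything beyond a rough check a benchmarking package such as BenchmarkTools is the better tool. If the threaded version is not clearly faster at your real problem size, the overhead is probably eating the gain.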