Slower with threads

The goal is to avoid creating new arrays, that is, allocating memory on the heap. If you do need to allocate memory, see whether you can reuse it. This is what the in-place ("bang") functions such as mul! are for. Another method is broadcast assignment with .=, which writes the result into an existing array.

julia> @allocated A = rand(1:10, 1024, 1024) # Allocates 8 MiB
8388656

julia> @allocated B = rand(1:10, 1024, 1024) # Allocates 8 MiB
8388656

julia> @allocated C = zeros(1024, 1024) # Allocates 8 MiB
8388656

julia> @allocated A .+ B # Allocates 8 MiB to store result
8388720

julia> @allocated C .= A .+ B # Allocates only 64 bytes, avoids 8 MiB allocation
64

julia> using LinearAlgebra

julia> @allocated A * B # allocations due to compilation
115689839

julia> @allocated A * B # > 8 MiB allocated to store result
8420144

julia> @allocated mul!(C, A, B) # allocation due to compilation
123465049

julia> @allocated mul!(C, A, B) # ~31 KiB needed for multiplication, avoid allocating 8 MiB
31488

julia> @allocated C .= A .* B # elementwise multiplication avoids allocating 8 MiB
64
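
The same idea carries over to loops: allocate the output buffers once, outside the loop, and overwrite them on every iteration. Here is a minimal sketch, assuming Float64 matrices. The helper name accumulate_products! and the arguments S and n are made up for illustration and are not from any library.

```julia
using LinearAlgebra

# Hypothetical helper: computes C = A * B n times and sums the results
# into S, reusing the preallocated buffers C and S on every iteration.
function accumulate_products!(S, C, A, B, n)
    fill!(S, 0)            # reset the accumulator in place
    for _ in 1:n
        mul!(C, A, B)      # writes the product into C, no new matrix
        S .+= C            # in-place broadcast, no temporary array
    end
    return S
end

A = rand(64, 64)
B = rand(64, 64)
C = similar(A)             # allocated once, reused inside the loop
S = similar(A)

accumulate_products!(S, C, A, B, 1)    # first call includes compilation
bytes = @allocated accumulate_products!(S, C, A, B, 100)
println("bytes allocated over 100 iterations: ", bytes)
```

Writing S = S + A * B inside the loop instead would allocate two new matrices per iteration. The exact byte count reported by @allocated depends on the Julia version and element type, so measure it on your own setup.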

StaticArrays.jl offers an optimization for small, fixed-size arrays. The broader strategy is to control memory allocation tightly.