I have two parallel arrays `a` and `b` of type `Vector{Complex128}`, and each of them is very large (say, each contains `2^30` elements, so each requires around 16 GB of RAM). I would like to multiply these arrays elementwise on a machine with many cores available. The obvious way is to use the “dot” syntax

```
a .*= b
```

but this only uses a single core. At the moment there is no parallel version of `broadcast`, so I am trying to figure out what the next best alternative may be. As far as I can tell, none of the BLAS operations fit this pattern (`BLAS.gbmv!` comes close, except I would need to pass `a` as two different arguments, and I am pretty sure BLAS assumes non-overlapping memory). I also tried using `Threads.@threads`, but this did not speed things up (I can elaborate more on this if necessary).
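For concreteness, the threaded attempt looks roughly like this (a sketch; the name `threaded_mul!` is mine, and it assumes Julia was started with multiple threads, e.g. `JULIA_NUM_THREADS=8`):

```julia
# In-place elementwise multiply, splitting the iteration range
# across the available threads. Written generically so it works
# for Complex128 (or any element type supporting *).
function threaded_mul!(a::Vector{T}, b::Vector{T}) where T
    Threads.@threads for i in eachindex(a, b)
        @inbounds a[i] *= b[i]
    end
    return a
end
```

My understanding (which may be wrong) is that an operation like this, touching 32 GB of data while doing one multiply per element, is likely memory-bandwidth bound, which could explain why adding threads did not help.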

What is the most efficient way to implement such a calculation? I’m getting close to implementing this part of the calculation as a kernel written in C using pthreads.

Thanks!