Hi,

For strongly memory-bound problems (like the array sum in your previous post), a typical desktop may give some multithreading speed-up (2x to 4x) if it has several (2 to 4) memory channels. This is usually not the case on a laptop. Here, however, a floating-point division takes several cycles, so the loop is less purely memory bound and multithreading could be worthwhile.

In any case, I understood (from the previous link) that multithreading is not implicit for broadcast, which is a good thing because it is difficult to anticipate whether your code will be nested in another multithreaded context.

A possible solution, if you have a strong interest in defining A and B separately, would be to fuse the div operation with the algorithm that uses V. A lazy definition of V could be nice.
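To make the fusion idea concrete, here is a minimal sketch. It assumes that V is the elementwise quotient `A ./ B` and that the downstream algorithm is a plain sum; both are assumptions on my side, and the names `fused_sum` and `fused_sum_loop` are made up for the illustration.

```julia
# Assumption: V = A ./ B and the consumer of V is a sum.
# The generator computes a / b on the fly, so V is never stored in memory.
fused_sum(A, B) = sum(a / b for (a, b) in zip(A, B))

# Same idea written as an explicit loop, which leaves room for @simd.
function fused_sum_loop(A::Vector, B::Vector)
    length(A) == length(B) || throw(DimensionMismatch("A and B must have the same length"))
    total = zero(eltype(A)) / one(eltype(B))  # accumulator with the type of a / b
    @simd for i in eachindex(A, B)
        @inbounds total += A[i] / B[i]
    end
    return total
end

A = rand(1000)
B = rand(1000) .+ 1.0           # keep the denominators away from zero
@assert fused_sum(A, B) ≈ sum(A ./ B)
@assert fused_sum_loop(A, B) ≈ sum(A ./ B)
```

Both versions read A and B once and skip the write and re-read of a temporary V, which is where the memory traffic goes in the unfused version.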

The curve corresponding to an MT+SIMD version of your sum:

And the corresponding snippet:

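```
function mysum_simd_threads(a::Vector)
    total = zero(eltype(a))
    n = length(a)
    nchunk = 4                              # number of chunks (one task each)
    partialsum = zeros(eltype(a), nchunk)   # one slot per chunk, no shared accumulator
    chunksize = n ÷ nchunk
    Threads.@threads for c = 1:nchunk
        imin = (c - 1) * chunksize + 1
        imax = imin + chunksize - 1
        stotal = zero(eltype(a))            # local accumulator for this chunk
        @simd for i = imin:imax
            @inbounds stotal += a[i]
        end
        partialsum[c] = stotal
    end
    total = sum(partialsum)
    # Remaining elements when n is not a multiple of nchunk
    for i = nchunk*chunksize+1:n
        total += a[i]
    end
    return total
end
```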
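For what it is worth, here is a hedged variant of the same idea that takes the chunk count from `Threads.nthreads()` instead of hard-coding 4, and splits the index range so that no tail loop is needed. The name `mysum_chunks` and the `nchunk` keyword are mine, not part of the snippet above, and I have not benchmarked this version.

```julia
# Sketch only: same chunked reduction, with a configurable chunk count.
function mysum_chunks(a::Vector; nchunk::Int = Threads.nthreads())
    n = length(a)
    partialsum = zeros(eltype(a), nchunk)
    Threads.@threads for c in 1:nchunk
        # The ranges lo:hi tile 1:n exactly, so there is no remainder to handle.
        lo = (c - 1) * n ÷ nchunk + 1
        hi = c * n ÷ nchunk
        s = zero(eltype(a))
        @simd for i in lo:hi
            @inbounds s += a[i]
        end
        partialsum[c] = s
    end
    return sum(partialsum)
end

a = rand(10^6)
@assert mysum_chunks(a) ≈ sum(a)   # ≈ because the summation order differs
```

Remember to start Julia with several threads (for example `julia -t 4`), otherwise `Threads.nthreads()` is 1 and everything runs in a single chunk.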