I think that if you call the BLAS then there is no way julia can avoid running the whole code (it cannot know if the code compiled in another language is afecting the program or not).
In this case all the performance problem comes form a tiny detail that is hard to get.
vv_k*W
will not call the BLAS in this case (because the types of vv_k and W are different)
zer_0 = 0
pl_1 = 1
vv_k = rand([zer_0,pl_1],Nsamples,Nv+1)
vv_k*W
will call the blass in this case (because the types of vv_k and W are the same)
zer_0 = 0.
pl_1 = 1.
vv_k = rand([zer_0,pl_1],Nsamples,Nv+1)
And the time it takes to compute that can be very different
# vv_k*W timing
N = 3000 same type 0.431971 seconds
N = 3000 different type 36.492838 seconds
Here I did some tests
The random number generation you mention and the rewritting of the code to avoid allocations are important performance tricks you can use but they are irrelevant compared to the BLAS/not BLAS call when matrices contain 1000x1000 elements or more.