I have an algorithm that spends almost 2/3 of its time in a function that draws a subset of the rows of a matrix and then multiplies that subset by some weights. The number of selected rows is much smaller than the total number of rows (`number of selected rows << N`).

I have tried to optimize it by preallocating everything, but with array slicing the multiplication is not very performant:

```
using LinearAlgebra, BenchmarkTools

dem_tpartact = zeros(6000, 100)
dem_weights = rand(100, 100)
dem_data = rand(1_000_000, 100)
tmp_data = zeros(6000, 100)
dem_rp = floor.(Int, rand(6000) .* 1_000_000 .+ 1)  # selection of random row indices

# Multiply a view of the selected rows by the weights (tmp is unused here)
function permute_view!(tpartact::Array, weights::Array, data::Array, tmp::Array, rp::Vector{Int64})
    @views mul!(tpartact, data[rp, :], weights)
end

@btime permute_view!($dem_tpartact, $dem_weights, $dem_data, $tmp_data, $dem_rp);
# 42.953 ms (3 allocations: 30.75 KiB)
```

So I added a step where I copy the selected rows into a preallocated temporary array and multiply that.

```
function permute_view2!(tpartact::Array, weights::Array, data::Array, tmp::Array, rp::Vector{Int64})
    # Copy each selected row of data into the matching row of tmp
    for idx in eachindex(rp)
        @views tmp[idx, :] .= data[rp[idx], :]
    end
    # Multiply the dense temporary array by the weights
    mul!(tpartact, tmp, weights)
end

@btime permute_view2!($dem_tpartact, $dem_weights, $dem_data, $tmp_data, $dem_rp);
# 5.170 ms (0 allocations: 0 bytes)
```

This works pretty well, but it is still the main bottleneck of my code.
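For reference, the copy step in `permute_view2!` can also be written with explicit loops, with the column index outermost so that writes into `tmp` run down a column, which matches Julia's column-major storage. This is only a sketch under assumptions: `gather_rows!` and `permute_gather!` are hypothetical names that are not part of my actual code, the sizes in the usage lines are deliberately small, and I have not measured whether this ordering is faster than the broadcast version.

```julia
using LinearAlgebra

# Hypothetical helper: copy the rows of `data` listed in `rp` into `tmp`.
function gather_rows!(tmp::Matrix{Float64}, data::Matrix{Float64}, rp::Vector{Int})
    @assert size(tmp, 1) == length(rp) && size(tmp, 2) == size(data, 2)
    for j in axes(data, 2)         # columns outermost (column-major storage)
        for i in eachindex(rp)     # writes to tmp[:, j] are contiguous
            tmp[i, j] = data[rp[i], j]
        end
    end
    return tmp
end

# Hypothetical wrapper: gather the selected rows, then multiply in place.
function permute_gather!(tpartact, weights, data, tmp, rp)
    gather_rows!(tmp, data, rp)
    mul!(tpartact, tmp, weights)
end

# Small usage example (sizes chosen only for illustration)
data = rand(1_000, 8)
weights = rand(8, 8)
rp = rand(1:size(data, 1), 50)
tmp = zeros(50, 8)
out = zeros(50, 8)
permute_gather!(out, weights, data, tmp, rp)
```

The result should agree with the allocating expression `data[rp, :] * weights`, which makes it easy to check before benchmarking with `@btime`.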

I compared the timings to a Python version that I adapted. Without this step, the Julia code runs 5-20x faster than the Python version, but with it included, the Julia version is slightly slower.

Is there anything more to be done here to make it more performant? Or maybe some other strategy that would be more efficient?

Edit:

For context, here is the analogous Python code with timings:

```
import numpy as np

data = np.random.rand(1000000, 100)
weights = np.random.rand(100, 100)
%%timeit
rp = np.floor(np.random.uniform(0, 1, 6000) * (1000000 - 1))
tpartact = np.dot(data[rp.astype(int), :], weights).T
# 5.16 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```