Is there a faster way to do a matrix multiplication on a random subset of a matrix?

I have an algorithm that spends almost 2/3 of its time in a function that draws a random subset of a matrix's rows and then multiplies that subset with some weights. The number of selected rows is much smaller than the total number of rows.
I have tried to optimize it by preallocating everything, but with array slicing the multiplication is not very performant (a view indexed by a vector cannot be passed to BLAS, so mul! falls back to a generic method):

using LinearAlgebra, BenchmarkTools

dem_tpartact = zeros(6000, 100)
dem_weights = rand(100, 100)
dem_data = rand(1_000_000, 100)
tmp_data = zeros(6000, 100)
dem_rp = floor.(Int, rand(6000) .* 1_000_000 .+ 1) # selection of random row indices

function permute_view!(tpartact::Array, weights::Array, data::Array, tmp::Array, rp::Vector{Int64})
    # Multiply the selected rows (as a view) directly with the weights
    @views mul!(tpartact, data[rp, :], weights)
end

@btime permute_view!($dem_tpartact, $dem_weights, $dem_data, $tmp_data, $dem_rp);
42.953 ms (3 allocations: 30.75 KiB)

So I added a step where I copy the selected rows to a preallocated temporary array and multiply that.

function permute_view2!(tpartact::Array, weights::Array, data::Array, tmp::Array, rp::Vector{Int64})
    # Copy the selected rows into the preallocated buffer, then multiply
    for (i, r) in enumerate(rp)
        @views tmp[i, :] .= data[r, :]
    end
    mul!(tpartact, tmp, weights)
end

@btime permute_view2!($dem_tpartact, $dem_weights, $dem_data, $tmp_data, $dem_rp);
5.170 ms (0 allocations: 0 bytes)

This works pretty nicely, but it is still the main bottleneck of my code.
I compared the timings to a Python version that I adapted; without this step the Julia code runs 5-20x faster, but with it included, the Julia version is slightly slower.

Is there anything more to be done here to make it more performant? Or maybe some other strategy that would be more efficient?

Edit:
For context, here is the analogous Python code with timings:

import numpy as np

data = np.random.rand(1000000, 100)
weights = np.random.rand(100, 100)

%%timeit
rp = np.floor(np.random.uniform(0, 1, 6000) * (1000000 - 1))
tpartact = np.dot(data[rp.astype(int), :], weights).T

5.16 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I’d check for the effect of row-major (NumPy) vs. column-major (Julia) memory layout for matrices.


Yeah, selecting columns rather than rows will be a lot faster.


In other words, transpose the matrix and you're good to go.
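For example, a minimal sketch of that variant (the pre-transposed dem_dataT and the function name permute_view3! are mine, not from your post):

dem_dataT = permutedims(dem_data) # 100 × 1_000_000: each sample is now a contiguous column

function permute_view3!(tpartact::Array, weights::Array, dataT::Array, tmp::Array, rp::Vector{Int64})
    for (i, r) in enumerate(rp)
        @views tmp[i, :] .= dataT[:, r] # contiguous reads from the columns of dataT
    end
    mul!(tpartact, tmp, weights)
end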

You can directly sample indices, like this:

rand(1:1_000_000, 6000)

There’s no need to go through floats with rounding. In fact, the most correct form would be

rand(axes(dem_data, 1), 6000)

Good catch :slight_smile: I was so focused on the typical “iterate column-wise” advice that it did not occur to me that it makes more sense to copy columns.
Transposing the data matrix and reading columns does give it a nice boost:

 2.510 ms (0 allocations: 0 bytes)

What is interesting, though, is that also transposing the tmp array seems to give no benefit, since it needs to be transposed back for the multiplication.
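A sketch of what I tried (the tmp_dataT buffer and the name permute_view4! are just for illustration):

tmp_dataT = zeros(100, 6000) # samples as columns, matching the transposed data

function permute_view4!(tpartact::Array, weights::Array, dataT::Array, tmpT::Array, rp::Vector{Int64})
    for (i, r) in enumerate(rp)
        @views tmpT[:, i] .= dataT[:, r] # both sides of the copy are now contiguous
    end
    # ...but the samples need to come back as rows for the product, so the transpose reappears:
    mul!(tpartact, transpose(tmpT), weights)
end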
On a separate note, using @spawn-ed threads on the original version gave me a similar boost. Combined with the transpose it shaves off an additional 0.5 ms, but maybe that is overkill by now.
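For reference, a minimal sketch of the threaded copy (assuming Julia is started with multiple threads, e.g. julia -t 4; I show Threads.@threads here instead of manual @spawn):

function permute_view_threaded!(tpartact::Array, weights::Array, dataT::Array, tmp::Array, rp::Vector{Int64})
    Threads.@threads for i in eachindex(rp)
        @views tmp[i, :] .= dataT[:, rp[i]] # each task copies a disjoint set of rows
    end
    mul!(tpartact, tmp, weights)
end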

Right, thanks! I did want to change it later; this is just a direct translation from the Python code (which in turn was copied from MATLAB code, which does explain how it looks a bit :wink:).
The axes variant does look nice, I will see how it performs.
