Is there a faster way to do a matrix multiplication on a random subset of a matrix?

mkoculak · April 16, 2023, 2:40pm

I have an algorithm that spends almost 2/3 of its time in a function that draws a subset of a matrix and then multiplies that with some weights. Number of selected rows << N rows.
I have tried to optimize it by preallocating everything, but with array slicing the multiplication is not very performant:

dem_tpartact = zeros(6000, 100)
dem_weights = rand(100, 100)
dem_data = rand(1_000_000, 100)
tmp_data = zeros(6000, 100)
dem_rp = floor.(Int, rand(6000) .* 1_000_000 .+ 1) #Selection of random indices

function permute_view!(tpartact::Array, weights::Array, data::Array, tmp::Array, rp::Vector{Int64})
    @views mul!(tpartact, data[rp, :], weights)
end

@btime permute_view!($dem_tpartact, $dem_weights, $dem_data, $tmp_data, $dem_rp);
42.953 ms (3 allocations: 30.75 KiB)

So I added a step where I allocate the selected rows to a temporary array an multiply that.

function permute_view2!(tpartact::Array, weights::Array, data::Array, tmp::Array, rp::Vector{Int64})
    ii = 1:length(rp)
    for idx in eachindex(rp)
        @views tmp[ii[idx], :] .= data[rp[idx], :]
    end
    @views mul!(tpartact, tmp, weights)
end

@btime permute_view2!($dem_tpartact, $dem_weights, $dem_data, $tmp_data, $dem_rp);
5.170 ms (0 allocations: 0 bytes)

This works pretty nice, but still is the main bottleneck of my code.
I compare the timings to a python version that I adapted and without this step the code runs 5-20x faster, but with it included, Julia version is slightly slower.

Is there anything more to be done here to make it more performant? Or maybe some other strategy that would be more efficient?

Edit:
For context, here is the analogous python code with timings:

data = np.random.rand(1000000, 100)
weights = np.random.rand(100, 100)

%%timeit
rp = np.floor(np.random.uniform(0, 1, 6000) * (1000000 - 1))
tpartact = np.dot(data[rp.astype(int), :], weights).T

5.16 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

goerch · April 16, 2023, 4:18pm

I’d check for the effect of row major (numpy) vs. column major (Julia) memory layout for matrices?

Oscar_Smith · April 16, 2023, 4:45pm

yeah. selecting columns rather than rows will be a lot faster.

gdalle · April 16, 2023, 6:09pm

In other words, transpose the matrix and you’re good to go

DNF · April 16, 2023, 6:14pm

You can directly sample indices, like this:

rand(1:1_000_000, 6000)

There’s no need to go through floats with rounding. In fact, the most correct would be

rand(axes(dem_data, 1), 6000)

mkoculak · April 16, 2023, 9:03pm

Good catch I was so focused on the typical “iterate column-wise” that it did not occurred to me it makes more sense to copy columns.
Transposing the data matrix and reading columns does give it a nice boost:

 2.510 ms (0 allocations: 0 bytes)

but what is interesting, transposing the tmp array seems to not give any benefit, since it needs to be transposed back for the multiplication.
On a separate note, also using @spawned threads on the original version gave me a similar boost. With the transpose it shaves additional 0.5ms, but maybe now it is an overkill.

Right, thanks! I did want to change it later, this is just a direct translation from the python code (which in turn copied it from matlab code, which does explain the how it looks a bit ).
The axes variant does look nice, will see how it performs.

Topic		Replies	Views
Batched matrix-multiplication optimization Performance performance , linearalgebra	15	310	May 30, 2025
Speeding up matrix multiplication with a subset of matrix's rows General Usage linearalgebra	12	956	July 24, 2019
Speeding up operations on large arrays Performance	6	506	December 21, 2022
Better performance to blockwise matrix multiplication Performance memory-allocation	2	1448	May 3, 2018
PermutedDimsArray slower than permutedims? New to Julia performance , blas	3	1604	September 10, 2020

Is there a faster way to do a matrix multiplication on a random subset of a matrix?

Related topics