Accelerate column-wise subtraction of a vector from a matrix

I am trying to accelerate the following operation:

  • I have a matrix X (n, p) and a vector v (p). For each j = 1,…,p, I want to compute the differences X[i, j] - v[j] for i = 1,…,n.

and this on CPU (not GPU), with a function that is not in-place, and on quite large matrices (see the reproducible example below).

(X and v contain numbers of the same type, but this type can vary from case to case: Float64, Float32, Int, etc.)

In other words, my objective is a fast function that returns a new matrix given by the column-wise subtraction of v from X.

I tested many variants and, at present, my two fastest versions are the following:

f1(X, v) = X .- v'

function f6_th(X, v)   # fastest one
    p = size(X, 2)
    zX = similar(X)
    Threads.@threads for j = 1:p
        zX[:, j] .= view(X, :, j) .- v[j]
    end
    zX
end

An example:

using Chairmarks
n = 10^6 ; p = 500
X = rand(n, p) 
v = rand(p)

@be f1($X, $v) samples = 3 evals = 1 seconds = 10 
Benchmark: 3 samples with 1 evaluation
        991.649 ms (3 allocs: 3.725 GiB, 13.86% gc time)
        1.010 s (3 allocs: 3.725 GiB, 13.98% gc time)
        1.261 s (3 allocs: 3.725 GiB, 31.36% gc time)

@be f6_th($X, $v) samples = 3 evals = 1 seconds = 10  
Benchmark: 3 samples with 1 evaluation
        410.142 ms (45 allocs: 3.725 GiB, 1.86% gc time)
        611.232 ms (45 allocs: 3.725 GiB, 32.62% gc time)
        611.792 ms (45 allocs: 3.725 GiB, 30.78% gc time)

[For info, I have in-place functions (e.g. f1!) that are faster than the above, but I am not looking for an in-place version.]

My configuration:

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 16 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

I would be grateful for any ideas.

If this is for a real application, you’d almost certainly be better off (a) combining this calculation with previous/subsequent calculations on the matrix, rather than optimizing this loop in isolation; and (b) not creating the matrix X .- v' at all, if possible (by, again, combining this calculation with the subsequent processing).
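For instance (a minimal sketch of point (b); the per-column squared-norm reduction and the name colnorms2 are purely hypothetical stand-ins for your real downstream step), the subtraction can be fused into whatever consumes each column, so that X .- v' is never materialized:

# Hypothetical downstream step: squared norms of the centered columns.
# Fusing the subtraction into the reduction avoids allocating X .- v' at all.
colnorms2(X, v) = [sum(x -> abs2(x - v[j]), view(X, :, j)) for j in axes(X, 2)]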


I expect you can’t do much better than these. A calculation this short is going to be memory-bound for matrices bigger than your cache, which is the case here. Note that multithreading couldn’t make this much faster than it already was, due to the memory limitations.

There isn’t much else you can do with all that data that is any faster. Even getting the data in the first place will take about this long (i.e., benchmarking copy takes virtually the same time for me). Are you sure this is a bottleneck in your code?
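For example, with the same Chairmarks setup as above, the comparison is simply:

@be copy($X) samples = 3 evals = 1 seconds = 10   # reads and writes the same 2 × 3.725 GiB as f1/f6_th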

The main ways to make this faster involve reducing the bandwidth requirement:

  1. Use a smaller element type like Float32 to shrink the memory use (see the sketch after this list).
  2. Restructure your calculation to perform the operation in-place (half the amount of memory accessed). However, note that copying the data and then doing the operation in place will probably be even worse (in total) than the out-of-place version.
  3. Interleave this calculation with some earlier/later work so that it has more to do on the same memory at once. Do you need to transform the entire matrix beforehand, or are you processing each column separately anyway? What works or doesn’t here will depend on exactly what calculation you’re doing.
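A minimal sketch of point 1, reusing the arrays from the benchmark above (halving the element size halves the memory traffic, so it should roughly halve the runtime):

X32 = Float32.(X); v32 = Float32.(v)
@be f1($X32, $v32) samples = 3 evals = 1 seconds = 10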

Thanks for your answer. Yes, this is for real applications. I think I understood what you meant (I had not thought about this before), but, at this stage, it is much more convenient for me to stay with a ‘loop in isolation’.

For info, I also built this in-place version (among others), which is lighter and faster:

f1!(X, v) = X .-= v'

zX = copy(X)
@time f1!(zX, v) ;
 0.280371 seconds

I actually use f1! in my applications, but I was also interested in accelerating a non-in-place version.

Thank you @mikmoore for all your inputs, with which I agree [no, it is not really a bottleneck, but I am trying to gain speed; about Float32, yes, I often do it this way; for your last point, yes, I need to transform the entire matrix beforehand].
Now I better see the irreducible limits.
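As a rough back-of-the-envelope check of those limits against the timings above (assuming X is read once and zX written once):

bytes = n * p * sizeof(Float64)   # ≈ 4.0e9 bytes ≈ 3.73 GiB for X
traffic = 2 * bytes               # read X once, write zX once
traffic / 0.410 / 1e9             # ≈ 19.5 GB/s at the 410 ms best time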

It may be more convenient, but it’s usually a waste of time optimizing a loop that does so little work (one subtraction per element!) in isolation, and there is little that can be gained.


@stevengj It may be more convenient, but it’s usually a waste of time optimizing a loop that does so little work (one subtraction per element!) in isolation, and there is little that can be gained.

OK. I understand the idea well and agree with the general point, but let us consider, as one example among others, the simple case where the matrix returned after the subtraction has to be transposed and multiplied by another matrix Y (n, q).

What I usually do is:

zX = f6_th(X, v)
res = zX' * Y   # or Y' * zX 

or

zX = copy(X)
f1!(zX, v)
res = zX' * Y 

In this example, I don’t see how I can combine the subtractions with the matrix multiplication without reprogramming the matrix multiplication from scratch (which would be much slower than the generic ‘*’).

In that case the cost of the subtraction is probably negligible compared to the cost of the multiplication, so why optimize it?

In particular, if you have an M×N matrix and you are subtracting a vector from each row or column (a rank-1 update), then the cost is O(MN). If you then transpose and multiply by an M×P matrix, the cost is O(MNP), which should swamp the cost of the subtraction unless P is tiny.
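Concretely, with the sizes above (n = 10^6, p = 500), counting one subtraction per element versus roughly 2npq flops for the product (a rough count that ignores memory traffic):

n, p = 10^6, 500
sub_ops = n * p              # 5.0e8 subtractions
mul_ops(q) = 2 * n * p * q   # ≈ 2npq flops for zX' * Y
mul_ops(200) / sub_ops       # 400.0: the product dominates when q = 200
mul_ops(10) / sub_ops        # 20.0: far less lopsided when q is small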

I think you can gain a lot, at least when q is small, by noting that the subtraction need not be applied in the first place: since R_{ij} = sum_k (X_{ki} - v_i) * Y_{kj} = sum_k X_{ki} * Y_{kj} - v_i * sum_k Y_{kj}, we get R = X' * Y - v * sum(Y, dims=1).

EDIT: I will compare to the simplest version from above:

using BenchmarkTools
using LinearAlgebra

n = 10^6; p = 500; q = 200;
X = rand(n, p) 
v = rand(p)
Y = rand(n, q)

g1(X, Y, v) = (X .- v')' * Y
g2(X, Y, v) = X' * Y - v * sum(Y, dims=1)

g1(X, Y, v) ≈ g2(X, Y, v) # true

@btime g1($X, $Y, $v); # 2.131 s (4 allocations: 3.73 GiB)
@btime g2($X, $Y, $v); # 856.397 ms (9 allocations: 2.29 MiB)

Thanks for your inputs. Yes, q is tiny in general, and then the subtraction is not so negligible:

q = 10; Y = rand(n, q)

@be f1($X, $v) samples = 3 evals = 1 seconds = 10 
Benchmark: 3 samples with 1 evaluation
        930.097 ms (3 allocs: 3.725 GiB, 20.69% gc time)
        931.005 ms (3 allocs: 3.725 GiB, 20.63% gc time)
        944.050 ms (3 allocs: 3.725 GiB, 21.27% gc time)

@be $X' * $Y samples = 3 evals = 1 seconds = 10
Benchmark: 3 samples with 1 evaluation
        201.744 ms (3 allocs: 39.141 KiB)
        208.619 ms (3 allocs: 39.141 KiB)
        219.143 ms (3 allocs: 39.141 KiB)

It looks very impressive (also regarding allocations), even more so with q = 10:

q = 10 ; Y = rand(n, q)
 
@be g1($X, $Y, $v) samples = 3 evals = 1 seconds = 10 
Benchmark: 3 samples with 1 evaluation
        1.138 s (6 allocs: 3.725 GiB, 17.87% gc time)
        1.138 s (6 allocs: 3.725 GiB, 16.15% gc time)
        1.168 s (6 allocs: 3.725 GiB, 19.48% gc time)

@be g2($X, $Y, $v) samples = 3 evals = 1 seconds = 10
Benchmark: 3 samples with 1 evaluation
        199.787 ms (12 allocs: 117.625 KiB)
        201.121 ms (12 allocs: 117.625 KiB)
        212.319 ms (12 allocs: 117.625 KiB)

I will think about how to adapt this in my applications. Many thanks.