Computing QR decomposition many times in parallel

coviktor · January 26, 2022, 5:11pm

I’m new to GPU programming and don’t understand many details. Let’s say I have 2 matrices

A1 = CUDA.rand(1000,1000)
A2 = CUDA.rand(1000,1000)

and I would like to compute their QR decompositions on the GPU in parallel (using CUDA.qr). For example, the code below

Q1,R1 = CUDA.qr(A1)
Q2,R2 = CUDA.qr(A2)

does it sequentially. Is there an easy way to do this in parallel?

stillyslalom · January 26, 2022, 6:05pm

Unlike matrix multiplication, matrix factorizations aren’t easily parallelizable using GPU hardware, and CUDA.qr already exploits device parallelism. For a large number of small input matrices, you may see some benefit by moving to a batched factorization (available in CUDA.jl as CUBLAS.geqrf_batched), but that won’t yield any gains when you only have two large input arrays.
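For reference, a minimal sketch of the batched route mentioned above. It assumes that CUBLAS.geqrf_batched! takes a vector of equally sized CuMatrix values and returns the Householder scalars together with the factored matrices in LAPACK-style packed form (R in the upper triangle, reflectors below it); check this against your CUDA.jl version. The helper name batched_qr_factors and the 8x8 batch are made up for illustration.

```julia
using CUDA
using CUDA: CUBLAS
using LinearAlgebra

# Hypothetical helper: factorize a batch of equally sized matrices with a
# single batched cuBLAS call instead of looping over qr.
function batched_qr_factors(As::AbstractVector{<:CuMatrix})
    # Explicit per-matrix copies so the in-place routine leaves the inputs alone.
    work = [copy(A) for A in As]
    # Assumed return value: (tau, factors), one tau vector per matrix.
    tau, factors = CUBLAS.geqrf_batched!(work)
    return tau, factors
end

if CUDA.functional()
    As = [CUDA.rand(8, 8) for _ in 1:1000]   # many small Float32 matrices
    tau, factors = batched_qr_factors(As)

    # R sits in the upper triangle of each factored matrix.
    A1 = Array(As[1])
    R1 = triu(Array(factors[1]))

    # Sanity check that avoids forming Q: A'A == R'R for any QR factorization.
    @assert isapprox(R1' * R1, A1' * A1; rtol = 1e-3)
end
```

The sketch only extracts R. Recovering Q means applying the stored Householder reflectors, which is left out here.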

coviktor · January 27, 2022, 4:53pm

Thanks, batched factorization of 100 matrices of size 1000x1000 is faster than doing it sequentially in a for loop.
