Computing QR decomposition many times in parallel

I’m new to GPU programming and don’t understand many details. Let’s say I have 2 matrices

A1 = CUDA.rand(1000,1000)
A2 = CUDA.rand(1000,1000)

and I would like to compute their QR decompositions on the GPU in parallel (using CUDA.qr). For example, the code below

Q1,R1 = CUDA.qr(A1)
Q2,R2 = CUDA.qr(A2)

does it sequentially. Is there an easy way to do this in parallel?

Unlike matrix multiplication, matrix factorizations aren’t easily parallelizable using GPU hardware, and CUDA.qr already exploits device parallelism. For a large number of small input matrices, you may see some benefit by moving to a batched factorization (available in CUDA.jl as CUBLAS.geqrf_batched), but that won’t yield any gains when you only have two large input arrays.
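For the many-small-matrices case, a rough sketch of the batched call might look like the following. The matrix size, the batch count, and the assumption that the call returns the Householder scalars followed by the factored matrices are mine, so check them against the CUDA.jl version you have installed.

```julia
using CUDA, LinearAlgebra

# Illustrative sizes only: 100 small Float32 matrices, stored as a
# Vector of CuMatrix, which is the layout the batched routine takes.
As = [CUDA.rand(Float32, 32, 32) for _ in 1:100]

# Assumption: the non-mutating variant copies the inputs and returns
# (tau, factored), where each factored matrix holds R in its upper
# triangle and the Householder reflectors below the diagonal.
tau, factored = CUDA.CUBLAS.geqrf_batched(As)

# View the R factor of the first matrix (a lazy wrapper, no copy).
R1 = UpperTriangular(factored[1])
```

Note that this gives the compact (geqrf-style) factorization rather than an explicit Q, so if you need Q itself you would still have to build it from the reflectors.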

Thanks, batched factorization for 100 matrices of size 1000x1000 is faster than doing it sequentially in a for loop.