I’m new to GPU programming and don’t understand many details. Let’s say I have 2 matrices

```
using CUDA

A1 = CUDA.rand(1000, 1000)
A2 = CUDA.rand(1000, 1000)
```

and I would like to compute their QR decompositions on GPU in parallel (using CUDA.qr). For example, the code below

```
Q1, R1 = CUDA.qr(A1)
Q2, R2 = CUDA.qr(A2)
```

does it sequentially. Is there an easy way to do this in parallel?

Unlike matrix multiplication, matrix factorizations aren’t easily parallelizable on GPU hardware, and `CUDA.qr` already exploits device parallelism. For a large number of small input matrices, you may see some benefit from moving to a batched factorization (available in CUDA.jl as `CUBLAS.geqrf_batched`), but that won’t yield any gains when you only have two large input arrays.
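For reference, a batched call might look like the sketch below. This assumes CUDA.jl's `CUBLAS.geqrf_batched!` wrapper; the exact return values (here taken to be the Householder τ scalars and the overwritten matrices) may differ between CUDA.jl versions, so check the docstring before relying on it:

```
using CUDA, LinearAlgebra

# A batch of many small matrices, stored as a Vector of CuMatrix.
As = [CUDA.rand(Float32, 64, 64) for _ in 1:100]

# One batched QR call instead of 100 sequential factorizations.
# Assumption: geqrf_batched! factors in place, following the LAPACK geqrf
# convention (R in the upper triangle, Householder vectors below the
# diagonal), and returns the τ scalars together with the matrices.
taus, As = CUBLAS.geqrf_batched!(As)

# R of the first matrix is its upper triangle; Q would have to be
# reconstructed from the Householder vectors if it is actually needed.
R1 = triu(As[1])
```

As the answer notes, for only two large matrices this route won't help: each `CUDA.qr` call already keeps the device busy on its own.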

Thanks! Batched factorization of 100 matrices of size 1000×1000 is indeed faster than factorizing them sequentially in a for loop.