Hello. I was wondering whether someone could explain why the workspace size for an SVD (gesvdj, to be precise) differs wildly between a call to cusolverDnDgesvdj_bufferSize via the Julia interface and a call via CUDA directly.
When trying to compute an economical SVD via Julia’s CUDA interface, I have difficulty understanding how an SVD of a 1,000,000 × 2 matrix (of Float64) could require almost 1 GB of workspace. The call to bufferSize() in dense.jl results in a request of
1040136960 bytes
(if my println in dense.jl is to be trusted). Outside Julia, when calling the same function with the same input (and Int32 arguments) via the CUDA interface directly, the requested workspace amounts to
16000256 bytes.
In my user code, I call Julia’s svd! function, which turns out to call CUDA’s gesvdj; I mention this only to make clear that I did not force the call to any particular CUDA function, nor did I go through the by now apparently deprecated CUDA interface.
I’m on Julia v1.11.3, with CUDA installed via Pkg (CUDA runtime 12.6, artifact installation, CUDA driver 12.6, CUSOLVER 11.7.1, Julia CUDA package 5.6.1, CUDA_Driver_jll 0.10.4+0, CUDA_Runtime_jll 0.15.5+0).
Any insights would be greatly appreciated, as the target matrix sizes are more in the range of 10^7 to 10^8 rows and between 100 and 1000 columns.
Best regards, Frank
@FHulsemann I am sorry that nobody answered your question earlier, but I can answer it now.
With the new generic API (64-bit integers), CUSOLVER changed what the “buffer_size” functions return.
Before, it was the number of elements in a given array of type T, where T is Float32, Float64, ComplexF32, or ComplexF64.
That is what cusolverDnDgesvdj_bufferSize returns, with T = Float64 (legacy API).
With the new API, the buffer size is in bytes, because there is only one buffer-size function for all precisions (cusolverDnX…).
To more easily reuse buffers between all routines, we allocate buffers as `CuArray{UInt8}`, which works with both the old and new API and can easily be reused across precisions.
That is why you see a factor of 8 between the Julia and C versions: a Float64 requires 8 bytes, so the `CuArray{UInt8}` workspace must be 8 times longer than a `CuArray{Float64}` holding the same content.
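The conversion can be sketched as follows. This is a minimal illustration in Python, not the actual CUDA.jl code; the element sizes are the standard ones for the four legacy precisions, and 16000256 is the lwork value from the direct C call in the original post:

```python
# Bytes per element for the four precisions handled by the legacy
# cusolverDn<T>gesvdj_bufferSize functions.
ELEMENT_SIZE = {"Float32": 4, "Float64": 8, "ComplexF32": 8, "ComplexF64": 16}

def workspace_bytes(lwork_elements: int, precision: str) -> int:
    """Convert a legacy-API lwork (an element count of type T) into the
    byte count needed for a raw UInt8 workspace buffer."""
    return lwork_elements * ELEMENT_SIZE[precision]

# lwork reported by the direct C call in the original post (Float64):
print(workspace_bytes(16000256, "Float64"))  # 128002048 bytes
```

The byte count is what you would pass when sizing a type-agnostic `UInt8` buffer that can back workspaces of any precision.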
You can see here what we do:
Julia and C are consistent; the only difference is that Julia computes the number of bytes, while C returns the number of coefficients for a given precision.
Dear @amontoison, thank you very much for taking the time to explain. As you noticed, I stumbled across this question some time ago. The workspace demands limited me to small data sets on the GPU (10,000,000 rows and 15 columns in Float64 were already too much on a 4 GB card: ERROR: LoadError: Out of GPU memory trying to allocate 1.118 GiB), whereas on the CPU side the LAPACK routine dgesvd was happy with 80 MB of workspace for the same problem size. I concluded that tall, skinny matrices are not a good match for my GPU setup, which seems to be well known in GPU circles, given that there are publications describing how to avoid exactly this setting. Well, back to the drawing board.