Hello. I was wondering whether someone could explain why the workspace size for an SVD (gesvdj, to be precise) differs wildly between a call to cusolverDnDgesvdj_bufferSize via the Julia interface and a call via CUDA directly.
When trying to compute an economical SVD via Julia’s CUDA interface, I have difficulty understanding how an SVD of a 1,000,000 × 2 matrix (of Float64) could require almost 1 GB of workspace. The call to bufferSize() in dense.jl results in a request of
1040136960 bytes
(if my println in dense.jl is to be trusted). Outside Julia, when calling the same function with the same input (and Int32 arguments) via the CUDA interface directly, the requested workspace amounts to
16000256 bytes.
In my user code, I call Julia’s svd! function, which turns out to call CUDA’s gesvdj; I mention this only to make clear that I did not force the call to any particular CUDA function, nor did I go through the by now apparently deprecated CUDA interface.
I’m on Julia v1.11.3, with CUDA installed via Pkg (CUDA runtime 12.6, artifact installation, CUDA driver 12.6, CUSOLVER 11.7.1, Julia CUDA package 5.6.1, CUDA_Driver_jll 0.10.4+0, CUDA_Runtime_jll 0.15.5+0).
Any insights would be greatly appreciated, as the target matrix sizes are more in the range of 10^7 to 10^8 rows and between 100 and 1000 columns.
Best regards, Frank
@FHulsemann I am sorry that nobody answered your question earlier, but I can answer it now.
With the new generic API (64-bit integers), CUSOLVER changed what the “buffer_size” functions return.
Before, it was the number of elements in a given array of type T, where T is Float32, Float64, ComplexF32, or ComplexF64.
That is what cusolverDnDgesvdj_bufferSize returns, with T = Float64 (legacy API).
With the new API, the buffer size is in bytes, because there is only one buffer-size function for all precisions (cusolverDnX…).
To more easily reuse buffers between all routines, we allocate buffers as `CuArray{UInt8}`, which works with both the old and new API and can easily be reused across precisions.
That is why you see a factor of 8 between the Julia and C versions: a Float64 requires 8 bytes, so the `CuArray{UInt8}` workspace must be 8 times longer than a `CuArray{Float64}` holding the same content.
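The conversion can be sketched as follows. This is a minimal illustration in Python, not the actual CUDA.jl code; the element sizes are the standard ones for the four legacy precisions, and 16000256 is the lwork value from the direct C call in the original post:

```python
# Bytes per element for the four precisions handled by the legacy
# cusolverDn<T>gesvdj_bufferSize functions.
ELEMENT_SIZE = {"Float32": 4, "Float64": 8, "ComplexF32": 8, "ComplexF64": 16}

def workspace_bytes(lwork_elements: int, precision: str) -> int:
    """Convert a legacy-API lwork (an element count of type T) into the
    byte count needed for a raw UInt8 workspace buffer."""
    return lwork_elements * ELEMENT_SIZE[precision]

# lwork reported by the direct C call in the original post (Float64):
print(workspace_bytes(16000256, "Float64"))  # 128002048 bytes
```

The byte count is what you would pass when sizing a type-agnostic `UInt8` buffer that can back workspaces of any precision.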
You can see here what we do:
Julia and C are consistent; the only difference is that Julia computes the number of bytes, while C returns the number of coefficients for a given precision.
Dear @amontoison, thank you very much for taking the time to explain. As you noticed, I stumbled across this question some time ago. The workspace demands limited me to small data sets on the GPU (10,000,000 rows and 15 columns in Float64 were already too much on a 4 GB card: ERROR: LoadError: Out of GPU memory trying to allocate 1.118 GiB), whereas on the CPU side the LAPACK routine dgesvd was happy with 80 MB of workspace for the same problem size. I concluded that tall, skinny matrices are not a good match for my GPU setup, which seems to be well known in GPU circles, given that there are publications describing how to avoid exactly this setting. Well, back to the drawing board.