EDIT (in the end, what you want is similar to this): I do see two non-default options. The first is CUBLAS_COMPUTE_32F_FAST_16F, documented as:
Allows the library to use Tensor Cores with automatic down-conversion and 16-bit half-precision compute for 32-bit input and output matrices.
The second is CUBLAS_COMPUTE_32F_FAST_16BF. That is interesting, but you are after something similar for double-to-single down-conversion (or double-single-half), which I do not see available; for FP64 there are only CUBLAS_COMPUTE_64F (the default) and CUBLAS_COMPUTE_64F_PEDANTIC.
This might change, since the idea is the same, and I think the new Ampere hardware could support such an extended library.
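For reference, here is a minimal sketch (my own, not from the docs) of how that FP32 "fast FP16" path is requested through cublasGemmEx. The helper name gemm_fp32_fast16f and the assumption of column-major, already-allocated device buffers are mine:

```cuda
// Minimal sketch, assuming cuBLAS >= 11 and device pointers dA, dB, dC that
// are already allocated and filled; m, n, k are the usual GEMM dimensions.
// A, B and C stay FP32 in memory; CUBLAS_COMPUTE_32F_FAST_16F allows the
// library to down-convert internally and use FP16 Tensor Core math.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_fp32_fast16f(cublasHandle_t handle,
                       int m, int n, int k,
                       const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_32F, m,   // A: FP32 in memory, lda = m (column-major)
                 dB, CUDA_R_32F, k,   // B: FP32 in memory, ldb = k
                 &beta,
                 dC, CUDA_R_32F, m,   // C: FP32 in memory, ldc = m
                 CUBLAS_COMPUTE_32F_FAST_16F,   // allow FP16 Tensor Core compute
                 CUBLAS_GEMM_DEFAULT);
}
```

The point is that the matrices stay FP32 in memory and only the internal compute is allowed to drop to FP16 on the Tensor Cores; there is no corresponding compute type that would do the same for FP64 buffers.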
From the cuBLAS documentation:
Starting with cuBLAS version 11.0.0, the library will automatically make use of Tensor Core capabilities wherever possible, unless they are explicitly disabled by selecting pedantic compute modes in cuBLAS (see cublasSetMathMode(), cublasMath_t).
It should be noted that the library will pick a Tensor Core enabled implementation wherever it determines that it would provide the best performance.
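If you want to opt out of (or back into) that automatic behaviour, the per-handle switch mentioned there is cublasSetMathMode(). A minimal sketch (the helper name configure_math_mode is my own):

```cuda
// Minimal sketch, assuming cuBLAS >= 11: Tensor Cores are used automatically
// by default; CUBLAS_PEDANTIC_MATH is how you opt out for a given handle.
#include <cublas_v2.h>

void configure_math_mode(cublasHandle_t handle, bool allow_tensor_cores)
{
    // CUBLAS_DEFAULT_MATH lets the library pick Tensor Core kernels when it
    // judges them fastest; CUBLAS_PEDANTIC_MATH forces the plain,
    // strictly precision-preserving code path.
    cublasSetMathMode(handle,
                      allow_tensor_cores ? CUBLAS_DEFAULT_MATH
                                         : CUBLAS_PEDANTIC_MATH);
}
```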
I googled FFT and Tensor Cores and found lots of results, e.g. the paper “Optimizing the Fast Fourier Transform Using Mixed-Precision on Tensor Core Hardware”. So this is possible, using the options above, for FP32 inputs, and the authors add: “This work paves the way for using tensor cores for high precision inputs.”
I first found this, about the split-step Fourier method:
The Fourier transforms of this algorithm can be computed relatively fast using the fast Fourier transform (FFT). The split-step Fourier method can therefore be much faster than typical finite difference methods.[5]
From the paper above:
Implementing the FFT on the graphics card is a relatively straightforward process simplified by utilizing the commonly used cuBLAS library API. The algorithm consists of 3 major arithmetic operations: splitting FP32 numbers into two FP16 numbers, transposing matrices, and multiplying matrices. Customized kernels are written for the splitting operation and the transpose operation. The matrix multiplication is computed using the CublasGemmEx and CublasGemmStridedBatch functions. Of the three operations, only the matrix multiplication operation utilizes the tensor core hardware.
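For illustration only (my own sketch, not the authors' kernel), the “splitting FP32 numbers into two FP16 numbers” step could look like the kernel below: each FP32 value x is stored as hi + lo, where hi is the FP16 rounding of x and lo is the FP16 rounding of the residual x - hi; the two halves then feed FP16 GEMMs whose partial products are recombined in FP32:

```cuda
// Hedged sketch of a split kernel (illustrative; kernel name and layout are
// my own assumptions). Each FP32 value is decomposed into a high FP16 part
// and a low FP16 part holding the rounding residual.
#include <cuda_fp16.h>

__global__ void split_fp32_to_2xfp16(const float* __restrict__ x,
                                     __half* __restrict__ hi,
                                     __half* __restrict__ lo,
                                     int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __half h = __float2half(x[i]);            // high half: FP16 rounding of x
        float residual = x[i] - __half2float(h);  // what the FP16 rounding lost
        hi[i] = h;
        lo[i] = __float2half(residual);           // low half: FP16 rounding of residual
    }
}
```

A launch would look like split_fp32_to_2xfp16<<<(n + 255) / 256, 256>>>(dX, dHi, dLo, n); as the quote says, only the subsequent GEMMs on the hi/lo halves actually run on the Tensor Cores.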