EDIT (in the end, what you want is similar to this): I do see two non-default options. The first is CUBLAS_COMPUTE_32F_FAST_16F, documented as:
Allows the library to use Tensor Cores with automatic down-conversion and 16-bit half-precision compute for 32-bit input and output matrices.
The second is CUBLAS_COMPUTE_32F_FAST_16BF. That is interesting, but you are after something similar for double-to-single down-conversion (or double-single-half), which I do not see available; for FP64 there are only CUBLAS_COMPUTE_64F (the default) and CUBLAS_COMPUTE_64F_PEDANTIC.
This might change, since the idea is the same, and I think the new Ampere hardware could support such an extended library.
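For reference, here is a minimal sketch (my own, not from the docs) of how that FP32 "fast FP16" path is requested through cublasGemmEx. The helper name gemm_fp32_fast16f and the assumption of column-major, already-allocated device buffers are mine:

```cuda
// Minimal sketch, assuming cuBLAS >= 11 and device pointers dA, dB, dC that
// are already allocated and filled; m, n, k are the usual GEMM dimensions.
// A, B and C stay FP32 in memory; CUBLAS_COMPUTE_32F_FAST_16F allows the
// library to down-convert internally and use FP16 Tensor Core math.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_fp32_fast16f(cublasHandle_t handle,
                       int m, int n, int k,
                       const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_32F, m,   // A: FP32 in memory, lda = m (column-major)
                 dB, CUDA_R_32F, k,   // B: FP32 in memory, ldb = k
                 &beta,
                 dC, CUDA_R_32F, m,   // C: FP32 in memory, ldc = m
                 CUBLAS_COMPUTE_32F_FAST_16F,   // allow FP16 Tensor Core compute
                 CUBLAS_GEMM_DEFAULT);
}
```

The point is that the matrices stay FP32 in memory and only the internal compute is allowed to drop to FP16 on the Tensor Cores; there is no corresponding compute type that would do the same for FP64 buffers.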
From the cuBLAS documentation:
Starting with cuBLAS version 11.0.0, the library will automatically make use of Tensor Core capabilities wherever possible, unless they are explicitly disabled by selecting pedantic compute modes in cuBLAS (see cublasSetMathMode(), cublasMath_t).
It should be noted that the library will pick a Tensor Core enabled implementation wherever it determines that it would provide the best performance.
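If you want to opt out of (or back into) that automatic behaviour, the per-handle switch mentioned there is cublasSetMathMode(). A minimal sketch (the helper name configure_math_mode is my own):

```cuda
// Minimal sketch, assuming cuBLAS >= 11: Tensor Cores are used automatically
// by default; CUBLAS_PEDANTIC_MATH is how you opt out for a given handle.
#include <cublas_v2.h>

void configure_math_mode(cublasHandle_t handle, bool allow_tensor_cores)
{
    // CUBLAS_DEFAULT_MATH lets the library pick Tensor Core kernels when it
    // judges them fastest; CUBLAS_PEDANTIC_MATH forces the plain,
    // strictly precision-preserving code path.
    cublasSetMathMode(handle,
                      allow_tensor_cores ? CUBLAS_DEFAULT_MATH
                                         : CUBLAS_PEDANTIC_MATH);
}
```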
I googled FFT and Tensor Cores and found lots of results, e.g. the paper “Optimizing the Fast Fourier Transform Using Mixed-Precision on Tensor Core Hardware”. So this is possible, using the options above, for FP32 inputs, and the authors add: “This work paves the way for using tensor cores for high precision inputs.”
I first found this, about the split-step Fourier method:
The Fourier transforms of this algorithm can be computed relatively fast using the fast Fourier transform (FFT). The split-step Fourier method can therefore be much faster than typical finite difference methods.[5]
From the paper above:
Implementing the FFT on the graphics card is a relatively straightforward process simplified by utilizing the commonly used cuBLAS library API. The algorithm consists of 3 major arithmetic operations: splitting FP32 numbers into two FP16 numbers, transposing matrices, and multiplying matrices. Customized kernels are written for the splitting operation and the transpose operation. The matrix multiplication is computed using the CublasGemmEx and CublasGemmStridedBatch functions. Of the three operations, only the matrix multiplication operation utilizes the tensor core hardware.
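For illustration only (my own sketch, not the authors' kernel), the “splitting FP32 numbers into two FP16 numbers” step could look like the kernel below: each FP32 value x is stored as hi + lo, where hi is the FP16 rounding of x and lo is the FP16 rounding of the residual x - hi; the two halves then feed FP16 GEMMs whose partial products are recombined in FP32:

```cuda
// Hedged sketch of a split kernel (illustrative; kernel name and layout are
// my own assumptions). Each FP32 value is decomposed into a high FP16 part
// and a low FP16 part holding the rounding residual.
#include <cuda_fp16.h>

__global__ void split_fp32_to_2xfp16(const float* __restrict__ x,
                                     __half* __restrict__ hi,
                                     __half* __restrict__ lo,
                                     int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __half h = __float2half(x[i]);            // high half: FP16 rounding of x
        float residual = x[i] - __half2float(h);  // what the FP16 rounding lost
        hi[i] = h;
        lo[i] = __float2half(residual);           // low half: FP16 rounding of residual
    }
}
```

A launch would look like split_fp32_to_2xfp16<<<(n + 255) / 256, 256>>>(dX, dHi, dLo, n); as the quote says, only the subsequent GEMMs on the hi/lo halves actually run on the Tensor Cores.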