I am using CUDA.jl to speed up double-precision simulations with large matrix computations and FFT (e.g. Split Step Fourier Method).

I heard about Tensor Cores on NVIDIA GPUs and have the following questions:

Tensor Core seems to target Machine Learning (in particular Neural Network) applications in speeding up 16-bit floating number calculations. Does that mean it has no use for double-precision simulations?

For a GPU with Tensor Cores that I don’t use, are those processing units on hardware still being used as CUDA Cores? Or are they just sitting idle?

Are there any Julia libraries that make use of Tensor Cores?

EDIT: New Ampere Nvidia GPUs do have Tensor Cores capable of double-precision but see my follow-up post(s) on it, and why you still do not want to be limited to double. [Trivia: The current top supercomputer, according to Nov. 2020 TOP500 was updated, is now 3x faster than 2nd, Summit, and over over 5x faster on other benchmark, an AI benchmark using mixed precision, clearly the way to go with lots of different hardware (not just Nvidia), thus 2 exaflops, using only ARM-chips.]

Very likely Tensor cores in Nvidia chips do not have double-precision (and neither Google’s TPU, that I believe are similar), and the absence may not be important as as I wouldn’t rule out mixed 16-, 32-, 64-bit computation as done in some cases with help of non-Tensor cores.

In theory some future chips with Tensor cores could have double-precision capability while the trend is in the other direction, smaller data types, rather more cores and more efficient use of memory, with just released Nvidia GPUs now close(er) to 6000 CUDA cores than to 5000 (and just release one with just released one with 80 GB of memory).

You may not need double-precision in the Tensor cores, here reduction with CPUs using double-precision is compared to Nvidia where the Tensor cores do not have that capability (but rest of the chip has it and the CUDA cores are used to, it seems to me for the double-precision capability):

The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena. […] One important non-Machine Learning computational pattern is the arithmetic reduction [35], which is one of the most used patterns in science and technology, i.e., it is the discrete integration tool for modelling many scientific phenomena, from n-body/Monte Carlo simulations [2], [27], cellular automata [36] to map-reduce workloads [34] and ray tracing [12], among many others.

In terms of numerical error, in the normal distribution [μ = 0, σ² = 1] test (bottom left) all variants present less than 1% of numerical error with respect to the CPU reduction, once the input size is n ≥ 10 × 10⁶ numbers.

Today, the Nvidia Volta GPU Tesla V100, Quadro V100 and Titan V all include around 640 tensor cores, and they can offer up to 120 TFLOPS in mixed FP16-FP32 precision. In comparison, the traditional CUDA cores, which are 5120 in total for the GPUs recently mentioned, offer up to ∼ 15 TFLOPS of performance in FP32 precision and around ∼ 7 TFLOPS in FP64 precision.

As of January 2020, GPUs contain up to 640 tensor cores that can work in parallel. Each tensor core is a hardware-implemented function that performs a matrix multiply accumulate (MMA) operation of 4 × 4 matrices in one GPU clock cycle

adapting any arbitrary algorithm to a tensor core scheme is not a trivial task, as tensor cores are different from regular GPU cores. While GPU cores are capable of executing a whole instruction set (i.e., the instructions used in a regular CUDA/OpenCL program), tensor cores are capable of executing one operation but significantly faster; a matrix multiply accumulate (MMA) over 4 × 4 matrices, in one GPU clock cycle.

The three variants are compared regarding two aspects: (1) speedup over a classic warp-shuffle reduction (does not use tensor cores, just regular CUDA FP32 cores) and (2) numerical error with respect to a CPU reduction using double precision. The tests were run on a TESLA V100 GPU, and additional performance results using a TITAN RTX can be found in Appendix B. Note: the fastest variant found in this subsection is then compared with Nvidia’s CUB library in Section 6. Figure 7 shows the speedup of all variants with respect to a warp-shuffle reduction as well as their numerical error with respect to a CPU reduction in FP64 mode.

The tensor core programming model exposes a single operation to the programmer, the matrix-multiply-accumulate (MMA). That is, given three matrices A, B, C , the MMA operation computes

D = A × B + C

In one GPU cycle. The tensor core computing model allows many MMA operations to occur simultaneously in parallel. It is interesting to note that in the programming model the tensor core MMA operation is exposed in terms of m × n × k and allows the definition of matrices of size 16 × 16 to the programmer, even when the actual operation at hardware level is carried in terms of 4 × 4 matrices. The process of splitting the 16 × 16 workload into smaller 4 × 4 works is done automatically by the GPU scheduler, but splitting a large problem of size n into several 16 × 16 matrices is not automatic and must be designed manually.

Julia does have:

that allows for a trick to use double numbers in pairs to increase precision, or it could be two single, but it seems impossible to use it with the Tensor cores (but you would never only use them), maybe not the rest of the GPU in general.

For GPUs as with CPUs, if you feed them machine code where one instruction, for one functional unit, is not used I would think it’s idle. E.g. for CPUs it could be for division or for square root. The floating-point unit is for many programs idle, why it’s sometimes shared between cores (e.g. in some AMD chips) to make more use of it.

However, for GPUs, binary machine code is not commonly distributed, rather CUDA assembly used, PTX, and as of Ampere has this one extra instruction for Tensor cores for double-precision multiply. It’s theoretically possible that some other PTX instructions or simply higher level code e.g. C++ CUDA could compile to this new instruction.

What make you think you’re not using the Tensor cores? I would assume Nvidia wants you to use them easily, I find it very likely it’s automatic, or at least through libraries, it seems it wouldn’t need not be either or.

Thank you for your answers! Especially about the MMA operation - I’ll do some more readings (homework) to understand more of your answers.

For the question whether Tensor Core is being used - my typical application involves: (1) Load CUDA.jl or ArrayFire.jl, (2) Convert some large CPU arrays into GPU device arrays defined by the library; (3) Perform library-defined linear algebra function and/or 1D FFT/IFFT on the GPU arrays, using the GPU; (4) Convert the result back to CPU arrays; (5) Rince and repeat. All calculations are done in double precision (64-bit) floating numbers.

Do you think the library (CUDA.jl or ArrayFire.jl) would make use of Tensor Cores in these calculations for me, automatically?

Ampere has double-precision tensor cores. Essentially there’s two ways of using them: explicitly, in kernels, or using libraries. The latter is supported, if you use CUBLAS and configure the CUDA math mode to allow conversions (or appropriately use mixed-precision inputs to e.g. mul!) you’ll be using tensor cores. For the former case, where you use the CUDA.WMMA submodule to explicitly target the tensor cores, we don’t support the Ampere-era double-precision ones yet.

So does this mean double-precision linear algebra calculations utilizing Tensor Cores on on NVIDIA Ampere GPUs are already available in CUDA.jl?

If so, do you mind expand a little? For example, how to configure CUDA.jl to utilize double-precision Tensor Cores (a pointer to a documentation page would do), any minimal version required, is FFT/IFFT supported?

[See in section B. 850,000 core chip, and other alternatives to Nvida.]

A.
If you use double-precision (and do not have Ampere) then no. Still, I would avoid using it (only), even when it’s supported by the hardware (given you’re not worried about losing precision), if you want speed, and you even want to avoid the traditional IEEE single-precision data type. Nvidia’s new alternative to it, TF32, is 8x faster, or 16x faster with the new sparsity option (not available for IEEE single). That substitution may however happen automatically (or as a configuration option). TF32 is up to 32x faster than double.

Even then, you get 19.5 TFLOPS max. (assuming not using it in a mixed-precision fashion) and while that’s a 2.5x speedup over from V100 to A100 (Ampere) and seems nice, you’re losing out on even larger 10x or 20x speedup for lower precision data types with Ampere from compared to V100, e.g. 156 TFLOPS for TF32 performance, or with the new sparsity option (not available for double), 312 TFLOPS for TF32. You’ll see below about one other mixed-precision library, and you opt into such libraries, but I doubt otherwise you would get double substituted for better types at the code level (you really shouldn’t want that).

By halving the size of the datatype, you would expect a doubling in performance, and that happens (even down to 1 bit) again from TF32 to FP16 and for BF16, so up to 624 TFLOPS for either type.

Note, this doubling doesn’t happen from FP64 vs. TF32 where the gap is 16x, while there is a doubling vs FP32.

The trend of doubling or scaling linerally with size (from TF32) goes further than to TF16, for INT8, INT4 types so up to 2496 TOPS or even 4992 TOPS for binary.

I’m quoting number from table 2, page 23, in chapter Third-Generation NVIDIA Tensor Core in NVIDIA A100 Tensor Core GPU Architecture whitepaper and see also next few pages:

The Tensor Core Accelerated Iterative Refinement Solver (TCAIRS) in cuSOLVER automates
usage of mixed precision for this application. […] The same technology used in that study tripled the Summit supercomputer’s performance on the HPL-AI benchmark.

cuSOLVER in CUDA 11.0 adds support for A100’s new tensor core formats including TF32.
Figure 10 and Figure 11 below show results of the TCAIRS solver on 37 tests from the
SuiteSparse Matrix collection, comparing convergence rate and performance for FP32, FP16
with input scaling, BF16, and TF32. These were compared to the performance of the reference
FP64 solver which leverages the FP64 Tensor Cores on the A100. In cases where the mixed-
precision solver automatically falls back to the FP64 solver due to slow or no convergence, the
number of iterations were recorded as negative, and speedup is less than one, as it included the
cost of the failed attempt.

As shown in Figure 10 and Figure 11, TF32 delivered the fastest and most robust results
compared to other Tensor Core modes. The number of iterations to converge was the lowest for
TF32 amongst the Tensor Core modes. While FP32 had one fallback case, TF32 had only two,
compared to three for FP16 with input scaling, and six for BF16 Tensor Core modes. The
geomean speedup over the FP64 solver was 2.0X for TF32 Tensor Cores

and:

The new Double Precision Matrix Multiply Add instruction on A100 replaces 8 DFMA instructions on V100, reducing instruction fetches, scheduling overhead, register reads, datapath power, and shared
memory read bandwidth.

Number of iterations taken by the TCAIRS solver compared for 37 different problems to converge to FP64 accuracy for different precisions.

Elsewhere in the Ampere 82-page paper (Ampere seems very impressive): with lesser-known features for DL/computer vison (not just for JPEG and video): “The GA100 Optical Flow Accelerator is a hardware module that supports both optical flow and stereo disparity estimation at high pixel rates.”

and:

Starting with CUDA 11.0, devices of compute capability 8.0 like A100 have the capability to
influence persistence of data in the L2 cache and set aside a portion of L2 cache for persistent
data accesses, allowing higher bandwidth and lower latency accesses to the global memory.
The capability to influence the persistence of data in the L2 cache allows A100 GPUs to use the
large 40MB L2 cache more efficiently. For example, recurrent weights in many LSTM networks
can be made persistent in L2 and re-used between GEMM operations. A100 allows L2 cache to
be set-aside for persistent accesses in 1/16 th increments (2.5 MB).

[I would like to see Julia software support that.]

and: “NVIDIA A100 GPU provides hardware-accelerated barriers in shared memory. These barriers
are made available in CUDA 11 in the form of ISO C++-conforming barrier objects.”

and: “Compression saves up to 4x DRAM read/write bandwidth, up to 4x L2 read bandwidth, and up to 2x L2 capacity.”

and: “40 MB L2 cache that is almost 7x larger than that of Tesla V100”

B.
Compare that to 18 GB on-chip RAM for Cerabas awesome AI chip that’s the size of your face, 400,000 cores.

Cerebras offers a competitive comparison with Nvidia’s latest GPU, the Ampere A100. “They went to 7nm, we’re still 54 times more cores and 450 times more on-chip memory, 5,000+ times more memory bandwidth, 20,000 times more fabric bandwidth” Feldman stated.

The company says its 400,000 AI-optimized cores (etched on one wafer-scale die) can do the AI training work of 100-1,000 GPUs. The 15 rack-unit CS-1 has a max power draw of 20 kW; pricing has not been publicly disclosed.

The WSE’s 18 Gigabytes of on-chip memory, 9 petabytes per second memory bandwidth, and 100 petabits/second fabric bandwidth drive its differentiation from other AI hardware products.

The 2ng gen chip is also mentioned there (and not on their home page) is down to 7 nm (soame size as you can’t go bigger when at wafer-scale), is 850,000 cores.

To be fair Nvidia has 40 GB of of-chip HBM2 memory, not just 40 MB on-chip cache, and 80 GB in latest version.

using Ubuntu, the first time(s) I see Ubuntu (and CentoOS, 3rd) on Green500/TOP500 list, and next after almost as power-efficient, Japanese chip/interconnect/vendor MN-Core, while much smaller in size.

The Japanese ARM-based supercomputer (fastest in the world, just got faster, still with no GPUs used):

More significantly, Fugaku increased its performance on the new mixed precision HPC-AI benchmark to 2.0 exaflops, besting its 1.4 exaflops mark recorded six months ago. These represents the first benchmark measurements above one exaflop for any precision on any type of hardware.

EDIT (in the end what you want is similar to): I do see these two non-default options CUBLAS_COMPUTE_32F_FAST_16F in

Allows the library to use Tensor Cores with automatic down-conversion and 16-bit half-precision compute for 32-bit input and output matrices.

and CUBLAS_COMPUTE_32F_FAST_16BF that is interesting, but you are after similar for double-single down-convertion (or double-single-half) that I do not see available. I only see CUBLAS_COMPUTE_64F (default) and CUBLAS_COMPUTE_64F_PEDANTIC.

This might change, since the idea is the same, and I think the new Apere hardware could support such a changed library.

Starting with cuBLAS version 11.0.0, the library will automatically make use of Tensor Core capabilities wherever possible, unless they are explicitly disabled by selecting pedantic compute modes in cuBLAS (see cublasSetMathMode(), cublasMath_t).

It should be noted that the library will pick a Tensor Core enabled implementation wherever it determines that it would provide the best performance.

I googled FFT and Tensor Cores and found lots of results, e.g. paper: “Optimizing the Fast Fourier Transform using MixedPrecision on Tensor Core Hardware”. Such is possible, using the above, for FP32 and “This work paves the way for using tensor cores for high precision inputs.”

I first found:

The Fourier transforms of this algorithm can be computed relatively fast using the fast Fourier transform (FFT). The split-step Fourier method can therefore be much faster than typical finite difference methods.[5]

from the paper above:

Implementing the FFT on the graphics card is a relatively
straightforward process simplified by utilizing the commonly
used cuBLAS library API.
The algorithm consists of 3 major arithmetic operations:
splitting FP32 numbers into two FP16 numbers, transposing
matrices, and multiplying matrices. Customized kernels are
written for the splitting operation and the transpose operation.
The matrix multiplication is computed using the CublasGem-
mEx and CublasGemmStridedBatch functions.
Of the three operations, only the matrix multiplication
operation utilizes the tensor core hardware.

As you say the Tensor Cores support double, but as I quote above cuBLAS supports double, but its docs only mention Tensor Cores for lower precision. The docs may not have been updated for Ampere (other docs say the hardware has the capability), or simply this library hasn’t been updated. Or maybe I overlooked something, so I read further.

The researchers then use a combination FP64, FP32, FP21 and FP16

and I thought F21 was a typo, but it isn’t, so is there a plan to support only the datatypes the hardware supports or also (there I suppose is done with bit-twiddling, if we had such Julia-only code, would CUDA.jl automatically support such a type?):

FP21 data type is our proposed floating-point number format, which is composed of 1-bit sign part, 8-bit exponent part, and 12-bit fraction part. FP21 has the advantage of the same dynamic range as FP32 and a better accuracy than FP16.

3 × FP21 numbers are stored into 1 × 64-bit element, so it is suitable for 3-D simulations.

I do see in CUDA.jl’s docs “Note that f32 is only valid for the matrix C.” for CUDA.WMMA.llvm_wmma_load, is that meant for:

Thanks for your detailed replies. I also saw news about Cerebras, the wafer-size super computing chip. It’s a very interesting idea and no doubt some customers would prefer that. Just not sure how widely this can be adopted without the gamers and cryptocoin miners driving down the cost and push the R&D forward, and how sustainable the yield is.