EDIT: New Ampere Nvidia GPUs do have Tensor Cores capable of double precision, but see my follow-up post(s) on it, and on why you still do not want to be limited to double. [Trivia: the current top supercomputer on the updated Nov. 2020 TOP500 list is now 3x faster than the 2nd-place Summit, and over 5x faster on another benchmark, an AI benchmark using mixed precision (reaching 2 exaflops), using only ARM chips; mixed precision is clearly the way to go on lots of different hardware, not just Nvidia's.]
Very likely the Tensor Cores in Nvidia chips do not have double precision (and neither do Google's TPUs, which I believe are similar), but the absence may not be important, as I wouldn't rule out mixed 16-/32-/64-bit computation, as is done in some cases with the help of the non-Tensor cores.
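To illustrate the mixed-precision idea (a toy sketch of the concept on the CPU, not of how the hardware actually does it): store the data in low precision, accumulate in high precision.

```julia
# Toy illustration of mixed precision: data stored in Float16,
# accumulated in Float64. With a pure Float16 accumulator the running
# sum quickly becomes too coarse to absorb new terms.
x = rand(Float16, 10_000)

lo = foldl(+, x)        # accumulate in Float16: large rounding error
hi = sum(Float64, x)    # widen each term, accumulate in Float64

println(abs(Float64(lo) - hi) / hi)   # relative error of the Float16 sum
```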
In theory some future chips with Tensor Cores could have double-precision capability, but the trend is in the other direction: smaller data types, more cores, and more efficient use of memory, with just-released Nvidia GPUs now closer to 6000 CUDA cores than to 5000 (and one just released with 80 GB of memory).
You may not need double precision in the Tensor Cores; here a CPU reduction using double precision is compared to an Nvidia GPU whose Tensor Cores do not have that capability (the rest of the chip has it, and it seems to me the CUDA cores are used for the double-precision part):
The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena. […] One important non-Machine Learning computational pattern is the arithmetic reduction [35], which is one of the most used patterns in science and technology, i.e., it is the discrete integration tool for modelling many scientific phenomena, from n-body/Monte Carlo simulations [2], [27], cellular automata [36] to map-reduce workloads [34] and ray tracing [12], among many others.
In terms of numerical error, in the normal distribution [μ = 0, σ² = 1] test (bottom left) all variants present less than 1% of numerical error with respect to the CPU reduction, once the input size is n ≥ 10 × 10⁶ numbers.
Today, the Nvidia Volta GPU Tesla V100, Quadro V100 and Titan V all include around 640 tensor cores, and they can offer up to 120 TFLOPS in mixed FP16-FP32 precision. In comparison, the traditional CUDA cores, which are 5120 in total for the GPUs recently mentioned, offer up to ∼ 15 TFLOPS of performance in FP32 precision and around ∼ 7 TFLOPS in FP64 precision.
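Those throughput figures are consistent with a back-of-the-envelope check (the assumptions here are mine, not from the paper: one 4 × 4 × 4 MMA per tensor core per cycle, and a ~1.5 GHz boost clock):

```julia
# Rough sanity check of the quoted ~120 TFLOPS mixed-precision figure.
# Assumed: each tensor core completes one 4x4x4 MMA per cycle,
# i.e. 64 fused multiply-adds = 128 flops, at a ~1.5 GHz boost clock.
tensor_cores    = 640
flops_per_cycle = 4 * 4 * 4 * 2   # 64 FMAs, 2 flops each
clock           = 1.5e9           # Hz
println(tensor_cores * flops_per_cycle * clock / 1e12, " TFLOPS")  # ≈ 123
```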
As of January 2020, GPUs contain up to 640 tensor cores that can work in parallel. Each tensor core is a hardware-implemented function that performs a matrix multiply accumulate (MMA) operation of 4 × 4 matrices in one GPU clock cycle.
adapting any arbitrary algorithm to a tensor core scheme is not a trivial task, as tensor cores are different from regular GPU cores. While GPU cores are capable of executing a whole instruction set (i.e., the instructions used in a regular CUDA/OpenCL program), tensor cores are capable of executing one operation but significantly faster; a matrix multiply accumulate (MMA) over 4 × 4 matrices, in one GPU clock cycle.
The three variants are compared regarding two aspects: (1) speedup over a classic warp-shuffle reduction (does not use tensor cores, just regular CUDA FP32 cores) and (2) numerical error with respect to a CPU reduction using double precision. The tests were run on a TESLA V100 GPU, and additional performance results using a TITAN RTX can be found in Appendix B. Note: the fastest variant found in this subsection is then compared with Nvidia’s CUB library in Section 6. Figure 7 shows the speedup of all variants with respect to a warp-shuffle reduction as well as their numerical error with respect to a CPU reduction in FP64 mode.
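For context, the "classic warp-shuffle reduction" used as the baseline above looks roughly like this in Julia with CUDA.jl (a minimal single-warp sketch of mine, not the paper's code):

```julia
using CUDA

# Minimal single-warp shuffle reduction (the non-tensor-core baseline).
# Each of the 32 lanes holds one value; shfl_down_sync folds the warp
# in half log2(32) = 5 times until lane 1 holds the total.
function warp_reduce_kernel(x, out)
    val = x[threadIdx().x]
    offset = 16
    while offset > 0
        val += shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    threadIdx().x == 1 && (out[1] = val)
    return
end

x = CUDA.rand(Float32, 32)
out = CuArray{Float32}(undef, 1)
@cuda threads=32 warp_reduce_kernel(x, out)
@assert Array(out)[1] ≈ sum(Array(x))
```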
The tensor core programming model exposes a single operation to the programmer, the matrix-multiply-accumulate (MMA). That is, given three matrices A, B, C, the MMA operation computes
D = A × B + C
in one GPU cycle. The tensor core computing model allows many MMA operations to occur simultaneously in parallel. It is interesting to note that in the programming model the tensor core MMA operation is exposed in terms of m × n × k and allows the definition of matrices of size 16 × 16 to the programmer, even when the actual operation at hardware level is carried in terms of 4 × 4 matrices. The process of splitting the 16 × 16 workload into smaller 4 × 4 works is done automatically by the GPU scheduler, but splitting a large problem of size n into several 16 × 16 matrices is not automatic and must be designed manually.
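CUDA.jl exposes this same model in Julia through its WMMA submodule. A minimal sketch of one 16 × 16 × 16 MMA (FP16 inputs, FP32 accumulate), closely following the documented API (the column-major layouts are my choice):

```julia
using CUDA

# One 16x16x16 MMA, D = A*B + C, on the tensor cores via CUDA.jl's WMMA API.
# A and B are Float16; C and D accumulate in Float32 (mixed precision).
function mma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

a = CUDA.rand(Float16, 16, 16)
b = CUDA.rand(Float16, 16, 16)
c = CUDA.rand(Float32, 16, 16)
d = similar(c)
@cuda threads=32 mma_kernel(a, b, c, d)   # one full warp drives one MMA
```

Tiling a large problem of size n into these 16 × 16 fragments is the part the paper says you must design manually around this primitive.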
Julia does have the double-double trick (via e.g. the DoubleFloats.jl package): two doubles used as a pair to increase precision (or it could be two singles), but it seems impossible to use it with the Tensor Cores (not that you would ever use only them), and maybe not with the rest of the GPU in general either.
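A minimal sketch of that trick with DoubleFloats.jl (on the CPU; as said, this does not map onto the Tensor Cores):

```julia
using DoubleFloats

# Double64 stores a value as an unevaluated pair of Float64s (hi + lo),
# giving roughly twice the significand bits of a plain Float64.
x = Double64(1) / 3
y = sqrt(Double64(2))

println(x)               # ~31-32 significant decimal digits
println(x + y)
println(Float64(x + y))  # round back down to a plain Float64
```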