GPU compute & high precision general questions

Yesterday, for my birthday, my wife took me to MicroCenter and gave me her blessing to build a replacement for my 2010 Mac Pro. Soon I had a cart loaded with a 16 core Ryzen 9, case, memory … and in a fit of irrational optimism, I tossed in a 1200 watt power supply and a motherboard that could easily accommodate 2 large graphics cards.
I’ve awakened with a MicroCenter hangover … I’ve never programmed a GPU before! What was I thinking? I can’t even buy a GPU because of the shortage! Assuming that one day GPUs return to the marketplace, like toilet paper did this year, I wonder if you might offer advice for someone interested in GPU compute…
— My computation is primarily FP64. AMD appears to offer more cores for that sort of computation over NVIDIA. However, CUDA seems like the more useful language - and the Julia CUDA tool appears better developed. Thoughts on OpenCL vs CUDA?
— I also need 128 bit or better precision. Is it naive to think I might use GPUs for such work?
— I’m also leaving OSX … I grabbed a copy of Windows 10 … but I recall(?) reading a thread suggesting Linux would be better in GPU applications…?

Most consumer-grade GPUs will have a limited number of Float64 compute units, and I wouldn’t be surprised if your Ryzen 9 outstrips the Float64 performance of any reasonably-priced GPU.

To my knowledge, there’s only been one effort to provide extended-precision support for GPUs, and it’s experimental and may be out of date: GitHub - lumianph/gpuprec: gpuprec: Extended-Precision Libraries on GPUs

If you’ve never done GPU computing before, Nvidia’s ecosystem has much better documentation and will be easier to pick up, but given your goals, I’m not sure that a consumer-grade GPU will do much for you.

Regarding OS choice, Linux almost always has the most robust developer tooling support, but is not the most ergonomic for day-to-day usability. I’ve been happy running Windows 10 with Windows Subsystem for Linux, which lets you easily switch between the two operating systems as needed.

Thank you. I agree about the Ryzen, that’s why I decided on a new machine rather than just an external GPU (and my dear Apple makes the Pro harder to run every year). I offer for your consideration the Radeon Pro VII being introduced in June with 6.5 TFLOPS in FP64 … I’m not sure how that translates to my application, but it sure seems promising. At $2K, it’s not exactly reasonably priced; however, the older Radeon VII, at $700 if one can be found, has FP64 of 3.5 TFLOPS. I’d like to find at least one of those…

Oh - and Windows Subsystem for Linux looks perfect!

You may be able to find a used Nvidia Tesla K80 for $300-400. It won’t deliver cutting-edge performance (~1.9 TFLOPS F64), but CUDA has better learning materials, and the cost is more reasonable for a component where you’re not yet sure whether it’ll be useful for your workloads.

FYI - K80s can be found on Amazon for $299, which is great, but apparently they have no love for non-workstation motherboards and have other cooling and compatibility issues. I’ll just get a basic NVIDIA that’s available for proof of concept.

Nvidia just announced that it will release “low hash rate” consumer GPUs that block crypto mining. So hopefully prices will drop soon and supply will improve.

I also need 128 bit or better precision. Is it naive to think I might use GPUs for such work?

Yes, I think so, despite claims like this one (which I think is misleading or a misunderstanding): https://www.nvidia.com/en-us/geforce/forums/off-topic/25/89829/128-bit-supercomputing-using-gpus/

Eventually, Nvidia’s research into 128-bit supercomputing will trickle back down to PCs in the form of real-time ray-tracing and 128-bit computing on the personal computer.

64-bit is clearly second-class for CUDA GPUs, and more so over time: 16-bit is 128/32 = 4 times faster than 64-bit at compute capability 7.0, but 256/2 = 128x faster at compute capability 8.6 (see table 3).

I.e. 16-bit floats have the best performance (256 units of throughput).
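To make the ratio arithmetic above explicit, here is a minimal Python sketch. The four numbers are simply the per-clock throughput figures quoted in this post (taken there from the programming guide's throughput table); the dictionary layout and the function name `fp16_over_fp64` are my own, purely for illustration.

```python
# Throughput figures as quoted in the post above (results per clock cycle
# per multiprocessor). Keys are compute capabilities, not CUDA toolkit versions.
throughput = {
    "7.0": {"fp16": 128, "fp64": 32},
    "8.6": {"fp16": 256, "fp64": 2},
}

def fp16_over_fp64(compute_capability):
    """How many times faster 16-bit is than 64-bit for one architecture."""
    row = throughput[compute_capability]
    return row["fp16"] / row["fp64"]

for cc in throughput:
    print(f"compute capability {cc}: FP16 is {fp16_over_fp64(cc):.0f}x FP64")
```

The gap grows because 64-bit throughput fell while 16-bit throughput rose, which is the "second-class" trend described above.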

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes.

16 bytes is 128-bit, but it probably refers to load/store pair of 64-bit, as in ARMv8.

Sub-byte WMMA operations provide a way to access the low-precision capabilities of Tensor Cores. They are considered a preview feature i.e. the data structures and APIs for them are subject to change and may not be compatible with future releases

No GPU or other hardware that I know of (except for a few CPUs) supports 128-bit floats (and no hardware supports 256-bit floats):

Quadruple-precision (128-bit) hardware implementation should not be confused with “128-bit FPUs” that implement SIMD instructions

The trend for GPUs and CPUs is to smaller:

faster AI inference for FP32, BFloat16 and INT8

I’ve misplaced their long documentation, but they actually defined 4-bit matrix multiply too (I believe a first for a CPU). They also added 128-bit integer operations (e.g. compare; also a first, I think, i.e. not just SIMD registers), and the instructions are now variable-length (the 32-bit length, plus now 64-bit instructions), so “RISC” no longer applies as well as it did before (or to any modern chip).

Intriguingly, 36-bit computers, at least Honeywell’s from the 1960s, supported more precision (and, if I recall, also larger integers, i.e. 72-bit) than e.g. ARM CPUs, and it seems even more than x86’s (by now less performant) 80-bit float format and all GPUs (Honeywell’s double add instruction takes twice the time of its single-precision add):

supported floating point in both 36-bit single-precision and 2 x 36-bit double precision, the exponent being stored separately, allowing up to 71 bits of precision (one bit being used for the sign).

I believe MultiFloats.jl should work on GPUs.

Thank you for your thoughtful response. I guess I’ll just have to rely on my CPU for the higher precision. As an aside about CPUs, I sure wish 80 bit precision were allowed in Julia.

Interesting… though : Some rounding error issue between CPU & GPU · Issue #23 · dzhang314/MultiFloats.jl · GitHub
Perhaps Mr. Zhang will hear my plea…

And thanks for the heads up about MultiFloats.jl… I’ve been happily using BigFloat, which is said to be 56x slower. Even if I get a fraction of this speed up, I’ll be thrilled!

Probably (since GPUs are also Turing-complete, all [Julia] code should work in theory?).

While claimed “fast”: GitHub - dzhang314/MultiFloats.jl: Fast extended-precision floating-point arithmetic for Julia

At 100-bit precision, MultiFloats.jl is roughly 40x faster than BigFloat and 2x faster than DoubleFloats.jl .

These claims are for CPUs, and I very much doubt you get the same speedup on GPUs. Besides, do Julia’s BigFloats (or any other implementation) work on GPUs? Seems not according to the closed issue link below (could only in theory work).
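For readers wondering why a package like this could run on a GPU at all: the general idea behind "multi-float" or "double-double" arithmetic is to represent one high-precision value as an unevaluated sum of ordinary hardware floats, using only plain additions and subtractions. The Python sketch below shows the classic error-free TwoSum step and a simple double-double add built on it. It is an illustration of the technique only, not MultiFloats.jl's actual code, and the names `two_sum` and `dd_add` are my own.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dd_add(x, y):
    """Add two double-double values, each a (hi, lo) pair of Float64s.

    This is the simple 'sloppy' variant; it is accurate when the inputs
    have the same sign, which is enough for an illustration.
    """
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize so that |lo| is tiny relative to hi

one = (1.0, 0.0)
tiny = (2.0 ** -80, 0.0)
hi, lo = dd_add(one, tiny)

# Plain Float64 loses the tiny term entirely; the (hi, lo) pair keeps it.
print(1.0 + 2.0 ** -80 == 1.0)
print(Fraction(hi) + Fraction(lo) == Fraction(1) + Fraction(2.0 ** -80))
```

Because every operation here is an ordinary floating-point add or subtract, nothing in the technique itself requires a CPU. Whether a given package gives identical rounding on a GPU is a separate question.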

I would look at all the competing options, possibly even posits, for GPUs, to see which package is fastest (and free of rounding issues) if you can’t get away with lower precision. See e.g. here:

Agreed. To date all of my work is CPU, and my week long runs would be much more fun if they were hours long. I’ll take your advice about the various options. It’s a very obvious and simple optimization that I just didn’t think to investigate. If I could do that a couple times, maybe the GPU effort would become moot.

Note, even with GPUs not having the 128-bit or higher data types, it’s interesting what’s already done:

Many technical and HPC applications require high precision computation with 32-bit (single float, or FP32) or 64-bit (double float, or FP64) floating point, and there are even GPU-accelerated applications that rely on even higher precision (128- or 256-bit floating point!). But there are many applications for which much lower precision arithmetic suffices.

Kerr black hole space-time. More specifically, this application involves a hyperbolic partial-differential-equation solver that uses high-order finite-differencing and quadruple (128-bit or ∼30 decimal digits) or octal (256-bit or ∼60 decimal digits) floating-point precision. Given the computational demands of this high-order and high-precision solver, in addition to the rather long evolutions required for these studies, we accelerate the solver using a many-core Nvidia graphics-processing-unit and obtain an order-of-magnitude speed-up over a high-end multi-core processor.

So it’s at least possible and faster (in some cases; we don’t know how much they tried to optimize the CPU code, or whether recent large CPUs could compete better). I don’t know what you’re doing, but if it is a “differential-equation solver”, I would look into reusing that one, or whatever is out there, or possibly the lower-level libraries they may build on. See what’s out there; it’s probably not already wrapped for Julia, though wrapping it should be possible. It’s likely faster than Julia code written and optimized for CPUs.

Thank you …looks like I have some reading to do. The math I do is pretty simple - adding sequential fractions raised to small powers. So, along the lines of,
1/(10^13+0)^0.83+ 1/(10^13+1)^0.83+1/(10^13+2)^0.83+… for real numbers and complex. Brute force math, not elegant differentials.
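As a small, hedged illustration of this kind of sum, the Python sketch below adds the first 1000 terms twice: once in ordinary Float64 (with `math.fsum` to remove accumulation error) and once with the standard-library `decimal` module at 40 significant digits, which is a little beyond the roughly 34 digits of 128-bit floats. The term count and the 40-digit setting are arbitrary choices for the example, and the function names are my own.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 40  # about 40 significant decimal digits

BASE = 10 ** 13
EXPONENT = Decimal("0.83")

def partial_sum_decimal(n_terms):
    """Sum of 1/(BASE + k)^0.83 for k = 0 .. n_terms - 1, in 40-digit decimal."""
    total = Decimal(0)
    for k in range(n_terms):
        total += Decimal(BASE + k) ** (-EXPONENT)
    return total

def partial_sum_float(n_terms):
    """The same partial sum in Float64; fsum avoids accumulation error."""
    return math.fsum(float(BASE + k) ** -0.83 for k in range(n_terms))

n = 1000
reference = partial_sum_decimal(n)
approx = partial_sum_float(n)
rel_err = abs(Decimal(approx) - reference) / reference
print(f"relative error of the Float64 sum: {rel_err:.3E}")
```

Even with exact accumulation, the Float64 result is limited by the rounding of each term (and of the constant 0.83 itself), which is one way to see how many digits the extra precision is actually buying for a given run.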

Sorry to necropost like this, but if MultiFloats.jl works well on GPUs as is, would it not make sense to copy-paste their Float64xN types to make Float32xN types that make up for the egregious crippling of FP64 performance in commercial GPUs? Of course, this won’t get past the lack of ECC (which AMD is surprisingly good about) whenever that may be necessary.

If you want more than 60 bits of precision, you still probably want to use Float64-based multifloats, since the cost of multiplication is quadratic in the number of elements.
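A back-of-the-envelope version of that argument, as a Python sketch. It assumes 24 significand bits per Float32 element and 53 per Float64 element, and a plain schoolbook model in which every element of one operand is multiplied by every element of the other. Real libraries skip some low-order products, so treat the counts as rough upper bounds; the function names are hypothetical.

```python
import math

# Significand width (including the implicit leading bit) of each element type.
SIGNIFICAND_BITS = {"float32": 24, "float64": 53}

def limbs_needed(target_bits, base):
    """Number of elements needed to hold at least target_bits of precision."""
    return math.ceil(target_bits / SIGNIFICAND_BITS[base])

def partial_products(n_limbs):
    """Schoolbook model: each element of one operand meets each of the other."""
    return n_limbs ** 2

for bits in (60, 100, 200):
    n32 = limbs_needed(bits, "float32")
    n64 = limbs_needed(bits, "float64")
    print(f"{bits} bits: float32 x{n32} ({partial_products(n32)} products), "
          f"float64 x{n64} ({partial_products(n64)} products)")
```

This only counts multiplications per high-precision multiply. How it nets out on a particular card also depends on that card's Float32-to-Float64 throughput ratio.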

Also, Float32 gives you a hard floor of about 2^-150, which makes them less than great when chained together.
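The floor can be seen directly by rounding values to 32-bit with Python's `struct` module. The second half of the sketch is a crude model of my own (not from any library): it assumes each further element of a chain sits about one significand width below the previous one, and counts how many elements fit above the smallest representable value.

```python
import struct

def to_float32(x):
    """Round a Python float (Float64) to Float32 and convert it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# The smallest positive (subnormal) Float32 is 2^-149; below that, values flush to zero.
print(to_float32(2.0 ** -149) == 2.0 ** -149)
print(to_float32(2.0 ** -151) == 0.0)

def max_limbs(lead_exponent, significand_bits, min_exponent):
    """Crude count of chained elements that stay at or above the underflow floor.

    Assumes element k has magnitude about 2^(lead_exponent - k * significand_bits).
    """
    n = 0
    while lead_exponent - n * significand_bits >= min_exponent:
        n += 1
    return n

# Leading element near 1.0 versus near 2^-60, for Float32 and Float64 chains.
for lead in (0, -60):
    f32 = max_limbs(lead, 24, -149)
    f64 = max_limbs(lead, 53, -1074)
    print(f"lead 2^{lead}: about {f32} float32 elements, about {f64} float64 elements")
```

Under this model a Float32 chain runs out of exponent range after a handful of elements, and sooner for small values, while a Float64 chain has far more room.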

Sure, but float64 FLOPS on any of Nvidia’s consumer or even workstation cards are 1/64 or 1/32x that of float32 operations. So, unless everyone who wants to do FP64 computing jumps over to Radeon Pro VII’s or forks over $5-10K for a Tesla card, access to a simple single-single GPU library could help people who can’t afford supercomputer hardware.