I’m looking at specifying a cluster for our IT guys. I’m not a GPU guru and am not up to date on all of the choices. I see there are a lot of options and some GPUs are more optimized for machine learning, etc.
I’d like to be able to accelerate machine learning operations, but I’d also like to be able to accelerate differential equation and ray tracing physics-based simulations. Do packages like DiffEqGPU, etc. have a preference on what type of Nvidia GPU would make a better impact?
Since this is an engineering asset meant for experimenting with lots of different approaches to high fidelity simulation, some of which may involve scientific model-based image generation, differential equation-based physics models, and neural network training with automatic differentiation, is there anything in particular I should be looking at?
GPUs are not so different from CPUs. A model number’s only real importance is to tell you what generation the card is (release date) and where it falls in the product stack. The three most important things to actually look at (my opinion):
number of cores
speed of cores
amount of memory (VRAM)
The more the better, but the budget gets exhausted pretty quickly. If possible, an alternative is to just rent cloud GPU compute time.
Indeed, for DiffEqGPU-style things, I’d look less at the specific core type and more at “will it fit what I want to put on it?”. What is the required amount of RAM and the number of concurrent solves you need? You can guesstimate that the solver needs roughly 10 vectors of cache for the non-stiff methods, a Jacobian for the stiff ones, and then vectors for every saved output. That gives a rough size per solve, which you then multiply by the number of concurrent solves. Does that fit into VRAM? Will you need to batch it? The fewer batches you need to run, the faster it will generally be. Newer chips will mostly just have more cores and VRAM; the speed of the cores doesn’t differ all that much in comparison, so this back-of-the-envelope calculation is rather decent.
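For what it’s worth, that back-of-the-envelope estimate can be written down as a small script. This is only a sketch of the rule of thumb above, not anything taken from DiffEqGPU itself: the function names, the dense-Jacobian assumption for stiff solves, and the 80% usable-VRAM margin are all my own assumptions for illustration.

```python
import math

def bytes_per_solve(n_states, n_saved, stiff=False,
                    bytes_per_value=8, cache_vectors=10):
    """Rough memory for one ODE solve (rule-of-thumb estimate only)."""
    values = cache_vectors * n_states      # ~10 work vectors of length n_states
    if stiff:
        values += n_states * n_states      # assumption: one dense Jacobian
    values += n_saved * n_states           # one state vector per saved output
    return values * bytes_per_value        # 8 bytes = Float64, 4 = Float32

def batches_needed(n_solves, per_solve_bytes, vram_bytes, usable_fraction=0.8):
    """Number of batches if each batch has to fit in the usable part of VRAM."""
    usable = vram_bytes * usable_fraction  # assumption: keep ~20% headroom
    per_batch = int(usable // per_solve_bytes)
    if per_batch < 1:
        raise ValueError("a single solve does not fit in the usable VRAM")
    return math.ceil(n_solves / per_batch)

if __name__ == "__main__":
    # Hypothetical workload: 10,000 stiff solves, 1,000 states, 10 saved points
    per_solve = bytes_per_solve(1000, 10, stiff=True)
    vram = 24 * 1024**3                    # a hypothetical 24 GiB card
    print("bytes per solve:", per_solve)
    print("batches needed: ", batches_needed(10_000, per_solve, vram))
```

Plugging in your own state sizes and the VRAM of the cards you are quoted gives a quick way to compare options by how many batches each would need.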
If your workload includes a lot of stiff problems or tight accuracy requirements, you may want the expensive server-class (Tesla) cards John mentioned for efficient double precision (DP). Workstation-class Quadro/RTX cards are about 16x slower for DP, but a better buy for most machine learning, ray-tracing, and image formation. The phrases “high fidelity” and “scientific model” in your query suggest you should avoid cheaper cards which don’t have error-correcting memory but are fine for a lot of neural-network projects.
Let me put in my 2 cents worth here. You are specifying a GPU cluster. This is where vendor presales folks come in. Seriously - this should be their bread and butter.
Contact specialist HPC companies in the region you are in and ask for their advice and proposals. Contact what are called Tier 1 vendors also and ask about their product lines.
You will soon find a “trusted advisor” who you like.
Disclaimer - I work for Dell and install GPU clusters.
Second of my 2 cents. GPUs consume a lot of power. Consider where you are going to physically host these servers. Again, ask presales what the expected power consumption at load is for each server. Does your location have enough electrical current available? Can you supply enough air cooling?
Quite often departments want to house kit like this in a comms cupboard or utility room at the end of the corridor. This is fine - but check out the electrical load you can draw and how much AC capacity is in that room.
If you don’t have capacity you will be looking at hosting in your campus data centre or a local hosting facility.
Disclaimer - Water cooled rear doors, Direct Liquid Cooling and immersion cooling are options in any new procurement with high power density.
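To make the power and cooling check concrete, here is a minimal sketch of the arithmetic. Every number in the example is a made-up placeholder (real per-server figures should come from presales), the function name is hypothetical, and it uses a simple single-phase I = P / V approximation that ignores power factor and PSU efficiency. The only fixed fact is the unit conversion of 1 W to roughly 3.412 BTU/h.

```python
def room_load(n_servers, gpus_per_server, gpu_watts,
              other_watts_per_server, supply_volts=230.0):
    """Estimate total power draw, current, and heat output at full load."""
    per_server = gpus_per_server * gpu_watts + other_watts_per_server
    total_watts = n_servers * per_server
    amps = total_watts / supply_volts    # single-phase approximation, I = P / V
    btu_per_hour = total_watts * 3.412   # nearly all power drawn ends up as heat
    return total_watts, amps, btu_per_hour

if __name__ == "__main__":
    # Hypothetical: 4 servers, 4 GPUs each at 300 W, 800 W for CPUs, fans, disks
    watts, amps, btu = room_load(4, 4, 300, 800)
    print(f"{watts} W, {amps:.1f} A at 230 V, {btu:.0f} BTU/h of cooling")
```

Compare the resulting current against what the room’s circuits are rated for, and the BTU/h figure against the AC capacity of the room.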
Indeed! I had missed that part in the OP and thought he was just asking about a single consumer card. For a compute cluster, there are plenty of additional things to consider beyond what I listed before. What CPUs will you pair these GPUs with? How will it all (including storage) be connected? You want to think a bit more holistically if it is a cluster. Probably best to start looking at mostly pre-built systems.
Well we have the skills here to find out…
Actually, I would imagine the hyperscaler cloud companies plus LLMs these days outdo HPC-style supercomputer clusters.