Not good: AI is a problem, not a solution

PetrKryslUCSD · August 4, 2024, 1:22am

StevenSiew · August 4, 2024, 2:17am

So that’s the reason I don’t have AI programs on my PC as a hobby. It takes too many floating-point operations to “train” them. At this rate, you need the financial resources of Bruce Wayne to train AI on your PC.

photor · August 4, 2024, 10:40am

The point is, you just train it once, and then you can use it many times without training, which doesn’t consume much energy.

mcreel · August 4, 2024, 1:49pm

Training a model requires evaluating it many times on many different inputs. Using the model requires evaluating it one time with a given input. But, once it’s released to the world, a lot of people are submitting inputs. So, I would speculate that a lot of energy is being used. Probably, a search now uses more energy than a search several years ago.

mbaz · August 4, 2024, 2:58pm

Honest question: how many times are models trained before they are released?

PetrKryslUCSD · August 4, 2024, 4:24pm

StevenSiew · August 4, 2024, 5:35pm

Does that mean that if my CPU has multiple teraflops, I will be safe to train and use AI on my PC?

GunnarFarneback · August 4, 2024, 8:23pm

“Putting a query to an AI system typically requires roughly the square root of the amount of computing used to train it.”

I had never heard that claim before. Can someone point to a reference or give a hand-wavy or better rationale for why that should be the case?

mcreel · August 5, 2024, 12:55pm

ChatGPT doesn’t want to discuss the current model’s energy use, but it does tell us something about older models:

“Studies have estimated that training large models like GPT-3 could consume hundreds of megawatt-hours (MWh) of energy.”

“A typical inference (i.e., generating a response to a query) using a large model like GPT-3 might consume around 0.1 to 1 Wh of energy.”

So, a single query uses much less than the square root of the energy used to train the model. That is in accordance with my own experience with neural nets. A model can take on the order of a day to train on a GPU, but will evaluate an input in time on the order of seconds.
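To put rough numbers on this, here is a minimal sketch of the arithmetic. The 500 MWh training figure is only an assumed stand-in for "hundreds of MWh", and the per-query range is the one quoted above; none of these are measurements.

```python
# Rough arithmetic on the figures quoted above. TRAINING_MWH is an assumed
# placeholder for "hundreds of MWh", not a measured value.
TRAINING_MWH = 500.0
QUERY_WH_RANGE = (0.1, 1.0)  # per-query energy range quoted above, in Wh

WH_PER_MWH = 1_000_000.0


def break_even_queries(training_mwh: float, query_wh: float) -> float:
    """Number of queries whose total energy equals the training energy."""
    return training_mwh * WH_PER_MWH / query_wh


for query_wh in QUERY_WH_RANGE:
    n = break_even_queries(TRAINING_MWH, query_wh)
    print(f"{query_wh} Wh/query: queries match training energy after {n:.1e} queries")
```

Under these assumptions a single query is a tiny fraction of the training energy, but the total spent on queries catches up with training after somewhere between hundreds of millions and a few billion queries.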

Training these big models is certainly very expensive, and each query is certainly more expensive than a query to a model from 10 years ago. On the other hand, the results may be better, so perhaps the incentives to make many queries are less.

ianfiske · August 5, 2024, 1:16pm

The reference is clearly incorrect as stated. But I’m guessing that it meant to say that the computational load of a single evaluation of the loss function during training is the square of the cost of a simple evaluation of the model at query time. It’s just comparing the full autodiff gradient with the simple eval. So single eval vs. single loss with gradient (not looking at the full training, that’s crazy).

stevengj · August 5, 2024, 2:08pm

That sounds wrong to me. The cost of computing the gradient with backpropagation / reverse mode is only roughly double the cost of the forward evaluation. Counter-intuitively, the cost of a gradient doesn’t scale proportionally to the number of parameters.
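A minimal sketch of why the factor is a small constant, assuming a plain dense network and counting only multiply-adds (biases, activations, and memory traffic are ignored). The layer sizes are hypothetical and only illustrate the scaling.

```python
# Count multiply-adds for a dense network, ignoring biases and activations.
# The layer sizes below are hypothetical and only illustrate the scaling.

def forward_macs(layer_sizes):
    """Multiply-adds for one forward pass: one matrix-vector product per layer."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def backward_macs(layer_sizes):
    """Multiply-adds for one reverse-mode (backpropagation) pass.

    Each layer needs an outer product for its weight gradient and, for every
    layer except the first, a transposed matrix-vector product to pass the
    gradient down to the previous layer.
    """
    pairs = list(zip(layer_sizes, layer_sizes[1:]))
    weight_grads = sum(n_in * n_out for n_in, n_out in pairs)
    input_grads = sum(n_in * n_out for n_in, n_out in pairs[1:])
    return weight_grads + input_grads


for sizes in ([10, 20, 1], [1000, 2000, 2000, 1]):
    fwd, bwd = forward_macs(sizes), backward_macs(sizes)
    print(sizes, "backward/forward =", round(bwd / fwd, 2))
```

In this simplified count the backward pass never costs more than twice the forward pass, however many parameters the network has. Real autodiff systems have additional overheads, so treat this as an order-of-magnitude argument only.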

Tamas_Papp · August 6, 2024, 8:01am

Looking at the graph, I am wondering if there is a theoretical motivation for the break at 10 on the x axis.

If not, then it is fair game to fit whatever kind of curve on these points. Visually the break is not apparent, and — purely from a statistical standpoint — arbitrary regression discontinuity is notoriously unreliable for prediction and inference.

alemelis · August 7, 2024, 3:27pm

The break is there because of AlexNet in 2012, which is usually considered the beginning of deep learning.

prbzrg · August 9, 2024, 9:25pm

This reminds me of a paper that I read years ago:

Tamas_Papp · August 10, 2024, 7:54am

Interesting proposal, but since electricity, hardware, and the associated labor costs are so large for “Red AI”, I imagine that if there were low-hanging fruit for decreasing any of them while still keeping models useful (“Green AI”), the economic incentives would have encouraged exploring it already.

I am not an AI expert, but I think that the paradigm is entering the area where the returns are drastically diminishing for learning outcome / training data. This may not be how “intelligence” works in people.

Humans, and even some clever animals, can learn complex tasks and concepts with drastically less data. Children can learn new words after hearing them 5–20 times (except for swear words, which they can learn after 0.37 examples on average), one can learn to ride a bike (an extremely complex combination of motions, reflexes, and coordination) in less than 100 hours total, a dog can learn a new trick just by observing another dog once, etc.

ufechner7 · August 10, 2024, 10:01am

And if you are using AI now, please consider making a donation for the victims of climate change: Solidarity

roflmaostc · August 10, 2024, 11:11am

Just to put it in perspective:
A 7-hour airplane flight consumes about 500 MWh in total.
If the numbers are correct, the energy of 100 flights can train a GPT-4 model.

Not saying we shouldn’t be more efficient in training it but compared to real world activities such as mobility or construction, computers are still very efficient.

You can even imagine training the model in a location where the heat can be reused (which is already done).

aplavin · August 10, 2024, 11:32am

Not sure what exactly you are comparing here…

If a model trains in a day and evaluates in a second, then inference takes much more than the square root of computations!
Even if we are talking in terms of cycles and not elementary operations, there are ~10^9 cycles per second and ~10^14 cycles per day. And 10^9 is much more than sqrt(10^14).
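The same comparison as a short sketch, assuming round numbers: a 1 GHz clock, one day of training, and one second per query.

```python
import math

# Assumed round numbers: a 1 GHz clock, one day of training, one second per query.
CYCLES_PER_SECOND = 1e9
SECONDS_PER_DAY = 24 * 60 * 60

training_cycles = CYCLES_PER_SECOND * SECONDS_PER_DAY
query_cycles = CYCLES_PER_SECOND * 1.0

sqrt_training = math.sqrt(training_cycles)
print(f"training cycles:        {training_cycles:.2e}")
print(f"sqrt(training cycles):  {sqrt_training:.2e}")
print(f"query cycles:           {query_cycles:.2e}")
print(f"query / sqrt(training): {query_cycles / sqrt_training:.0f}")
```

With these numbers the one-second query comes out roughly a hundred times larger than the square root of the training cycle count.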

abraemer · August 10, 2024, 2:44pm

This whole square root does not make sense to me at all because it depends on the base units used. You cannot just take the square root of the “number” in front of the units. Example: Suppose you need 100 MWh to train the model, then what would this “square-root” cost of an inference be?

$\sqrt{10^2}\,\text{MWh} = 10\,\text{MWh}$?
$\sqrt{10^5}\,\text{kWh} \approx 320\,\text{kWh}$
$\sqrt{10^8}\,\text{Wh} = 10^4\,\text{Wh} = 10\,\text{kWh}$
Similarly with “operations” or “cycles”.
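The unit dependence can be checked with a short sketch. The helper `sqrt_cost_in_wh` is a hypothetical name introduced for illustration; it takes the square root of the number in front of a chosen unit and converts the result back to Wh so the three answers can be compared directly.

```python
import math

# Watt-hours per unit, used to express one fixed energy in different units.
WH_PER_UNIT = {"MWh": 1e6, "kWh": 1e3, "Wh": 1.0}


def sqrt_cost_in_wh(energy_wh: float, unit: str) -> float:
    """Square-root the number in front of `unit`, then convert back to Wh."""
    value_in_unit = energy_wh / WH_PER_UNIT[unit]
    return math.sqrt(value_in_unit) * WH_PER_UNIT[unit]


training_wh = 100 * WH_PER_UNIT["MWh"]  # the 100 MWh example from the post
for unit in WH_PER_UNIT:
    print(f"sqrt taken in {unit}: {sqrt_cost_in_wh(training_wh, unit):.3g} Wh")
```

The same 100 MWh gives three different "square-root" costs spanning three orders of magnitude, depending only on the unit chosen.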

PetrKryslUCSD · August 10, 2024, 3:40pm

Let us say one MFLOP has an energy cost of C. Then M * MFLOP * C is the training cost, and sqrt(M) * MFLOP * C is the query cost. As pointed out above, if the speed of the system serving the query is in the GFLOP/s range and it spends a few seconds on it, the square-root argument is not too outlandish, given that it took a similar system days to train.
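A sketch of this calculation, assuming a system that sustains 1 GFLOP/s both for one day of training and for serving queries. The function name `query_seconds` and both figures are assumptions chosen for illustration.

```python
import math

# Assumptions for illustration: a system sustaining 1 GFLOP/s (1000 MFLOP/s),
# used both for training (for one day) and for serving queries.
RATE_MFLOP_PER_S = 1000.0
TRAINING_SECONDS = 24 * 60 * 60


def query_seconds(training_seconds: float, rate_mflop_per_s: float) -> float:
    """Query time if a query costs sqrt(M) MFLOP, with M the training cost in MFLOP."""
    m = training_seconds * rate_mflop_per_s   # training cost M, in MFLOP
    return math.sqrt(m) / rate_mflop_per_s    # sqrt(M) MFLOP converted to seconds


print(f"{query_seconds(TRAINING_SECONDS, RATE_MFLOP_PER_S):.1f} s per query")
```

With these numbers the rule gives a query time of several seconds. The result holds only because the cost is counted in MFLOP; counting the same work in FLOP or GFLOP would give a different answer.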
