So that’s the reason I don’t have AI programs on my PC as a hobby. It takes too many floating-point operations to “train” them. At these rates, you need the financial resources of Bruce Wayne to train AI on your PC.
The point is, you only train it once; after that you can use it many times without retraining, and each use doesn’t consume much energy.
Training a model requires evaluating it many times on many different inputs. Using the model requires evaluating it once on a given input. But once it’s released to the world, a lot of people are submitting inputs, so I would speculate that a lot of energy is being used in aggregate. A search probably uses more energy now than a search several years ago.
Honest question: how many times are models trained before they are released?
Does that mean that if my CPU has multiple teraflops, it will be safe to train and use AI on my PC???
Putting a query to an AI system typically requires roughly the square root of the amount of computing used to train it.
I had never heard that claim before. Can someone point to a reference or give a hand-wavy or better rationale for why that should be the case?
ChatGPT doesn’t want to discuss the current model’s energy use, but it does tell us something about older models:
“Studies have estimated that training large models like GPT-3 could consume hundreds of megawatt-hours (MWh) of energy.”
“A typical inference (i.e., generating a response to a query) using a large model like GPT-3 might consume around 0.1 to 1 Wh of energy.”
So, a single query uses much less than the square root of the energy used to train the model. That is in accordance with my own experience with neural nets. A model can take on the order of a day to train on a GPU, but will evaluate an input in time on the order of seconds.
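Taking those quoted figures at face value, here is a rough sketch in Python. I use 300 MWh as a stand-in for “hundreds of MWh”, 0.3 Wh per query as the midpoint of the quoted range, and an assumed 100 million queries per day, which is my own guess rather than a figure from this thread:

```python
# Back-of-envelope; all numbers are the assumptions described above.
train_Wh = 300e6          # "hundreds of MWh" of training, expressed in Wh
query_Wh = 0.3            # per-query energy, midpoint of the 0.1-1 Wh range

# Naive "square root" reading, working in Wh (the choice of unit is itself a
# problem, as a later comment points out).
sqrt_reading = train_Wh ** 0.5
print(f"naive sqrt of training energy: ~{sqrt_reading:.0f} Wh")
print(f"one query: {query_Wh} Wh, about {sqrt_reading / query_Wh:.0f}x smaller")

# Aggregate use still adds up quickly under these assumptions.
queries_per_day = 100e6   # assumed, not from this thread
days_to_match = train_Wh / (query_Wh * queries_per_day)
print(f"days of inference to match the training budget: ~{days_to_match:.0f}")
```

So under these assumed numbers a single query is nowhere near the “square root” of the training energy, but aggregate inference can overtake the one-time training cost in a matter of days.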
Training these big models is certainly very expensive, and each query is certainly more expensive than a query to a model from 10 years ago. On the other hand, the results may be better, so perhaps the incentive to make many queries is weaker.
The reference is clearly incorrect as stated. But I’m guessing it meant that the computational cost of a single evaluation of the loss function during training is the square of the cost of a plain evaluation of the model at query time. It’s just comparing a full autodiff gradient pass with a plain forward eval: single eval vs. single loss-with-gradient (not the full training run; that would be crazy).
That sounds wrong to me. The cost of computing the gradient with backpropagation / reverse mode is only roughly double the cost of the forward evaluation. Counter-intuitively, the cost of the gradient doesn’t scale with the number of parameters the way it would with, say, finite differences (one extra forward pass per parameter).
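A minimal counting sketch of why that is, for a dense network (the layer sizes below are hypothetical, and biases/nonlinearities are ignored as lower-order terms):

```python
# Multiply-add counts for a dense layer y = W @ x, with W of shape (n_out, n_in):
#   forward:             W @ x              -> n_out * n_in
#   backward w.r.t. x:   W.T @ dL/dy        -> n_out * n_in
#   backward w.r.t. W:   outer(dL/dy, x)    -> n_out * n_in
# So the backward pass costs roughly 2x the forward pass, independent of depth
# and of the total parameter count (the "cheap gradient" property of reverse mode).

def layer_flops(n_in, n_out):
    fwd = n_in * n_out
    bwd = 2 * n_in * n_out
    return fwd, bwd

layers = [(784, 512), (512, 512), (512, 10)]  # hypothetical MLP shapes
fwd_total = sum(layer_flops(i, o)[0] for i, o in layers)
bwd_total = sum(layer_flops(i, o)[1] for i, o in layers)
print(f"backward / forward FLOP ratio: {bwd_total / fwd_total:.1f}")  # 2.0
```

Contrast that with finite differences, which would need one extra forward pass for every parameter.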
Looking at the graph, I am wondering if there is a theoretical motivation for the break at 10 on the x-axis.
If not, then it is fair game to fit whatever kind of curve on these points. Visually the break is not apparent, and — purely from a statistical standpoint — arbitrary regression discontinuity is notoriously unreliable for prediction and inference.
The break is there because of AlexNet in 2012, which is usually considered the beginning of deep learning.
This reminds me of a paper that I read years ago:
Interesting proposal, but since electricity, hardware, and the associated labor costs are so large for “Red AI”, I imagine that if there were low-hanging fruit for decreasing any of them while still keeping models useful (“Green AI”), the economic incentives would already have encouraged exploring it.
I am not an AI expert, but I think the paradigm is entering a regime of drastically diminishing returns in learning outcome per unit of training data. This may not be how “intelligence” works in people.
Humans, and even some clever animals, can learn complex tasks and concepts with drastically less data. Children can learn a new word after hearing it 5–20 times (except for swear words, which they can learn after 0.37 examples on average), one can learn to ride a bike (an extremely complex combination of motions, reflexes, and coordination) in less than 100 hours total, a dog can learn a new trick just by observing another dog once, etc.
And if you are using AI now, please consider making a donation for the victims of climate change: Solidarity
Just to put it in perspective:
A 7-hour airplane flight consumes about 500 MWh in total.
If the numbers are correct, the energy of 100 flights can train a GPT-4 model.
Not saying we shouldn’t be more efficient in training it, but compared to real-world activities such as mobility or construction, computers are still very efficient.
You can even imagine training the model in locations where the heat can be re-used (which is already done).
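For what it’s worth, the arithmetic behind that comparison is just

100 \times 500\,\text{MWh} = 50{,}000\,\text{MWh} = 50\,\text{GWh},

so the comparison amounts to assuming GPT-4’s training run consumed on the order of 50 GWh. That implied figure follows from the comment’s own numbers, not from a published measurement.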
Not sure what exactly you are comparing here…
If a model trains in a day and evaluates in a second, then inference takes much more than the square root of computations!
Even if we are talking in terms of cycles and not elementary operations, there are ~10^9 cycles per second and ~10^14 cycles per day, and sqrt(10^14) = 10^7, so 10^9 is much more than sqrt(10^14).
This whole square-root claim does not make sense to me at all, because it depends on the base units used. You cannot just take the square root of the “number” in front of the units. Example: suppose you need 100 MWh to train the model; then what would this “square-root” cost of an inference be?
- \sqrt{10^2}\,\text{MWh} = 10\,\text{MWh}?
- \sqrt{10^5}\,\text{kWh} \approx 320\,\text{kWh}
- \sqrt{10^8}\,\text{Wh} = 10^4\,\text{Wh} = 10\,\text{kWh}
Similarly with “operations” or “cycles”.
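Spelled out as a few lines of Python, using the 100 MWh example above:

```python
import math

# The same (assumed) training energy expressed in three different units.
E_MWh = 100.0
E_kWh = E_MWh * 1e3
E_Wh  = E_MWh * 1e6

# "Square-rooting the number in front of the unit" gives three different answers,
# so the rule is not well defined for a dimensionful quantity like energy.
print(math.sqrt(E_MWh), "MWh")  # 10.0 MWh
print(math.sqrt(E_kWh), "kWh")  # ~316 kWh
print(math.sqrt(E_Wh),  "Wh")   # 10000 Wh = 10 kWh
```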
Let us say the model takes M MFLOPs to train and one MFLOP has an energy cost of C. Then the training cost is M * C, and the claimed query cost is sqrt(M) * C, so the rule is at least dimensionally consistent when stated in operation counts. As pointed out above, if the system serving the query runs in the GFLOP/s range and spends a few seconds on it, while a similar system took days to train, the square-root argument is not too outlandish.
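Plugging the thread’s day-vs-seconds numbers into that operation-count framing (a sketch; the throughput and timings are the rough values quoted above, not measurements):

```python
# Dimensionless check of the square-root rule using operation counts.
ops_per_second = 1e9                    # assumed ~1 GFLOP/s throughput
train_ops = ops_per_second * 86_400     # ~8.6e13 ops for one day of training
query_ops = ops_per_second * 3          # ~3e9 ops for a few-second query
sqrt_train = train_ops ** 0.5           # ~9.3e6

print(f"train ops:           {train_ops:.1e}")
print(f"query ops:           {query_ops:.1e}")
print(f"sqrt(train ops):     {sqrt_train:.1e}")
print(f"query / sqrt(train): {query_ops / sqrt_train:.0f}x")
```

Under these assumptions a query costs a few hundred times the square root of the training operation count: loosely in the same ballpark, but clearly not a tight rule.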