So that's the reason I don't have AI programs on my PC as a hobby. It takes too many floating-point operations to "train" them. At this rate, you need the financial resources of Bruce Wayne to train AI on your PC.

The point is, you just train it once, and then you can use it many times without training, which doesn't consume much energy.

Training a model requires evaluating it many times on many different inputs. Using the model requires evaluating it once on a given input. But once it's released to the world, a lot of people are submitting inputs, so I would speculate that a lot of energy is being used. A search probably uses more energy now than a search did several years ago.

Honest question: how many times are models trained before they are released?

Does that mean that if my CPU has multiple teraflops, it will be safe to train and use AI on my PC???

Putting a query to an AI system typically requires roughly the square root of the amount of computing used to train it.

I had never heard that claim before. Can someone point to a reference or give a hand-wavy or better rationale for why that should be the case?

ChatGPT doesn't want to discuss the current model's energy use, but it does tell us something about older models:

"Studies have estimated that training large models like GPT-3 could consume hundreds of megawatt-hours (MWh) of energy."

"A typical inference (i.e., generating a response to a query) using a large model like GPT-3 might consume around 0.1 to 1 Wh of energy."

So, a single query uses much less than the square root of the energy used to train the model. That is in accordance with my own experience with neural nets. A model can take on the order of a day to train on a GPU, but will evaluate an input in time on the order of seconds.
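A quick back-of-envelope check of that comparison, using the illustrative "a day to train, a second to evaluate" figures from the paragraph above (these are rough assumptions, not measurements):

```python
import math

# rough assumptions from the comment above, in wall-clock time
train_seconds = 24 * 3600   # ~1 day of GPU training
infer_seconds = 1           # ~1 second per evaluation

# if an inference cost the square root of training (in these units),
# it would take about sqrt(86400) ~ 294 seconds worth of compute
sqrt_cost = math.sqrt(train_seconds)

print(f"training: {train_seconds} s, sqrt of that: {sqrt_cost:.0f} s, "
      f"actual inference: {infer_seconds} s")
```

Measured this way, a single query (~1 s) is indeed far below the square root (~294 s) of the training time.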

Training these big models is certainly very expensive, and each query is certainly more expensive than a query to a model from 10 years ago. On the other hand, the results may be better, so perhaps the incentives to make many queries are less.

The reference is clearly incorrect as stated. But I'm guessing that it meant to say that the computational load of a single evaluation of the loss function during training is the square of a single evaluation of the model at query time. It's just comparing the full autodiff gradient with a simple eval, i.e. single-eval vs. single-loss-with-gradient (not the full training run — that would be crazy).

That sounds wrong to me. The cost of computing the gradient with backpropagation / reverse mode is only roughly double the cost of the forward evaluation. Counter-intuitively, the cost of a gradient *doesn't* scale proportionally with the number of parameters.
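A tiny numerical sketch of that "cheap gradient" point, using a made-up one-layer function f(x) = c·tanh(Wx) (all names and sizes here are illustrative, not from the discussion above): the reverse pass below does the same order of work as the forward pass, however many entries W has.

```python
import numpy as np

def forward(W, c, x):
    # forward pass: one matrix-vector product, O(m*n) multiplies
    h = np.tanh(W @ x)
    return c @ h

def grad_W(W, c, x):
    # reverse pass: redo the forward intermediates, then one outer
    # product -- again O(m*n) multiplies, not O((m*n)^2)
    z = W @ x
    h = np.tanh(z)
    dz = c * (1 - h**2)        # df/dz, m values
    return np.outer(dz, x)     # df/dW, all m*n entries in O(m*n) work

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.normal(size=(m, n))
c = rng.normal(size=m)
x = rng.normal(size=n)

# sanity check against a finite difference in one entry of W
eps = 1e-6
Wp = W.copy()
Wp[1, 2] += eps
fd = (forward(Wp, c, x) - forward(W, c, x)) / eps
print(abs(grad_W(W, c, x)[1, 2] - fd) < 1e-4)   # True
```

The point of the sketch: one gradient costs a constant multiple of one forward evaluation, regardless of how many parameters W contains.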

Looking at the graph, I am wondering if there is a theoretical motivation for the break at 10 on the x axis.

If not, then it is fair game to fit whatever kind of curve on these points. Visually the break is not apparent, and â€” purely from a statistical standpoint â€” arbitrary regression discontinuity is notoriously unreliable for prediction and inference.

The break is there because of AlexNet in 2012, which is usually considered the beginning of deep learning.

This reminds me of a paper that I read years ago:

Interesting proposal, but since electricity, hardware, and the associated labor costs are so large for "Red AI", I imagine that if there were low-hanging fruit for decreasing any of them while still keeping models useful ("Green AI"), the economic incentives would have encouraged exploring them already.

I am not an AI expert, but I think the paradigm is entering a regime of drastically diminishing returns in learning outcome per unit of training data. This may not be how "intelligence" works in people.

Humans, and even some clever animals, can learn complex tasks and concepts with drastically less data. Children can learn a new word after hearing it 5–20 times (except for swear words, which they can learn after 0.37 examples on average), one can learn to ride a bike (an extremely complex combination of motions, reflexes, and coordination) in less than 100 hours total, a dog can learn a new trick just by observing another dog *once*, etc.

And if you are using AI now, please consider making a donation for the victims of climate change: Solidarity

Just to put it in perspective:

A 7-hour airplane flight consumes about 500 MWh in total.

If the numbers are correct, the energy of 100 such flights could train a GPT-4 model.

Not saying we shouldn't be more efficient in training it, but compared to real-world activities such as mobility or construction, computers are still very efficient.

You can even imagine training the model in a location where the heat can be re-used (which is already done).

Not sure what exactly you are comparing here…

If a model trains in a day and evaluates in a second, then inference takes much more than the square root of computations!

Even if we are talking in terms of cycles and not elementary operations, there are ~10^9 cycles per second and ~10^14 cycles per day. And 10^9 is much more than sqrt(10^14).
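The arithmetic in that comment checks out; here is the same estimate spelled out (the ~1 GHz clock rate is the commenter's assumption):

```python
import math

cycles_per_second = 10**9                     # ~1 GHz, as assumed above
cycles_per_day = cycles_per_second * 86_400   # ~8.6e13, i.e. ~10^14

# square root of a day's worth of training cycles: ~9.3e6, i.e. ~10^7
sqrt_training = math.sqrt(cycles_per_day)

# a one-second inference uses ~10^9 cycles -- about 100x the square root
print(cycles_per_second / sqrt_training)      # ~108
```

So counted in cycles, inference is roughly two orders of magnitude *above* the square root of training, which is the opposite of what you get when counting in seconds — a hint that the square-root claim is sensitive to the units chosen.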

This whole square root does not make sense to me at all, because it depends on the base units used. You cannot just take the square root of the "number" in front of the units. Example: suppose you need 100 MWh to train the model; then what would this "square-root" cost of an inference be?

- \sqrt{10^2}\ \text{MWh} = 10 MWh?
- \sqrt{10^5}\ \text{kWh} \approx 320 kWh?
- \sqrt{10^8}\ \text{Wh} = 10^4 Wh = 10 kWh?

Similarly with "operations" or "cycles".
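The three bullet points above can be checked in a few lines — taking the square root of the numeric value in different units gives genuinely different physical quantities:

```python
import math

training_wh = 100e6   # 100 MWh, expressed in Wh

sqrt_as_mwh = math.sqrt(training_wh / 1e6)   # sqrt(10^2) -> 10    (in MWh)
sqrt_as_kwh = math.sqrt(training_wh / 1e3)   # sqrt(10^5) -> ~316  (in kWh)
sqrt_as_wh  = math.sqrt(training_wh)         # sqrt(10^8) -> 10^4  (in Wh)

# converted back to a common unit (Wh), the three answers disagree wildly:
print(sqrt_as_mwh * 1e6, sqrt_as_kwh * 1e3, sqrt_as_wh)
# 10 MWh vs ~0.32 MWh vs 0.01 MWh -- not the same quantity at all
```

This is exactly why "the square root of the training cost" is only meaningful once you fix a unit of work (operations, cycles, joules) in which to count.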

Let us say an MFLOP has an energetic cost of C. Then training M MFLOPs costs M * C, and the claimed query cost is sqrt(M) * C. As pointed out above, if the system serving the query runs in the GFLOP/s range and spends a few seconds on it, the square-root argument is not too outlandish, given that it took a similar system days to train the model.
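That version of the argument can be made concrete (note that it only works because the work is counted in MFLOPs — a different unit would shift the answer, as the earlier comment points out; the hardware speed and training duration below are assumptions):

```python
import math

flops_per_second = 1e9                           # GFLOP/s-class machine
training_flops = flops_per_second * 3 * 86_400   # ~3 days of training

M = training_flops / 1e6           # training cost, counted in MFLOPs
query_flops = math.sqrt(M) * 1e6   # the claimed sqrt(M) MFLOPs per query

print(query_flops / flops_per_second)   # ~16 seconds per query
```

With MFLOPs as the unit, sqrt(M) works out to a query taking on the order of ten seconds on the same hardware, which is indeed in the right ballpark for the scenario described.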