Stanford’s Alpaca lowered the training cost to under $800, then came Koala, and now UC Berkeley is improving on that with Vicuna. It cost $300 to train, and I believe you can also run it on your local machine. It’s a bit slow for me (maybe the server is just loaded). Alpaca only needed one GPU, and I assume the same goes for Vicuna; it could be faster if you run it on your local machine.
I haven’t yet tested how good it is at generating Julia code, but I was thinking: even if it isn’t good, couldn’t someone improve it locally to do that?
Wouldn’t it be a problem that Vicuna is created by fine-tuning the pre-trained LLaMA? The cost of training LLaMA is not included in those $300, and I would expect that to be a non-negligible sum. If LLaMA’s training set did not contain a sufficient volume of Julia source code, then the chances are small that we will get something useful.
That’s a good question. I’m not sure if anything can practically be done for Julia. As I understand it, at a high level, neural networks are trained in epochs, but in the end you get a fixed model that never learns anything more; that static model’s parameters (i.e. weights and biases) are then used as-is for inference. I suppose that goes for Transformer models too.
There’s a problem called catastrophic forgetting that applies if you try to train further. I’m not exactly sure why that doesn’t happen across individual epochs of training. I’m not up to speed on what fine-tuning is exactly; it seems to be adjusting the parameters further, typically on a new, smaller dataset rather than the original training data. But you would want to be able to add more training data while avoiding or minimising the problem I mentioned. Maybe that is already possible, and the problem isn’t too bad?
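To make my mental model concrete, here is a toy sketch (mine, not how any of these particular models are implemented): fine-tuning is just continuing gradient descent from the pre-trained parameters on new data, while inference uses those parameters unchanged.

```julia
# Toy illustration (my own sketch, not any particular framework's API):
# fine-tuning = continuing gradient descent from pre-trained parameters
# on new data; inference = using the parameters as-is.

# A "pre-trained model": a single linear layer y = W*x .+ b
W = [0.9 0.1; 0.2 0.8]       # pretend these came from pre-traininging
b = [0.05, -0.03]

predict(W, b, x) = W * x .+ b            # inference: parameters stay fixed

# One fine-tuning step on a new (x, y) pair, squared-error loss
function finetune_step!(W, b, x, y; lr = 0.01)
    err = predict(W, b, x) .- y          # gradient of 0.5*||W*x + b - y||^2 w.r.t. the output
    W .-= lr .* (err * x')               # d(loss)/dW = err * x'
    b .-= lr .* err                      # d(loss)/db = err
    return W, b
end

x_new, y_new = [1.0, 2.0], [0.5, 1.5]
finetune_step!(W, b, x_new, y_new)       # weights move; behaviour on old data can drift
```

That last comment is the catastrophic-forgetting worry in miniature: every update nudges the shared weights, so behaviour learned earlier can degrade unless something protects it, which is what the papers further down are about.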
Now, while Transformers (the attention mechanism) are the mainstream architecture (e.g. in ChatGPT and the models I mentioned), I just recently became aware of something better!
I suppose you would need to train the models I describe below from scratch. That could be very costly, but they have better properties, e.g. time complexity, so that could help.
I became aware of state-space models, and variants of them, that have better properties, but first I want to point to a brand-new paper from the same people, about RNNs getting even better again:
Recurrent Neural Networks (RNNs) offer fast inference on long sequences but are hard to optimize and slow to train. Deep state-space models (SSMs) have recently been shown to perform remarkably well on long sequence modeling tasks, and have the added benefits of fast parallelizable training and RNN-like fast inference. However, while SSMs are superficially similar to RNNs, there are important differences that make it unclear where their performance boost over RNNs comes from. In this paper, we show that careful design of deep RNNs using standard signal propagation arguments can recover the impressive performance of deep SSMs on long-range reasoning tasks, while also matching their training speed. […] Our results provide new insights on the origins of the impressive performance of deep SSMs, while also introducing an RNN block called the Linear Recurrent Unit that matches both their performance on the Long Range Arena benchmark and their computational efficiency.
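The core trick, as I understand it, is that the recurrence is kept linear, so per-token inference stays O(1) like an RNN while training can still be parallelised. A toy sketch (mine; the actual LRU uses a carefully parameterised diagonal, complex-valued recurrence with stability constraints, see the paper):

```julia
# Toy sketch of a linear recurrent block (my simplification of the idea).
using LinearAlgebra

struct LinearRecurrence
    A::Matrix{Float64}   # state transition -- kept linear, no tanh/relu here
    B::Matrix{Float64}   # input projection
    C::Matrix{Float64}   # output projection
end

# RNN-like inference: O(1) work per new token; the state summarises the past.
function step(m::LinearRecurrence, h, x)
    h_new = m.A * h .+ m.B * x
    return h_new, m.C * h_new
end

# Run a whole toy sequence; the cost per step is constant, whatever the length.
function run(m::LinearRecurrence, xs)
    h = zeros(size(m.A, 1))
    ys = Float64[]
    for x in xs
        h, y = step(m, h, x)
        append!(ys, y)
    end
    return ys
end

m = LinearRecurrence(Matrix(0.9I, 4, 4) .+ 0.01 .* randn(4, 4),
                     randn(4, 2), randn(1, 4))
ys = run(m, [randn(2) for _ in 1:8])     # 8-step toy input sequence
```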
State space models (SSMs) are strong general-purpose sequence models […]
We’ll use our understanding to build H3 (Hungry Hungry Hippos), our new SSM layer for language modeling. With H3, we can replace almost all the attention layers in GPT-style transformers while beating or matching quality. We’ve scaled H3 up to 2.7B-parameters, and are releasing weights and code today.
[…]
For the purposes of our blog post, we won’t go into the details of how state space models are defined. We’ll summarize just a few key points:
SSMs scale with O(N log N) in sequence length, instead of O(N^2) like attention – that makes them promising for long sequence modeling.
There’s no fixed context window, since SSMs admit a completely recurrent view.
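To make those two bullets concrete, here is a toy sketch (mine, not the H3 code): the same linear SSM can be run as a recurrence, where the state carries the whole past (hence no fixed context window), or as a long convolution with kernel K_k = C·A^(k−1)·B, and it is that convolution which can be computed with FFTs in O(N log N).

```julia
# Two views of one linear SSM: recurrence vs. convolution (my toy sketch).
using LinearAlgebra

# Recurrent view: the state h summarises *all* past inputs.
function run_recurrent(A, B, C, x)
    h = zeros(size(A, 1))
    y = similar(x)
    for t in eachindex(x)
        h = A * h + B * x[t]
        y[t] = (C * h)[1]
    end
    return y
end

# Convolutional view: y_t = sum_k K_k * x_{t-k+1} with kernel K_k = C*A^(k-1)*B.
# Done naively this is O(N^2); in practice the convolution is computed with
# FFTs, which is where the O(N log N) comes from.
kernel(A, B, C, N) = [(C * A^(k - 1) * B)[1] for k in 1:N]
function run_convolutional(A, B, C, x)
    K = kernel(A, B, C, length(x))
    return [sum(K[k] * x[t - k + 1] for k in 1:t) for t in eachindex(x)]
end

n, N = 4, 16
A = Matrix(0.8I, n, n) .+ 0.05 .* randn(n, n)
B, C, x = randn(n), randn(1, n), randn(N)
@assert run_recurrent(A, B, C, x) ≈ run_convolutional(A, B, C, x)
```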
This one is something totally different, but also seems of interest:
Thanks for the interesting references. I already read a few years ago that someone achieved performance similar to transformers with LSTMs, so I am keen to read these. Especially if there is something about training efficiency, because transformers are super nice when you batch, due to the lack of recursion.
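On the batching point: the nice thing about attention is that a whole sequence is processed with a few matrix multiplications, with no step-by-step dependence, whereas a plain RNN has to walk the sequence in order. A toy sketch of the contrast (mine):

```julia
# Why transformers batch so well over time vs. an RNN (toy contrast, mine).
using LinearAlgebra

d, N = 8, 32
X = randn(N, d)                          # N positions, d features

softmax(v) = (e = exp.(v .- maximum(v)); e ./ sum(e))

# Self-attention over the whole sequence at once: just matrix multiplications,
# no loop over time (the N×N score matrix is also where the O(N^2) comes from).
Wq, Wk, Wv = randn(d, d), randn(d, d), randn(d, d)
Q, K, V = X * Wq, X * Wk, X * Wv
scores = Q * K' ./ sqrt(d)                         # N×N
attn = vcat([softmax(scores[i, :])' for i in 1:N]...)
Y_attn = attn * V

# A plain RNN is inherently sequential: each step depends on the previous one.
function run_rnn(Wh, Wx, X)
    h = zeros(size(Wh, 1))
    H = similar(X)
    for t in 1:size(X, 1)
        h = tanh.(Wh * h .+ Wx * X[t, :])
        H[t, :] = h
    end
    return H
end
Y_rnn = run_rnn(randn(d, d), randn(d, d), X)
```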
I had only heard of catastrophic forgetting, not much about solutions, nor the (solved) “twin problem of catastrophic forgetting and remembering”, the 2021 Relevance Mapping Networks (RMNs), or the Optimal Overlap Hypothesis.
I doubt this October 2022 research from Google is incorporated into ChatGPT [Plus]/GPT-4 etc. (or older such research all from 2022?):
A Memory Transformer Network for Incremental Learning
One of the most successful existing methods has been the use of a memory of exemplars, which overcomes the issue of catastrophic forgetting by saving a subset of past data into a memory bank and utilizing it to prevent forgetting when training future tasks. In our paper, we propose to enhance the utilization of this memory bank: we not only use it as a source of additional training data like existing works […] We show that MTN achieves state-of-the-art performance on the challenging ImageNet-1k and Google-Landmarks-1k incremental learning benchmarks.
[…]
MTN is a light-weight transformer that makes a class prediction for a given query by directly modeling the relationship between this query and the feature representations of the exemplars in a memory bank. Since conditioning on the entire memory bank would be too computationally demanding, we choose to feed MTN only a reduced set of exemplars (selected with nearest neighbour search).
[…] MTN is inspired from recent memory transformer architectures in language modeling [39] and video recognition [36]
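The “reduced set of exemplars (selected with nearest neighbour search)” part is easy to picture: pick the k stored exemplars whose features are closest to the query and feed only those to the light-weight transformer. A toy sketch (mine, not the paper’s code):

```julia
# Toy sketch of exemplar selection via nearest-neighbour search in feature space.
using LinearAlgebra

d = 16                                    # feature dimension
memory_bank = [randn(d) for _ in 1:1000]  # stored exemplar features
query = randn(d)

# Pick the k exemplars closest to the query (Euclidean distance).
function nearest_exemplars(query, bank, k)
    dists = [norm(query - e) for e in bank]
    return bank[partialsortperm(dists, 1:k)]
end

selected = nearest_exemplars(query, memory_bank, 8)
# `selected`, rather than the whole bank, is what would be passed to the
# light-weight transformer together with the query.
```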
2021 research (not mentioning transformers):
Understanding Catastrophic Forgetting and Remembering in Continual Learning with Optimal Relevance Mapping
Catastrophic forgetting in neural networks is a significant problem for continual learning. A majority of the current methods replay previous data during training, which violates the constraints of
an ideal continual learning system. Additionally, current approaches that deal with forgetting ignore the problem of catastrophic remembering, i.e. the worsening ability to discriminate between data from different tasks. In our work, we introduce Relevance Mapping Networks (RMNs) which are inspired by the Optimal Overlap Hypothesis. The mappings reflect the relevance of the weights for the task at hand by assigning large weights to essential parameters. We show that RMNs learn an optimized representational overlap that overcomes the twin problem of catastrophic forgetting and remembering. Our approach achieves state-of-the-art performance across all common continual learning datasets, even significantly outperforming data replay methods while not violating the constraints for an ideal continual learning system. Moreover, RMNs retain the ability to detect data from new tasks in an unsupervised manner, thus proving their resilience against catastrophic remembering.
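As I read the abstract, the rough idea is a per-task relevance mapping over shared weights, so tasks only interfere where their mappings overlap. Purely my own toy picture of that (the paper’s actual formulation will differ):

```julia
# My own toy picture of per-task relevance mappings over shared weights
# (not the RMN paper's actual formulation -- just how I read the abstract).

W = randn(8, 8)                        # weights shared by all tasks

# Each task gets a relevance map: large for parameters essential to that
# task, (near-)zero for parameters it doesn't rely on. Random masks here.
relevance = Dict(
    :taskA => rand(8, 8) .> 0.7,
    :taskB => rand(8, 8) .> 0.7,
)

# Effective weights for a task = shared weights gated by its relevance map,
# so two tasks only interfere where their maps overlap.
effective_weights(task) = W .* relevance[task]

effective_weights(:taskA)
```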
Only first solved in 2017 (and then not for transformers, since it predates them):
Overcoming catastrophic forgetting in neural networks
In this work we propose a practical solution to train such models sequentially by protecting the weights important for previous tasks. This approach, inspired by synaptic consolidation in neuroscience, enables state of the art results on multiple reinforcement learning problems experienced sequentially.
[…]
Until now neural networks have not been capable of this and it has been widely thought that catastrophic forgetting is an inevitable feature of connectionist models. We show that it is possible to overcome this limitation and train networks that can maintain expertise on tasks that they have not experienced for a long time. Our approach remembers old tasks by selectively slowing down learning on the weights important for those tasks.
[…]
We demonstrate our approach is scalable and effective by solving a set of classification tasks based on a hand-written digit dataset and by learning several Atari 2600 games sequentially.
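The mechanism behind “selectively slowing down learning on the weights important for those tasks” is a quadratic penalty (elastic weight consolidation) that anchors each parameter to its old value in proportion to how important it was for the previous task; the paper estimates those importances from the Fisher information. A minimal sketch of the penalty:

```julia
# Sketch of the "slow down learning on important weights" idea: while
# training task B, add a quadratic penalty anchoring each parameter to its
# task-A value, weighted by its importance for task A.

# θ_A : parameters after training task A
# F   : per-parameter importance for task A (Fisher information in the paper)
# λ   : how strongly task A is protected
ewc_penalty(θ, θ_A, F; λ = 1_000.0) = (λ / 2) * sum(F .* (θ .- θ_A) .^ 2)

# Total loss while training task B would be:  loss_B(θ) + ewc_penalty(θ, θ_A, F)
# Important weights (large F) barely move; unimportant ones stay free to learn B.

θ_A = randn(10); F = rand(10); θ = θ_A .+ 0.1 .* randn(10)
ewc_penalty(θ, θ_A, F)
```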
One more model to try out for Julia is OpenAssistant. It is also based on LLaMA-30B; see the 12:59 timestamp, “so we’re gonna look into providing you diff weights”, since the base model’s license is problematic:
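“Diff weights” here presumably means the same scheme Vicuna uses: only the difference from the base LLaMA weights gets released, and you add it to your own copy of the base model locally. A minimal sketch of the idea (the names and structure are made up):

```julia
# What "diff weights" means, as I understand it (names/structure made up):
# only the difference from the base weights is distributed, and the
# fine-tuned model is reconstructed locally by adding it to the base.

base_weights = Dict("layer1.W" => randn(4, 4), "layer1.b" => randn(4))
diff_weights = Dict("layer1.W" => 0.01 .* randn(4, 4), "layer1.b" => 0.01 .* randn(4))

apply_diff(base, diff) = Dict(k => base[k] .+ diff[k] for k in keys(base))

finetuned_weights = apply_diff(base_weights, diff_weights)
# Only `diff_weights` is ever distributed; the base weights stay on your
# machine under their original license.
```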