LLMs can now be trained (read: fine-tuned) in 30 minutes by regular users (“Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity (aka instruction-following).”), so I think the Julia community should look into that (the cost of training an LLM from scratch has come down to as little as $200,000, vs. $4+ million for ChatGPT). You can now fine-tune “33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU”, i.e. at state-of-the-art (openly released) quality: “We find that Guanaco 65B is the best-performing model after GPT-4, achieving 99.3% performance relative to ChatGPT”.
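Why a 33B model fits on a consumer GPU comes down to simple arithmetic on quantized weights. A rough sketch (my own back-of-the-envelope estimate, not from the quoted paper; it ignores activations, KV cache, and adapter/optimizer overhead, which add a few GB on top):

```python
# Rough VRAM estimate for loading a model with quantized weights.
# Assumes weight memory dominates; real usage is somewhat higher.
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 33B parameters at 4-bit ≈ 16.5 GB -> fits a 24 GB consumer GPU.
# 65B parameters at 4-bit ≈ 32.5 GB -> needs a 48 GB professional GPU.
print(weight_memory_gb(33, 4))  # 16.5
print(weight_memory_gb(65, 4))  # 32.5
```

The same model at 16-bit would need 66 GB and 130 GB respectively, which is why 4-bit quantization is the enabling trick here.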
Julia is on Google’s radar: their state-of-the-art PaLM 2 LLM (known to be smaller than the original PaLM) includes Julia in its training data, though as one of the “low-resource” programming languages. Julia still showed the largest gain in performance there, but its scores remain somewhat behind the leaders in their benchmark, Python and C++.
How could this be changed? Julia needs a larger fraction of the training data; I’m not sure it needs to be as large as Python’s (currently not possible anyway). The trend was toward ever-larger models until it was discovered that smaller can be better, and that is the trend now. With a smaller model you then also scale down the amount of training data you put in.
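The “scale down the data with the model” point has a well-known rule of thumb: Chinchilla-style compute-optimal training uses roughly 20 tokens per model parameter. A minimal sketch of that heuristic (the 20:1 ratio is the published rule of thumb; applying it to any particular Julia model size is my assumption):

```python
# Chinchilla-style heuristic: compute-optimal training uses roughly
# 20 tokens per parameter (Hoffmann et al., 2022). This only sizes
# the training set; it says nothing about data quality or mix.
def optimal_tokens_billion(n_params_billion: float, tokens_per_param: float = 20.0) -> float:
    return n_params_billion * tokens_per_param

# A 7B-parameter model would want ~140B training tokens.
print(optimal_tokens_billion(7))  # 140.0
```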
StarCoder is the best open-source LLM for programming languages (in general; maybe not for Julia?), and much smaller than ChatGPT. I’m not sure about its context window, or how much it matters for programming languages, but MPT-7B already has twice the context window of ChatGPT/GPT-4, the former leaders, and a large window seems helpful if you want to feed in large codebases, not just screenfuls of individual functions without context.
What I think is needed is a) to put in all (or as much as possible) of the available Julia code, plus its docs, and b) just one human language, English, at a minimum. Possibly something domain-specific too, like physics or biology, but excluding e.g. social science.
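One way to make such a corpus recipe concrete is sampling weights: rather than feeding sources in proportion to raw size, up-weight the Julia material so it isn’t drowned out by English text. A sketch with entirely illustrative numbers (the corpus sizes and weights below are assumptions, not measurements):

```python
# Hypothetical corpus mix for a Julia-focused LLM. Sizes in GB and
# up-weighting factors are illustrative placeholders only.
corpus_gb = {"julia_code": 20, "julia_docs": 5, "english_text": 100, "domain_text": 25}
upweight  = {"julia_code": 4.0, "julia_docs": 4.0, "english_text": 1.0, "domain_text": 1.0}

# Effective sampling mass = raw size x weight; normalize to a
# probability of drawing a training batch from each source.
effective = {k: corpus_gb[k] * upweight[k] for k in corpus_gb}
total = sum(effective.values())
mix = {k: round(v / total, 3) for k, v in effective.items()}
print(mix)  # julia_code ends up ~36% of batches despite being ~13% of raw GB
```

This is essentially what existing pretraining pipelines do with per-source sampling ratios; the open question for Julia is what the right weights are.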
Do we need children’s books too? It seems like a strange question, but currently (no longer? at least until recently) the training of large language models feeds in the corpus in essentially random order. And it’s not just me questioning that as a bad idea; curriculum learning is precisely about not doing so:
Curriculum learning (CL) is a training strategy that trains a machine learning model from easier data to harder data, which imitates the meaningful learning order in human curricula. As an easy-to-use plug-in, the CL strategy has demonstrated its power in improving the generalization capacity and convergence rate of various models in a wide range of scenarios such as computer vision and natural language processing etc.
This seemingly applies directly to programming languages, e.g. Julia, too. So what would be a “children’s book” for Julia? Maybe train first on some simple Julia learning examples (first without types?), some Jupyter notebooks (any suggestions?), and:
But before you train on code, since it may contain comments and docs, the model needs to learn at least some English (and just English; other languages would be about as bulky, and with 2+ languages you possibly get a 2x increase in training and inference time).
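The curriculum idea above can be sketched very simply: score each training sample with a difficulty proxy and sort ascending, so plain English comes before simple code and simple code before typed code. The proxy below (length plus a bonus for code-like tokens) and the samples are my own illustration, not a published scheme:

```python
# Curriculum-learning sketch: order samples from "easy" to "hard"
# with a crude difficulty proxy. Real curricula use better signals
# (perplexity under a small model, AST depth, etc.).
def difficulty(sample: str) -> float:
    # Count Julia-flavored code markers; each one bumps difficulty.
    code_markers = sum(sample.count(tok) for tok in ("::", "function", "end", "->"))
    return len(sample) + 10 * code_markers

samples = [
    "function f(x::Int) x^2 end",   # typed Julia: hardest here
    "The cat sat on the mat.",      # plain English
    "x = 1 + 1",                    # untyped one-liner
]
curriculum = sorted(samples, key=difficulty)
print(curriculum[0])  # x = 1 + 1
```

The point is only that ordering is cheap to add on top of an existing data pipeline; the hard part is choosing a proxy that actually tracks difficulty.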
Currently the English-language corpus used in most LLMs comes from all over the place: e.g. Wikipedia (for encyclopedic knowledge, but is it needed? Only some Wikibooks on general programming concepts and OS theory?), Reddit (most of it not relevant; possibly only the Julia category there) and books (Moby Dick likely not needed…). I’m not sure if they are trained on StackOverflow answers (is that allowed?), but it seems obvious to train on Julia Discourse. It’s the best language community I know of, and maybe larger than some others (or others are scattered across many places, mostly not accessible or used in training of LLMs?).
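Harvesting Discourse is technically easy, since Discourse exposes its content as JSON by appending `.json` to forum URLs. A minimal sketch (the payload below is a fabricated example of the real shape; licensing and the forum’s terms would need checking before scraping at scale, and the naive tag-stripping regex is just for illustration):

```python
import re

# Sketch of pulling training text from Julia Discourse via the
# standard Discourse JSON API (append .json to a topic URL).
BASE = "https://discourse.julialang.org"

def topic_json_url(topic_id: int) -> str:
    return f"{BASE}/t/{topic_id}.json"

def extract_posts(topic_payload: dict) -> list:
    """Strip HTML tags from each post's 'cooked' field, crudely."""
    posts = topic_payload["post_stream"]["posts"]
    return [re.sub(r"<[^>]+>", "", p["cooked"]).strip() for p in posts]

# Fabricated minimal payload mirroring the API's shape:
sample = {"post_stream": {"posts": [{"cooked": "<p>Use <code>@btime</code> to benchmark.</p>"}]}}
print(topic_json_url(42))      # https://discourse.julialang.org/t/42.json
print(extract_posts(sample))   # ['Use @btime to benchmark.']
```

Question/answer threads like these are arguably closer to instruction-tuning data than raw code is, which makes Discourse doubly attractive.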
There are already some domain-specific LLMs, such as:
I’m undecided if a Julia model should be trained on something (that) specific.
One other area where Python is ahead isn’t Pandas per se; I think Julia’s equivalents may have caught up. But there is now already a PandasAI LLM.
at 825 GB dwarfs the Julia ecosystem, I would think, so possibly there’s too much natural-language data vs. programming-language (Julia) data. I haven’t explored what StarCoder and others do. It’s possible that the “low-resource” status of Julia, vs. “high-resource” Python and C++, isn’t only about their relative sizes (that’s a fact; the question here is whether it matters), but also that the relative size of programming-language data to natural-language data matters. If it does, then adding e.g. Python to this Julia LLM might be useful.
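To see how lopsided that ratio could get, a quick calculation (the Julia corpus size here is a hypothetical placeholder; the real figure would need measuring):

```python
# Fraction of a combined corpus that would be Julia code, assuming
# an 825 GB natural-language corpus and a hypothetical Julia corpus.
pile_gb = 825.0
julia_gb = 3.0  # assumed size; not a real measurement

fraction = julia_gb / (pile_gb + julia_gb)
print(round(100 * fraction, 2))  # percent of the mix that is Julia
```

At well under 1% of tokens, Julia would plausibly stay “low-resource” no matter how completely its ecosystem is scraped, which is the argument for up-weighting it at sampling time.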
For something not directly related: of the two major brain regions involved in thinking and intelligence, the neocortex is largely understood already (Thousand Brains Theory); and Transformer networks are now known, at a high level, to be similar to the hippocampus, i.e. there is “a close mathematical relationship of this transformer to current hippocampal models from neuroscience”:
One of the most exciting and promising novel architectures, the Transformer neural network, was developed without the brain in mind. In this work, we show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since
In this work we 1) show that transformers (with a little twist) recapitulate spatial representations found in the brain; 2) show a close mathematical relationship of this transformer to current hippocampal models from neuroscience (with a focus on Whittington et al. (2020) though the same is true for Uria et al. (2020)); 3) offer a novel take on the computational role of the hippocampus, and […] 5) discuss whether similar computational principles might apply to broader cognitive domains, such as language, either in the hippocampal formation or in neocortical circuits.
I think we’re very close to full AGI, and I just now discovered:
By contrast, the biological brain can effectively address catastrophic forgetting through consolidating memories as more specific or more generalized forms to complement each other, which is achieved in the interplay of the hippocampus and neocortex, mediated by the prefrontal cortex. Inspired by such a brain strategy, we propose a novel approach named triple-memory networks (TMNs) for continual learning. TMNs model the interplay of the three brain regions as a triple-network architecture of generative adversarial networks (GANs). […] with implementing appropriate brain-inspired algorithms to alleviate catastrophic forgetting in each module. […]
TMNs achieve the state-of-the-art performance of generative memory replay on a variety of class-incremental learning benchmarks on MNIST, SVHN, CIFAR-10, and ImageNet-50.
To solve the catastrophic forgetting problem of learning temporal data in task incremental scenarios, in this research, we propose a novel method based on attentive recurrent neural networks, called Temporal Teacher Distillation (TTD). TTD solves the catastrophic forgetting problem in an attentive recurrent neural network based on three hypotheses, namely Rotation Hypothesis, Redundant Hypothesis, and Recover Hypothesis. […] not considering the Recover Hypothesis increases extra memory usage in continuously training different tasks. […]
According to experimental results, the proposed TTD significantly outperforms state-of-the-art methods by up to 14.6% and 45.1% in terms of accuracy and forgetting measures, respectively. To the best of our knowledge, this is the first work that studies continual learning in real-world incremental categories for temporal data classification with attentive recurrent neural networks and provides the proper application-oriented scenario.
Based on this simple framework, MeRec achieved leading performance with extremely small memory budget (only two feature vectors for each class) for continual learning on CIFAR-10 and CIFAR-100 datasets, with at least 50% accuracy drop reduction after several tasks compared to previous state-of-the-art approaches.