An LLM just for Julia? A proposal: a Julia *plus* science LLM?

LLMs can now be trained (read this as: fine-tuned) in 30 minutes by regular users (“Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity (aka instruction-following).”), so I think the Julia community should look into that. The cost of training an LLM from scratch has also come down, to as little as about $200,000 vs. $4+ million for ChatGPT. You can now fine-tune “33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU”, i.e. fine-tune the (openly released) state of the art: “We find that Guanaco 65B is the best-performing model after GPT-4, achieving 99.3% performance relative to ChatGPT”.

Julia is on Google’s radar with their state-of-the-art PaLM 2 LLM (known to be smaller than the original PaLM), but Julia is one of the “low-resource” programming languages in its training data, albeit the one with the largest gain in performance. Its scores are thus still somewhat behind the leaders in their benchmark, Python and C++.

How could this be changed? Julia needs a larger fraction of the training data, though I’m not sure it needs to be as large as Python’s share (currently not possible anyway). The trend was toward ever-larger models until it was discovered that smaller can be better, and that is the trend now; the amount of training data you put in then also needs to be scaled down to match.
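As a rough, hedged yardstick for “scaling the data down with the model”: the Chinchilla scaling work suggested on the order of 20 training tokens per parameter for compute-optimal training, so a back-of-the-envelope budget for a smaller Julia-focused model might look like this (the 20x factor is a rule of thumb, not a requirement):

```julia
# Back-of-the-envelope data budget, assuming the ~20 tokens/parameter
# Chinchilla heuristic (a rough rule of thumb, not a hard requirement).
tokens_needed(params) = 20 * params

tokens_needed(1.3e9)   # ≈ 2.6e10, i.e. ~26B tokens for a phi-1-sized model
tokens_needed(15.5e9)  # ≈ 3.1e11, i.e. ~310B tokens for a StarCoder-sized model
```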

StarCoder is the best open-source LLM for programming languages (in general; maybe not for Julia?), and it is much smaller than ChatGPT. I’m not sure about its context window, or how much the window matters for programming languages, but MPT-7B already has twice the context window of ChatGPT/GPT-4, the former leaders, and a large window seems helpful if you want to feed in large codebases rather than just screenfuls of individual functions without context.

What I think is needed is a) putting in all (or as much as possible) of the available Julia code, plus its documentation, and b) just one human language, English, at a minimum. Possibly something domain-specific too, like physics or biology, but excluding e.g. social science.
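A minimal sketch of step a), gathering Julia code and docs into one corpus; the directory layout, the .jl/.md filter and the plain concatenation are my assumptions, and a real pipeline would also need license filtering and deduplication:

```julia
# Hypothetical sketch: gather Julia source and docs from locally cloned packages
# into a single training corpus file. Paths and filters are assumptions only.
function collect_corpus(rootdir::AbstractString, outfile::AbstractString)
    open(outfile, "w") do out
        for (dir, _, files) in walkdir(rootdir), f in files
            endswith(f, ".jl") || endswith(f, ".md") || continue
            path = joinpath(dir, f)
            try
                print(out, read(path, String), "\n\n")   # crude document separator
            catch err
                @warn "skipping unreadable file" path err
            end
        end
    end
end

# e.g. collect_corpus(joinpath(DEPOT_PATH[1], "packages"), "julia_corpus.txt")
```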

Do we need children’s books too? It seems like a strange question, but currently (or at least until recently) the training of large language models feeds in the corpus in essentially random order. It’s not just me questioning that as a bad idea; curriculum learning is precisely about not doing so:

Curriculum learning (CL) is a training strategy that trains a machine learning model from easier data to harder data, which imitates the meaningful learning order in human curricula. As an easy-to-use plug-in, the CL strategy has demonstrated its power in improving the generalization capacity and convergence rate of various models in a wide range of scenarios such as computer vision and natural language processing etc.

This seemingly applies directly to programming languages too, e.g. Julia. So what would a “children’s book” for Julia be? Maybe train first on some simple Julia learning examples (first without types?), some Jupyter notebooks (any suggestions?), and:

https://en.wikibooks.org/wiki/Introducing_Julia
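As a sketch of what curriculum ordering could mean for a Julia code corpus: sort files from “easy” to “hard” before training. The complexity heuristic below (length plus counts of type annotations, macros and where clauses) is entirely made up for illustration, not taken from the curriculum-learning literature:

```julia
# Hypothetical curriculum ordering for Julia source files: train on "easy"
# files first. The complexity heuristic is a stand-in, not a vetted metric.
function complexity_score(path::AbstractString)
    src = read(path, String)
    nlines  = count(==('\n'), src)
    ntypes  = length(collect(eachmatch(r"::", src)))         # type annotations
    nmacros = length(collect(eachmatch(r"@\w+", src)))        # macro calls
    nparams = length(collect(eachmatch(r"\bwhere\b", src)))   # parametric methods
    return nlines + 2ntypes + 3nmacros + 5nparams
end

function curriculum_order(rootdir::AbstractString)
    files = String[]
    for (dir, _, fs) in walkdir(rootdir), f in fs
        endswith(f, ".jl") && push!(files, joinpath(dir, f))
    end
    return sort(files; by = complexity_score)   # easiest files first
end
```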

But before training on code, which comes with comments and docs, the model needs to learn at least some English (and just English; other languages would be about as bulky, and with 2+ languages you possibly get up to a 2x increase in training and inference time).

Currently the English-language corpora used in most LLMs come from all over the place, e.g. Wikipedia (for encyclopedic knowledge, but is that needed? Maybe only some Wikibooks on general programming concepts and OS theory?), Reddit (most of it not relevant; possibly only the Julia category there) and books (Moby Dick likely not needed…). I’m not sure whether they are trained on StackOverflow answers (is that even allowed?), but it seems obvious to train on Julia Discourse. It’s the best language community I know of, and maybe larger than some others (or the others are scattered across many places, mostly not accessible or not used in LLM training?).

There are already some domain-specific LLMs, such as:

I’m undecided whether a Julia model should be trained on something that specific.

One other area where Python is ahead isn’t Pandas per se; I think Julia’s equivalents may have caught up. But there is now already a PandasAI LLM.

That corpus, at 825 GB, dwarfs the Julia ecosystem I would think, so possibly there would be too much natural-language data relative to programming-language (Julia) data. I haven’t explored what StarCoder and others do. It’s possible that the “low-resource” status of Julia, vs. the “high-resource” status of Python and C++, isn’t only about their relative sizes (the size difference is a fact; the question here is whether it matters), but also about the relative size of programming-language data to natural-language data. If that ratio matters, then adding e.g. Python to this Julia LLM might be useful.

For something not directly related: of the two major brain regions involved in thinking and intelligence, the neocortex is already largely understood (the Thousand Brains Theory), and transformer networks are now known to be, at a high level, similar to the hippocampus, i.e. there is “a close mathematical relationship of this transformer to current hippocampal models from neuroscience”:

https://arxiv.org/pdf/2112.04035.pdf

One of the most exciting and promising novel architectures, the Transformer neural network, was developed without the brain in mind. In this work, we show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since
[…]
In this work we 1) show that transformers (with a little twist) recapitulate spatial representations found in the brain; 2) show a close mathematical relationship of this transformer to current hippocampal models from neuroscience (with a focus on Whittington et al. (2020) though the same is true for Uria et al. (2020)); 3) offer a novel take on the computational role of the hippocampus, and […] 5) discuss whether similar computational principles might apply to broader cognitive domains, such as language, either in the hippocampal formation or in neocortical circuits.

I think we’re very close to full AGI, and just now discovered:

By contrast, the biological brain can effectively address catastrophic forgetting through consolidating memories as more specific or more generalized forms to complement each other, which is achieved in the interplay of the hippocampus and neocortex, mediated by the prefrontal cortex. Inspired by such a brain strategy, we propose a novel approach named triple-memory networks (TMNs) for continual learning. TMNs model the interplay of the three brain regions as a triple-network architecture of generative adversarial networks (GANs). […] with implementing appropriate brain-inspired algorithms to alleviate catastrophic forgetting in each module. […]
TMNs achieve the state-of-the-art performance of generative memory replay on a variety of class-incremental learning benchmarks on MNIST, SVHN, CIFAR-10, and ImageNet-50.

Jan 2023:

To solve the catastrophic forgetting problem of learning temporal data in task incremental scenarios, in this research, we propose a novel method based on attentive recurrent neural networks, called Temporal Teacher Distillation (TTD). TTD solves the catastrophic forgetting problem in an attentive recurrent neural network based on three hypotheses, namely Rotation Hypothesis, Redundant Hypothesis, and Recover Hypothesis. […] not considering the Recover Hypothesis increases extra memory usage in continuously training different tasks. […]
According to experimental results, the proposed TTD significantly outperforms state-of-the-art methods by up to 14.6% and 45.1% in terms of accuracy and forgetting measures, respectively. To the best of our knowledge, this is the first work that studies continual learning in real-world incremental categories for temporal data classification with attentive recurrent neural networks and provides the proper application-oriented scenario.

Based on this simple framework, MeRec achieved leading performance with extremely small memory budget (only two feature vectors for each class) for continual learning on CIFAR-10 and CIFAR-100 datasets, with at least 50% accuracy drop reduction after several tasks compared to previous state-of-the-art approaches.

5 Likes

To be fair, I didn’t read beyond the first few paragraphs (that’s a lot of links and details, thanks!), but I’m in complete agreement: a Julia fine-tuned LLM could be very beneficial for our community! It would help us get both better docs and better “autocomplete”, which are both sorely needed.

I also think the self-hosted models should be far cheaper to train, in the hundreds of USD or less. Lots of recent developments have significantly brought down the cost of training competitive LLMs.

@williamfgc was very interested in this idea as well.

2 Likes

While I at first wrote $200 to train, I changed that first paragraph so it no longer claims that cost for training from scratch. For fine-tuning, though, it’s plausible, and what that means in practice is, most recently, LoRA plus 4-bit (integer) quantization. Those methods have now been improved upon further: not just the 4-bit floats sometimes used before, but a new data type, a 4-bit NormalFloat (“information theoretically optimal […]”), and more innovation (see the toy quantization sketch after the quote below):

QLORA: Efficient Finetuning of Quantized LLMs

We present QLORA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLORA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLORA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes.
[…]
Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
[…]
Our QLORA finetuning method is the first method that enables the finetuning of 33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU, while not degrading performance relative to a full finetuning baseline. We have demonstrated that our best 33B model trained on the Open Assistant dataset can rival ChatGPT on the Vicuna benchmark. Since instruction finetuning is an essential tool to transform raw pretrained LLMs into ChatGPT-like chatbots, […] QLORA can be seen as an equalizing factor that helps to close the resource gap between large corporations and small teams with consumer GPUs.

I don’t think “instruction finetuning is an essential tool”; at least one other paper puts that into doubt, or at least shows it can be done with as few as 1,000 prompts.
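To make the 4-bit idea concrete, here is a toy blockwise absmax quantizer in plain Julia. It is only a sketch of the general idea: QLoRA’s NF4 uses quantile-based (normal-distribution) levels and packs two codes per byte, neither of which is done here:

```julia
# Toy sketch of blockwise 4-bit absmax quantization. QLoRA's NF4 data type uses
# normal-distribution quantiles and packed storage; this uniform grid is a stand-in.
function quantize_4bit(w::Vector{Float32}; blocksize::Int = 64)
    nblocks = cld(length(w), blocksize)
    codes   = Vector{Int8}(undef, length(w))   # 4-bit codes, kept in Int8 for clarity
    scales  = Vector{Float32}(undef, nblocks)  # one absmax scale per block
    for b in 1:nblocks
        lo, hi = (b - 1) * blocksize + 1, min(b * blocksize, length(w))
        block  = view(w, lo:hi)
        s      = maximum(abs, block)
        scales[b] = s == 0 ? 1f0 : s
        # map [-s, s] onto the 15 signed levels -7..7 (uniform, unlike NF4)
        codes[lo:hi] .= round.(Int8, clamp.(block ./ scales[b], -1f0, 1f0) .* 7)
    end
    return codes, scales
end

dequantize_4bit(codes, scales; blocksize = 64) =
    [Float32(codes[i]) / 7 * scales[cld(i, blocksize)] for i in eachindex(codes)]

# Round-trip error should be small relative to the weights:
w = randn(Float32, 1024)
codes, scales = quantize_4bit(w)
maximum(abs, w .- dequantize_4bit(codes, scales))
```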

Do you have an idea of a good Julia dataset that could be used to fine-tune one of these open LLMs? I believe this is one of the weak points, and why we don’t see Julia added to the training of these models. Julia is a young and still not very popular language, so there isn’t that much code around to train these models on. I believe that if we at least had a good dataset with a permissive license, it could then be improved with something like what is proposed in “Textbooks Are All You Need” to build a good Julia LLM.

And now we have a recipe for not just a fine-tuned, but a trained-from-scratch, state-of-the-art model for Julia (and likely Python too) in about 5 days of training time!

Yes, thanks for the interesting paper link. As I said, we could replicate the methodology to pretrain only for Julia (and then also fine-tune, as explained in the paper), or, if you meant reusing the model and fine-tuning it only on Julia, that might plausibly work too in just about 4 hours.

Date  Model  Model size  Dataset size (Tokens)  HumanEval (Pass@1)  MBPP (Pass@1)
2021 Jul Codex-300M [CTJ+21] 300M 100B 13.2% -
2021 Jul Codex-12B [CTJ+21] 12B 100B 28.8% -
[…]
2022 Apr PaLM-Coder [CND+22] 540B 780B 35.9% 47.0%
2022 Sep CodeGeeX [ZXZ+23] 13B 850B 22.9% 24.4%
2022 Nov GPT-3.5 [Ope23] 175B N.A. 47% -
2022 Dec SantaCoder [ALK+23] 1.1B 236B 14.0% 35.0%
2023 Mar GPT-4 [Ope23] N.A. N.A. 67% -
2023 Apr Replit [Rep23] 2.7B 525B 21.9% -
2023 Apr Replit-Finetuned [Rep23] 2.7B 525B 30.5% -
2023 May CodeGen2-1B [NHX+23] 1B N.A. 10.3% -
2023 May CodeGen2-7B [NHX+23] 7B N.A. 19.1% -
2023 May StarCoder [LAZ+23] 15.5B 1T 33.6% 52.7%
2023 May StarCoder-Prompted [LAZ+23] 15.5B 1T 40.8% 49.5%
2023 May PaLM 2-S [ADF+23] N.A. N.A. 37.6% 50.0%
2023 May CodeT5+ [WLG+23] 2B 52B 24.2% -
2023 May CodeT5+ [WLG+23] 16B 52B 30.9% -
2023 May InstructCodeT5+ [WLG+23] 16B 52B 35.0% -
2023 Jun WizardCoder [LXZ+23] 16B 1T 57.3% 51.8%
2023 Jun phi-1 1.3B 7B 50.6% 55.5%

Table 1: We use self-reported scores whenever available. Despite being trained at vastly smaller scale, phi-1 outperforms competing models on HumanEval and MBPP, except for GPT-4 (also WizardCoder obtains better HumanEval but worse MBPP).
[…]
In Section 2, we give some details of our training process, and we discuss evidence for the importance of our data selection process in achieving this result. Moreover, despite being trained on much fewer tokens compared to existing models, phi-1 still displays emergent properties.

The Stack already has Julia in its dataset, seemingly enough code, i.e. 3+ GB (vs. e.g. 190.73 GB for Python, of which only a tiny fraction was used by the above paper).

Language The Stack† CodeParrot† AlphaCode CodeGen PolyCoder†
Assembly 2.36 0.78
Batchfile 1.00 0.7
C 222.88 183.83 48.9 55
C# 128.37 36.83 38.4 21
C++ 192.84 87.73 290.5 69.9 52
[…]
Haskell 6.95 1.85
HTML 746.33 118.12
Java 271.43 107.7 113.8 120.3 41
JavaScript 486.20 87.82 88 24.7 22
Julia 3.09 0.29
[…]
Python 190.73 52.03 54.3 55.9 (217.3) 16
[…]
Total 3135.95 872.95 715.1 314.1 253.6
Table 1: The size of The Stack (in GB) compared to other source code datasets used for pre-training LLMs. † indicates the dataset is publicly released. The Stack is more than three times the size of CodeParrot, the next-largest released code dataset.
[…]
We see that the all-license dataset contains over 29 TB of data. Only selecting permissively licensed files reduces the dataset to 3.1 TB, i.e. only roughly 10% of the dataset is kept. […] We might be able to increase [that subset] by adding more licenses to the permissive licenses list. […] For the permissive license dataset, the four biggest languages–HTML (746 GB), Javascript (486 GB), Java (271 GB), and C (222 GB)–consume more than 55% of the dataset size. [Julia is 25th.]

https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

Here are some helpful rules of thumb for understanding tokens in terms of lengths:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ words
  • 100 tokens ~= 75 words

So we have 3.09/4 = 0.77 billion tokens of Julia code right there, at least (that might be pessimistic; I’m not sure 4 characters per token is a good rule of thumb for code). The all-licenses Julia subset is 21.75 GB while the permissive subset is 3.09 GB (strange, I thought most packages use MIT, with GPL/copyleft only a tiny fraction, so what is being skipped?), so the former would be about 5.4 billion tokens.
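A quick sanity check of those numbers, using the ~4 characters/token rule of thumb (which, as said, may well be off for source code):

```julia
# Rough token estimates from corpus size, assuming ~1 byte per character
# and ~4 characters per token (likely imprecise for code).
est_tokens(gigabytes; chars_per_token = 4) = gigabytes * 1e9 / chars_per_token

est_tokens(3.09)    # ≈ 7.7e8 tokens (permissively licensed Julia in The Stack)
est_tokens(21.75)   # ≈ 5.4e9 tokens (all-license Julia subset)
est_tokens(190.73)  # ≈ 4.8e10 tokens (permissively licensed Python, for comparison)
```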

Did you have this in mind? (We might be able to skip this part, i.e. reuse what’s already been done for Python, or additionally make such synthetic Julia code, possibly by transpiling that synthetic Python code to Julia with an available transpiler; see also the sketch after the quote below.)

2.2 Creation of synthetic textbook-quality datasets
One of the main challenges in creating a high-quality dataset for code generation is ensuring that the examples are diverse and non-repetitive. […]
The synthetic textbook dataset
This dataset consists of less than 1B tokens of GPT-3.5 generated Python textbooks, synthesized to provide a high-quality source of natural language heavy text interleaved with relevant code snippets. […]
The CodeExercises dataset
This is a small synthetic exercises dataset consisting of less than 180M tokens of Python exercises and solutions.
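If we did want a Julia analogue of the CodeExercises idea rather than (or in addition to) transpiling, a hypothetical starting point could be as simple as templating exercise prompts for whatever teacher model is available; the topics and template below are made up for illustration:

```julia
# Hypothetical sketch: build "CodeExercises"-style prompts for generating
# synthetic Julia exercises with a teacher LLM. Topics and template are invented.
const TOPICS = ["multiple dispatch", "broadcasting", "DataFrames basics",
                "error handling", "writing a simple macro"]

exercise_prompt(topic) = """
    Write a short, self-contained Julia exercise about $(topic).
    Include a docstring describing the task, a function stub, and a
    reference solution with a brief explanation, as in a textbook.
    """

prompts = exercise_prompt.(TOPICS)   # feed these to a teacher model, then filter the output
```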

We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly.

1 Like