An LLM fine-tuned for Julia, call for comments + help

Overview

There’s been some discussion in the Julia Slack about fine-tuning a model for Julia specifically, in part because the language does not have as large an existing code base as other, much older languages. There’s an existing Discourse post on it, but I wanted to make a separate thread to build a specific plan and hopefully gather some volunteers.

Goals

  • Gather a high-quality Julia-specific training set.
  • Publish the dataset to HuggingFace for use in future models trained by others.
  • Produce a fine-tuned LLM specifically for Julia.

Specifics

Training data

  • Retrieve all code in packages registered in General, with an optional stretch goal of also gathering Julia code that lives in repos but isn’t registered, provided it has tests and seems to function (see the sketch after this list).
  • Remove cases with licenses incompatible with fine-tuning – we want to respect licensing. I’m not sure exactly which licenses work, but this seems to suggest Apache 2.0, MIT, and the BSD licenses should be fine. I’m not sure what to do about repos that have no license at all.
  • Re-format all the files with each of the common formatting styles (e.g. those in JuliaFormatter.jl) to increase the number of code samples. I’d imagine we should include the un-formatted code as well.
  • Remove code written for older versions of Julia, i.e. produced prior to Julia 1.0 or maybe even 1.6. We could try a few cutoff points.
  • We could also consider synthetic data, where we ask an LLM to generate code, evaluate it to check that it behaves the way we want, and then include it in the training set. There is some debate about whether this is a good idea.
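Here is a rough sketch of what the collection and augmentation steps could look like. Assumptions (mine, not a finished pipeline): a local git clone of JuliaRegistries/General at `registry_dir`, the usual Registry.toml/Package.toml layout, a deliberately naive license check, and JuliaFormatter.jl for the style permutations – treat it as a starting point only.

```julia
using TOML
using JuliaFormatter  # format_text with pluggable styles (Blue, YAS, SciML, ...)

"Return name => repo_url pairs from a local clone of the General registry."
function registry_packages(registry_dir::AbstractString)
    reg = TOML.parsefile(joinpath(registry_dir, "Registry.toml"))
    pkgs = Pair{String,String}[]
    for (_, info) in reg["packages"]
        pkg = TOML.parsefile(joinpath(registry_dir, info["path"], "Package.toml"))
        push!(pkgs, pkg["name"] => pkg["repo"])
    end
    return pkgs
end

"Very naive license check on a cloned package directory (no LICENSE file => exclude)."
function permissively_licensed(pkg_dir::AbstractString)
    for f in ("LICENSE", "LICENSE.md", "LICENSE.txt")
        path = joinpath(pkg_dir, f)
        isfile(path) || continue
        text = read(path, String)
        return occursin("MIT", text) || occursin("Apache", text) || occursin("BSD", text)
    end
    return false
end

"Augment one source file: the un-formatted original plus one copy per formatting style."
function style_variants(code::AbstractString)
    styles = (DefaultStyle(), BlueStyle(), YASStyle(), SciMLStyle())
    variants = String[code]
    for s in styles
        try
            push!(variants, format_text(code; style = s))
        catch
            # skip files JuliaFormatter cannot parse (e.g. very old code)
        end
    end
    return unique(variants)
end
```

Cloning is then just a git clone per repo URL, and style_variants gives the “num formats × num scripts” augmentation discussed in the Slack thread below.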

EDIT: I also wonder whether simply getting more high-quality Julia code into public datasets would lead the big providers to use it. They’re always data hungry, and just tidying up the data could get us a long way.

Fine tuning

  • There’s some discussion about whether we want a next-token model for code completion only (i.e. Copilot-type stuff) or an instruction-tuned model (like Claude/ChatGPT, etc.).
  • Code completion is the easiest.
  • Instruction tuning requires that we have explanations of what’s going on in the code, and it’s a little out of my wheelhouse. I’d love to hear some advice here!
  • We can use unsloth, HuggingFace, or probably any of the many other tools for this.
  • Possibly provide an optional callback or telemetry function in AIHelpMe.jl to let users share high-quality instructions – we’d stick these in a database somewhere and use them for fine-tuning (see the sketch after this list).
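To make the telemetry idea concrete, here’s a minimal sketch of an opt-in sharing hook. SHARE_DIR and share_conversation are made-up names for illustration; the only real API used is PromptingTools’ save_conversation (JSON serialization), and the eventual upload/database step is left out entirely.

```julia
using PromptingTools
const PT = PromptingTools

# Hypothetical location for locally staged, user-approved conversations.
const SHARE_DIR = joinpath(homedir(), ".julia", "aihelpme_shared")

"Save a conversation locally if (and only if) the user opted in; a real version would also upload it."
function share_conversation(conversation::Vector{<:PT.AbstractMessage}; opt_in::Bool = false)
    opt_in || return nothing
    mkpath(SHARE_DIR)
    file = joinpath(SHARE_DIR, string(time_ns(), ".json"))
    PT.save_conversation(file, conversation)  # serializes the messages to JSON
    return file
end
```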

I believe @tylerjthomas9 has done some work here – I’d love to hear what kind of stuff you have.

Via @svilupp:

Re the instruct dataset, yeah, other models are how everyone does it – you build a pipeline where you generate & curate to get a high-quality dataset (i.e., duplicative stuff won’t do as much). I think Meta even said that’s how they did Llama 3 (using Llama 2 70B for the dataset prep)

Tooling

  • Users can serve local models to VS Code with this extension, or via LM Studio, Ollama, vLLM, etc. (a minimal Julia-side sketch is below this list).
  • We could provide a cloud service, hand out API keys, and maybe let people pay for usage. Depending on the model we train, this could be fairly inexpensive, but it’s a larger infrastructure discussion.
  • Any other comments here?
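For the local-serving route, calling the model from Julia could look roughly like this – assuming Ollama is serving a (hypothetical) fine-tuned model tagged julia-coder, and that your PromptingTools version provides an Ollama schema (the schema name has changed across versions, so double-check):

```julia
using PromptingTools
const PT = PromptingTools

schema = PT.OllamaSchema()   # talks to a local Ollama server (default port 11434)
msg = aigenerate(schema,
    "Write a Julia function that returns the running mean of a vector.";
    model = "julia-coder")   # hypothetical model tag for our fine-tune
println(msg.content)
```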

Next steps

  • I want to hear from you all what you think!
  • If we like the idea, I want to set up a community call every week or every other week to try and track who is doing what and what people need to move ahead.

Here’s the Slack thread for posterity.

Slack thread

Mateusz Kaduk 5 days ago

Are you aware of any “CoPilot” that is Julia-optimized? For example, Copilot, or especially ChatGPT, sometimes seems to ignore 1-based indexing or column-major order, and often replicates the order of dimensions in an array as if it were PyTorch.

SixZero 5 days ago

Actually I proposed an idea to finetune a model which learns from Discourse/Slack #helpdesk channel conversations, so it could help with the 9-out-of-10 repeat questions coming to those platforms, but it was not loved that much, so I guess back then the community was not really happy about LLMs.

SixZero 5 days ago

I looked into Slack’s Terms of Use and, according to that, I think saving messages for LLM training is actually allowed (I hope I am not missing any point in the terms of use).

Mateusz Kaduk 5 days ago

Can the GitHub repos with Julia code be used? Or would the licenses not allow it?

SixZero 5 days ago

I think whatever is a public repo on GitHub with a free license could probably be used. But yeah, I might be too “open”-minded on this. (edited)

Mateusz Kaduk 5 days ago

Most repos should have LICENSE file attached that is standard. (edited)

SixZero 5 days ago

Yeah, we might separate out the ones which have no LICENSE, but I think that is still super big amount of code to be used for learning.

SixZero 5 days ago

Imagine asking something on Slack, and receiving an answer within 10 seconds in the specific channels we allow this bot to run in. (Not waiting 1 hour for someone to answer your question.) (edited)

SixZero 5 days ago

I wonder what accuracy we would get if it was only a RAG supported AI solution.

SixZero 5 days ago

Basically we would connect in @svilupp’s AIHelpMe.jl package xD

Mateusz Kaduk 5 days ago

The major issue is column-major order and 1-based indexing for many standard algorithms, because they were implemented in another language. I am not sure if this can be completely solved with fine-tuning. I was thinking of something like a JuliaCoPilot, if it exists. I know ChatGPT allows creating “expert chats”, which I suppose is some sort of tuning. But they also make these mistakes.

SixZero 5 days ago

I am pretty happy with gpt4.0 on this front, and some assistants can use gpt4.0 too. Btw I might be a little bit stupid somewhere here, because I don’t see why we are not utilizing LLMs more

Mateusz Kaduk 5 days ago

I think it would be cool project for Julia community putting all these APIs into practical use for Julia!

Cameron Pfiffer 5 days ago

I’m curious – does anyone have all the Julia source code? We could try fine-tuning one of the smaller models

SixZero 5 days ago

“all the Julia source code” hehe sounds funny

SixZero 5 days ago

I hope we don’t have to write some webscraper for this tho. (edited)

SixZero 5 days ago

Btw as far as I remember, JuliaHub had some model which was just for finding packages for us, right? But probably that was RAG and not a finetune.

Cameron Pfiffer 5 days ago

Yeah it was RAG

Cameron Pfiffer 5 days ago

@aviks do you guys have a cache of Julia code somewhere?

Mateusz Kaduk 5 days ago

I meant majority of Julia packages are hosted on GitHub

Cameron Pfiffer 5 days ago

I am aware, I mean downloaded and stored somewhere.

Cameron Pfiffer 5 days ago

It’s a pain to go download them again if someone has already done so.

Avik Sengupta 5 days ago

We don’t have any dataset stashed, but I think I know someone who was working on curating something like this. I’ll try to see if they can chime in here.

Mateusz Kaduk 5 days ago

It would probably be curating that package registry, GitHub - JuliaRegistries/General: The official registry of general Julia packages, or having a script iterate over it and git clone

SixZero 5 days ago

Correct me if I am wrong, but if we need the dataset we just need the list of repo urls and then with any script just download off them.

  1. So first get the list of packages we need. (I don’t know if all julia packages would make sense, or just the most popular ones)
  2. Download them
  3. We need a way to process the julia package format, like go over all files in src directory and doc/test/run folders too?

(edited)

Tyler Thomas 5 days ago

I have started working on this. I have a dataset that has all packages in the general registry (plus Julia base code) and any documentation that I could pull. I’d love to collaborate with anyone else interested in this (edited)

SixZero 5 days ago

I would ask whether we have some good idea on what the 3rd step could look like, is there a package, where we can define a protocol, on how to use a specific dataset, or something. I mean what if there were a system which could annotate automatically the textual data in some formats. I wonder if it makes sense, but:

  • I think the training data should benefit from any annotation data that can be associated with it. So each file should get annotations for which file it is and which package it belongs to…
  • I would guess popularity of packages would also be a somewhat good label for the data.
  • What else could we come up with?

I don’t know if I am right here, but tell me if you guys think it is not how this is done. (edited)

Tyler Thomas 5 days ago

One side benefit of creating and maintaining a dataset would be that future foundation models would be more likely to include newer Julia data. As for your annotation data, we could do this with some chat template like ChatML to make inference easier. I have also wondered: if we formatted all the code to one style guide, would the downstream model be better at producing code that executes without syntax errors? (edited)

SixZero 5 days ago

What do you mean by formatting all the code? What formatting you meant?

Cameron Pfiffer 5 days ago

i.e. bluestyle or something

Tyler Thomas 5 days ago

All blue style or some other style from JuliaFormatter.jl

Cameron Pfiffer 5 days ago

I think it’s a good idea

Cameron Pfiffer 5 days ago

Then recs from the model would have consistent formatting

SixZero 5 days ago

I think it should handle all the formats people use in real life. But if we need more data, then it is a good way to augment the dataset. (edited)

Cameron Pfiffer 5 days ago

Oooo I see what you mean

Cameron Pfiffer 5 days ago

Like fork all the files and format them all

Cameron Pfiffer 5 days ago

Then num formats * num scripts = samples (edited)

Tyler Thomas 5 days ago

That’s an interesting idea. I’ll make my initial version of the dataset public this weekend in case anyone wants to experiment

Cameron Pfiffer 5 days ago

I’ll tinker. Never done any fine tuning before but I’d love to fuck around

SixZero 5 days ago

Same.

Tyler Thomas 5 days ago

I’d be happy to jump on a call and walk you through what I’ve done too

SixZero 5 days ago

I would want to have a good 3rd-step solution, to create a way to make the data processable for different formats: like “packages”, but if a package directory has .html in it, then that part would be used as HTML data… and so on…

Mateusz Kaduk 5 days ago

Aren’t there standardized pipelines and tokenizers for things like code that handle most of that stuff ? I have no clue but I would imagine there are ?

SixZero 5 days ago

But I am a little bit afraid of the fact, julia would perform not that good at text processing tasks xD

SixZero 5 days ago

I would love to know, because then we wouldn’t need to work this out xD

Tyler Thomas 5 days ago

We can tokenize any document in the GitHub repos and add it to the training data. I did this in the first pass with the markdown files like README.md

SixZero 5 days ago

this was the talk on something similar to this (Slack link).
Eventually not just textual data but image, sound and video formats could be handled by this.

SixZero 5 days ago

Yeah, tokenizing from text is straightforward I guess, but how do we order the training data (or context)? I might be going crazy, but I would want a ready-made solution for this.

Cameron Pfiffer 5 days ago

probably this Fine-tune a pretrained model

SixZero 5 days ago

I heard from this guy that this solution is -40% VRAM? and there are some efficiency improvements? (Google Colab link)

Tyler Thomas 5 days ago

Unsloth is great for single gpu finetuning (especially with qlora), which is probably all we’d need for this (edited)

Cameron Pfiffer 5 days ago

oh that’s sick

Mateusz Kaduk 5 days ago

Maybe this is useful ? https://huggingface.co/bigcode/starcoderbase

Mateusz Kaduk 5 days ago

and Datasets - BigCode

(link preview: The Stack – a 6.4 TB dataset of permissively licensed source code in 358 programming languages, released and maintained as part of the BigCode project; v1.0 covered 30 languages and 18 permissive licenses (three of them – MPL/EPL/LGPL – are weak copyleft), with a near-deduplicated size of 3 TB.)

Mateusz Kaduk 5 days ago

There is Julia there as well

Tyler Thomas 5 days ago

I went through a couple of them, and there was less Julia code than my initial pull of the general registry code. I think it would be good to combine in non-package code. Also, a lot of the code is older, so newer Julia packages and features will be missing

Mateusz Kaduk 5 days ago

Yeah, I suspect it’s probably a matter of setting this up with the right data, but maybe a good starting point.

Mateusz Kaduk 5 days ago

@Tyler Thomas regarding old code, that is a valid point. I think deprecated stuff should be excluded, as it just leads to the model giving bad advice about functions that do not even exist anymore.

Jan Siml 5 days ago

I love the energy! My 2c:

  • what is the goal of the finetuning project? do you have some specific workflows or tasks that you’re targeting?
  • are you planning to simply train on raw code? or use the raw code to create synthetic instruction dataset?

I have very limited expertise here, but I’ve been following a few fine-tunes for multilingual models (eg, German embeddings). In the end, the biggest lift was the curation of the instruct dataset. The tasks/goal is crucial, because it will help focus the data curation + measure the progress. A “generic” model might be death by a thousand papercuts, at least in the start. As for the data, I presume we wouldn’t want to build a foundation model by doing next-token prediction on source code (anyone have a spare $1M? ). The most common (and probably the only feasible) path is to finetune, in which case we wouldn’t be pumping in raw script files, but rather highly curated conversations that reflect the behaviors/tasks/etc we want.
In my understanding, forcing in all the source code would just undo a lot of the instruction tuning, and we wouldn’t have the money (or expertise) to fix it. In terms of the finetuning itself, that’s easy (especially at first, with things like LoRA). I have good experience with Axolotl and can share the setup and all scripts to ultimately distil it into GGUF. The GPU compute was trivial and cheap on Jarvis.ai – I think I paid like half a dollar for a mini tuning job. But in general, once we have a good dataset, we can first test it a bit with some commercial finetuning service (OpenAI, Together, etc.) to see quickly how well it works without having to fiddle with the loss curves and GPU utilization

Cameron Pfiffer 5 days ago

  1. The goal is to have a Julia-optimized code model. Next-token is probably better here for now than instruction tuned, for exactly the reason you outlined.
  2. We would not do a new foundation model, that is silly – pick one of the open source code models and fine tune that.

It is a good question though – how do we get instruct data?

Cameron Pfiffer 5 days ago

Hm if only the slack history was not completely hidden…

Tyler Thomas 5 days ago

I think that we would want to finetune (with lora/qlora/…) several models. One with the fill in the middle objective on raw Julia code. This would be best for a copilot replacement that is hopefully better at Julia. The second would be the instruct model that you are talking about. I’m not sure what is the best way to get enough high quality examples. I could also see us wanting to finetune a model on next token prediction then instruct tune it.

Tyler Thomas 5 days ago

Maybe we could use llama 70B or an api to automatically generate instructions. This could at least give us a baseline

Cameron Pfiffer 5 days ago

I do actually think synthetic data here would be kind of cool

Cameron Pfiffer 5 days ago

We could also have a benchmarking/evaluation agent to filter down bad examples that are slow or don’t run

Jan Siml 5 days ago

I think compiling the instructions for some possible tasks is the most useful thing to start with. First, you can use it as is for finetuning and get good results. Second, if you do NTP, then you’ll have to instruction-tune anyway to still have a useful model (unless you’re building a code completion engine). My push is to focus, because we have neither the experience nor the resources, so we should start like everyone does and follow the tried and tested recipe.
(And, arguably, we’re really just trying to tweak the style/syntax towards Julia… adding knowledge via finetuning is usually discouraged.) Re the instruct dataset, yeah, other models are how everyone does it – you build a pipeline where you generate & curate to get a high-quality dataset (i.e., duplicative stuff won’t do as much). I think Meta even said that’s how they did Llama 3 (using Llama 2 70B for the dataset prep)

Jan Siml 5 days ago

re evaluation and filtering, we would have some functionality and infrastructure already in place from the Leaderboard

Cameron Pfiffer 5 days ago

Sounds like AIHelpMe.jl is probably a great place to start for generating instruction tuned samples?

Jan Siml 5 days ago

Ideally, people would just use it and SHARE them

Jan Siml 5 days ago

that’s all we need

Jan Siml 5 days ago

for diversity

Cameron Pfiffer 5 days ago

Could add a telemetry option as well

Cameron Pfiffer 5 days ago

I.e. do you want to automatically submit this

Jan Siml 5 days ago

for quantity, we can generate higher quality faster but it would be too sterile (I suspect)

Jan Siml 5 days ago

Yeah, I’ve recently added callbacks so we could use it for that. Plus we can now serialize the full RAGResult, including context, into JSON.

Cameron Pfiffer 5 days ago

This all sounds very reasonable to me. I think my proposal of a next step would be put up an RFC on the Discourse about what we’re thinking and how best to do it. I’d love to have some kind of Julia bot that’s a little more up-to-date and syntactically aware

Mateusz Kaduk 4 days ago

Benchmark showing how copilot is bad at that vs how this is fixed by fine tuned model would be good to have.

Mateusz Kaduk 4 days ago

Maybe people could contribute problems they have with Copilot when coding in Julia? For example, dot-product attention is QK' in row-major languages like Python, but for Julia it should be all transposed, as it is a column-major language, so K'Q instead.

26 Likes

I’ve shared this before, and I assume there are JuliaHub folks in the Slack so maybe this has been mentioned, but JuliaHub already offers AskAI, a Julia LLM. That tool may be proprietary, but in any case perhaps there are ways to collaborate and share best practices.

I am certainly not an expert, but IIRC works without an explicit license (code or otherwise) should be treated as copyrighted by the creator. So legally it may be better to only use code with explicit permissive licenses?

2 Likes

This is a RAG tool, not a fine-tuned model. A fine-tuned model would make the RAG tooling even better.

But yes, I agree that we should discard unlicensed code!

3 Likes

Ahhh, interesting, I was unaware of RAG (not my area). In any case, I’m looking forward to a good Julia helper tool!

2 Likes

Here is my initial dataset: tylerjthomas9/JuliaCode · Datasets at Hugging Face. I made this so that we could train models with up-to-date Julia code/libraries such as TidierDB.jl.

I am looking to add non-package Julia code from other code datasets / BigQuery, and documentation produced by Documenter.jl.
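For anyone who wants to poke at the dataset from Julia, something like this should work – an untested sketch that assumes the HuggingFaceDatasets.jl wrapper around the Python `datasets` library (the exact keyword and field names may differ; check the dataset card):

```julia
using HuggingFaceDatasets

ds = load_dataset("tylerjthomas9/JuliaCode", split = "train")
@show length(ds)
@show ds[1]   # inspect one record to see which fields (code, package, …) it has
```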

7 Likes

I think there are several reasons why we should be focusing on building and publishing a high-quality curated dataset, not (primarily) training or finetuning.

  1. We know Julia code, and most of us are not LLM experts. So let’s play to our strengths.
  2. If you build it, they will come (i.e. the pros will use it). You pointed this out too. Just getting more high-quality Julia code into the big general models would be a huge win. Maybe we can use our network to reach out to some top LLMs and outright ask how we can get them to use our dataset.
  3. The LLMs are evolving and getting retrained from scratch many times faster than what we can keep up with on the finetuning side. And many actors are involved. Just last week there were 5 or so major releases of new local/offline models of different sizes with quality more or less comparable with GPT 3.5. If we make a finetune today it could well be completely outdated within a few weeks or months compared to the SOTA.
  4. Similarly, we can expect the same dataset to be useful for many instances and generations of LLMs, while our code base is relatively stable over time.

I’m not saying we should abandon the idea of a community finetune, just that the dataset is more important.

As for your training bullet points above (on formatting style, code for older Julia versions and whether to include synthetic data) I’d say include it all but tag it well and leave inclusion decisions to the finetuners. Or maybe just exclude code for Julia v0.x that never was updated to v1.0.

6 Likes

This is a good point. I think I agree that fine-tuning is relatively low priority, and that the data side is the priority.

However, it’s not too difficult to tune multiple models once you have the tooling set up. @svilupp wrote about this a bit here, and he has already made a fine-tuning tool here. In principle, we could instead build a fine-tuning pipeline rather than focus efforts on a single model, which would alleviate some of the concerns about models moving too quickly.

I mean, IMO this is not a reason to not do something. I personally love learning random new things and I think it’d be a fun, unifying community effort to learn how to do something kind of weird and different.

Love this idea, :+1:

4 Likes

Thinking some thoughts…

As Cameron pointed out, you can find all the scripts necessary in the LLM Leaderboard repo.

I’m happy to organize a call and walk people through the process.

Now that it’s set up, it’s pretty straightforward, so people could finetune their own models (assuming good data is available). The fine-tuning time is less than an hour end-to-end, and it was like $0.5 on Jarvis.ai (by far the easiest setup for the GPU-poor like me).

When it comes to data collection, I’d say there are broadly two kinds:

  • RAW: source code, etc
  • INSTRUCTIONS: conversations/chats/tasks

The former is often used early in training of the foundation model (huge volumes); the latter is used in instruction-tuning/RLHF (small, high-quality samples).

It’s a crude distinction, but I wanted to highlight the main differences.

Assuming the goal here is to use it for tasks/code generation/etc, we can focus on compiling a smaller high-quality task dataset. It’s also much easier and cheaper :slight_smile:
(If we’re training an auto-completion model, the raw stuff would be more relevant, but we would need different models altogether to be fast enough and practical…)

To collect it, we could:

  • record all your conversations in PromptingTools (see save_conversation, which serializes to JSON). You can even do it automatically with the new TracerSchema, which wraps your normal schema; you can overload the finalizer to always save your messages automatically!
  • record all questions asked in AIHelpMe (again, we can simply set up the “TracerSchema” to auto-record)
  • use samples in the LLM Leaderboard (filtered! That’s what I did for the “Cheater 7b” article A 7 Billion Parameter Model that Beats GPT-4 on Julia Code? - Julia Community 🟣)

… ?

To filter it:

  • I have a working prototype for an observability platform (to review a lot of serialized conversations for easier filtering)
  • We could probably do some clustering etc to understand the themes and if we’re too biased (LLMTextAnalysis.jl can help)

To fine-tune it:

  • It requires a JSONL file in ShareGPT format, which is now a schema in PromptingTools. You can save a vector of vectors (of conversations) with save_conversations – a rough sketch follows below.
  • we can use the scripts for Axolotl. It’s very easy once you have the data!
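A rough sketch of the export step, assuming the ShareGPT schema and save_conversations in PromptingTools (the argument order is from memory – double-check against the docs):

```julia
using PromptingTools
const PT = PromptingTools

# A toy "conversation": in practice these would be the auto-logged AIHelpMe/PromptingTools chats.
conversation = [
    PT.SystemMessage("You are a helpful Julia coding assistant."),
    PT.UserMessage("How do I sum the columns of a matrix?"),
    PT.AIMessage("Use `sum(A; dims=1)`, which returns a 1×n row of column sums."),
]

conversations = [conversation]   # a vector of conversations
PT.save_conversations(PT.ShareGPTSchema(), "julia_instruct.jsonl", conversations)
```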

In terms of next steps, is there an appetite for me to prepare the code snippets on how to auto-log your PromptingTools/AIHelpMe conversations?

2 Likes

Yes – I think we’d all appreciate that! I’d love to have a bullet point list of small things I can do to help out.

It would also be cool to download the Discourse data, but we’d need an admin. Also, I’m not sure about the ethics of this idea.

Here is a quick blog post with code snippets how to achieve the auto-saving in PromptingTools: Automatically Saving Conversations with PromptingTools.jl and AIHelpMe.jl - Julia Community 🟣

Once set up, you call everything the same way, but each conversation gets saved in a global folder you picked.

Let me know if you have any questions / if it doesn’t work!

1 Like

You need to choose, at a minimum, which model to finetune, and Llama 3 is outdated already; (I’ve only scanned the Slack thread to that point) this suggestion was good:

Maybe this is useful ? bigcode/starcoderbase · Hugging Face

Arctic LLM seems best now (the base model was updated 2 hours ago) and/or Phi-3 (for a small one, also new) would now be on my short-list. Also WaveCoder (and its paper, for its “LLM-based Generator-Discriminator data process framework to generate diverse, high-quality instruction data from open source code”), and the hybrid Mamba/Transformer ajibawa-2023/Code-Jamba-v0.1 · Hugging Face, still the only “Julia”-tagged model on HF:

It’s a question whether it is better to start with a larger model (presumably better, but not always), or whether they learn more slowly when fine-tuning, i.e. because of the size. You have potential “catastrophic forgetting” with fine-tuning (for out-of-distribution data, so start with something non-awful for Julia?), though maybe forgetting the other languages, Python etc., is not a bad thing…

One other thing that might rule out models is that the tokenizer is fixed – some recent ones have about 30,000 possible tokens – and I’m wondering: do they include the Unicode options Julia has? We want all the math operators supported/supportable…

In what sense?

Model variants are released on a many-times-daily basis, so there’s no catching the wave that way.

Selecting a base model is premature in any case. The work going into a project like this is in gathering and preparing the data set, once that’s collected it can be applied to any number of models. Just pick a big one and a small one which look promising when the data is ready, and spin them up.

5 Likes

I doubt Arctic LLM would be a good choice here. Generally, I’d not recommend continuing training (or fine-tuning) with an MoE architecture given the small size of our raw dataset (and instruction dataset), especially considering that Arctic only selects the top 2 among 128 experts (much sparser than other similar models).

It really depends on the size of dataset we have. I’d recommend starting from a small-size model (<10B).

Agree.

Actually this is a very important distinction. I think we need more instruction data compared to raw data.

2 Likes

See e.g. Figure 1 and 2 and Table 1:

Llama 3 70B has comparable performance for “Enterprise intelligence” (basically coding) but takes over 16x longer to train (and thus Arctic is likely also similarly faster to fine-tune), because Arctic isn’t simply MoE – it’s a different/better such architecture. But it’s not Mamba or such a hybrid; I’m not sure Arctic has converged on the best architecture yet. A Mamba hybrid could do what it does (add dense to MoE); I think Arctic may be more about the good training data.

The high training efficiency of Arctic also means that Snowflake customers and the AI community at large can train custom models in a much more affordable way.

As seen in Figure 1, Arctic is on par or better than both LLAMA 3 8B and LLAMA 2 70B on enterprise metrics, while using less than ½ of the training compute budget. Similarly, despite using 17x less compute budget, Arctic is on par with Llama3 70B in enterprise metrics like Coding (HumanEval+ & MBPP+), SQL (Spider) and Instruction Following (IFEval). It does so while remaining competitive on overall performance. For example, despite using 7x less compute than DBRX it remains competitive on Language Understanding and Reasoning (a collection of 11 metrics) while being better in Math (GSM8K). For a detailed breakdown of results by individual benchmark, see the Metrics section.

Table 1 […] The training compute is proportional to the product of active parameters and training tokens.

  1. Architecture and System Co-design: Training vanilla MoE architecture with a large number of experts is very inefficient even on the most powerful AI training hardware due to high all-to-all communication overhead among experts. However, it is possible to hide this overhead if the communication can be overlapped with computation.

Our second insight is that combining a dense transformer with a residual MoE component (Fig 2) in the Arctic architecture enables our training system to achieve good training efficiency via communication computation overlap, hiding a big portion of the communication overhead.

  1. Enterprise-Focused Data Curriculum: Excelling at enterprise metrics like Code Generation and SQL requires a vastly different data curriculum than training models for generic metrics. Over hundreds of small-scale ablations, we learned that generic skills like common sense reasoning can be learned in the beginning, while more complex metrics like coding, math and SQL can be learned effectively towards the latter part of the training. One can draw analogies to human life and education, where we acquire capabilities from simpler to harder. As such, Arctic was trained with a three-stage curriculum each with a different data composition focusing on generic skills in the first phase (1T Tokens), and enterprise-focused skills in the latter two phases (1.5T and 1T tokens).

The “Code & SQL” part grows to 26.71% of the training data in that 3rd phase, i.e. of 1T tokens (those are part of the full 3.5T tokens across the three phases).

They don’t say so, but to me it feels like that 3rd phase is the University/college phase… :slight_smile:

It really depends on the size of dataset we have. I’d recommend starting from a small-size model (<10B).

Then Phi-3 might be fine, but while Arctic has 480B MoE parameters, it’s only 10B dense plus 2*3.66B MoE = 17B active parameters, and training (and inference, I believe) scale with that. I believe it feels mostly like 17B; or even a 10B(?) dense model.

One other thing about the dataset – ours, and what models are already trained on. Julia is a 1-based (and column-major) language, and I think we want more of that, i.e. Julia, in the training data, but it is maybe ok to add R, MATLAB, Fortran etc. in our fine-tuning for that reason – just not languages that differ much, like Python, Rust, C/C++. [We also want code with e.g. @view in the training data.] We want the model to really understand the difference between Julia (1-based, column-major) and Python (0-based, row-major NumPy).
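To make the convention mismatch concrete, a toy illustration (assuming the usual Julia layout of d features × n tokens, with queries/keys stored as columns):

```julia
# Row-major vs column-major attention scores.
d, n = 8, 4
Q = randn(d, n)      # each column is a query vector
K = randn(d, n)      # each column is a key vector
scores = K' * Q      # n×n attention scores; the row-major NumPy/PyTorch habit is Q @ K.T
@assert size(scores) == (n, n)
```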

What the experts in an MoE learn is (hypothetically) more syntax-based and superficial – I don’t know, maybe also indexing (and column-major order?) in an LLM heavily biased towards programming. When you train on different domains, e.g. multilingual, what you want is for the model to learn all the (natural) languages but not confuse them, i.e. the concepts behind the words/nouns, not just syntax and grammar. With English usually dominant it works well; I also see that Icelandic reading comprehension is great (less good at writing it, though getting better, despite it surely still being a tiny fraction of the training data).

What happens with the Chinese Qi is that it’s mostly Chinese but only bilingual – English is, I think, 50% or at least a high fraction – and I hear it confuses the two, i.e. sometimes switches mid-sentence. I think for a programming model we don’t want that for e.g. 1-based vs 0-based, so what fraction should we have – no 0-based at all? We likely get away with only English, or multilingual, but then avoid bilingual.

[@ChrisRackauckas Regarding specific domains, not just programming (or natural) languages: maybe add e.g. differential-equation math (e.g. Julia code using the packages, not just the packages themselves or their docs) or some other domain-specific [math], even Julia code for web programming…]

Llama 3 is not MoE-based, and I thought the trend was towards that (presumably explained by when they started training, or influenced by Llama 2 decisions), or towards a hybrid of some kind, such as dense-MoE as with Arctic, or Mamba/Transformer. While Llama 3 is good, it’s a pure transformer (I believe), which is why I consider it outdated. It’s not just about best metrics – the 8B model is beaten at least, the largest one not always (by other open-source models). We aren’t just choosing based on metrics, but on what they could be, i.e. from where Julia stands currently.

You’re right, it’s some work to do that, and maybe to curate the training data (allowing all of Julia Discourse should be ok; and Slack etc.? And most/all packages), or to generate some. I think we could start quickly, just by trying all the General registry code minus GPL, if we want that. I’m not sure how much of us is in there vs what’s already in trained models. It’s also not just about quantity, rather quality.

For sure exclude that, i.e. require a Project.toml file. There’s not much pre-1.0 code around, at least not (none?) in the General registry. Where else would we get it – unregistered packages? But at least don’t contaminate with old code. I think we don’t need to draw the line at 1.6, since older code should still work…

Yeah, but my point here is, (based on my experience) MoE is more data-hungry than dense models. This means that we need more data to steer MoE-like models in the desired direction. So if we want to evaluate the dataset collected here in an economical way, I’d prefer starting from a dense model with a similar number of active parameters :wink:

This is a very good point. I believe this can be addressed with LR re-warming, re-decaying and replay. See more details at https://arxiv.org/pdf/2403.08763. But unfortunately, today’s LLMs are not that open source yet… Only very few release the datasets, training scripts, etc.

2 Likes

Does anyone have use cases where GitHub Copilot doesn’t do a good job?

That’s an intriguing paper (2403.08763):

the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data

This “replay of previous data” is problematic for us (and I also see another problem expanding on that idea, if we want growable models). We might not have the original training data in full (or even in part). It’s likely we don’t need it (in full), only data with a similar distribution, but I don’t know for a fact that it’s enough:

To simplify, let’s say the model is trained on 0-based Python and we want to add (more) 1-based Julia. We can do that with fine-tuning, but if we do it too much or too often it results in “catastrophic forgetting” of Python (might be a good thing…), but also of general English knowledge (bad). So we want to mix “as little as 1%” of Python (and English) into our 99% Julia, to stop the forgetting.
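A toy sketch of what that mixing could look like on the data side (the 1% figure is just the paper’s phrase, not a tuned number; the pools and batch size are illustrative):

```julia
using Random

# Mix a small replay fraction of "old" (Python/English) samples into each otherwise-Julia
# batch to limit catastrophic forgetting.
function mixed_batch(julia_pool::Vector{String}, replay_pool::Vector{String};
                     batchsize::Int = 64, replay_frac::Float64 = 0.01)
    n_replay = max(1, round(Int, replay_frac * batchsize))
    batch = vcat(rand(julia_pool, batchsize - n_replay), rand(replay_pool, n_replay))
    return shuffle!(batch)
end
```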

I read a paper on growable (MoE) models recently. [The human brain obviously needs to grow over time (it’s already rather large when born, and has started learning something before…), but obviously can’t grow forever; it tapers off and changes rather than continuing to grow. So we might not need growable models after all, except possibly if it helps in the beginning.] From memory, that MoE paper didn’t strictly grow the model or add new experts, but rather replaced a candidate expert with one “grown”. Another paper was on merging a lot, maybe most, of the models already available into another (I forget if it grew or not).

1 Like

Everyone take a look at this! I’ll try and get mine set up this weekend.

@tylerjthomas9 do you have a repo for the code that did this? I’d like to add the formatter permutations.

Here you go: GitHub - tylerjthomas9/JuliaCode: https://huggingface.co/datasets/tylerjthomas9/JuliaCode