Overview
There’s been some discussion in the Julia Slack about fine-tuning a model for Julia specifically, in part because the language does not have as large an existing code base as other, much older languages. There’s an existing Discourse post on it, but I wanted to make a separate thread to build a specific plan and hopefully gather some volunteers.
Goals
- Gather a high-quality Julia-specific training set.
- Publish the dataset to HuggingFace for use in future models trained by others.
- Produce a fine-tuned LLM specifically for Julia.
Specifics
Training data
- Retrieve all code from packages registered in General, with an optional stretch goal of including Julia code that lives in repositories but is not registered, provided it has tests and appears to work.
- Remove code whose license is incompatible with fine-tuning – we want to respect licensing. I’m not sure exactly which licenses qualify, but this seems to suggest Apache 2.0, MIT, and the BSD licenses should work. Repos without a license are an open question. (A rough sketch of the retrieval and license filtering follows this list.)
- Re-format every file with each JuliaFormatter style to increase the number of code samples; I’d imagine we should include the unformatted code as well. (A formatter sketch appears at the end of this section.)
- Remove code written for older versions of Julia, i.e. code produced prior to Julia 1.0 or maybe even 1.6. We could try a few cutoff points.
- We could also consider synthetic data, where we ask an LLM to generate code, evaluate it to check that it works the way we want, and include it in the training set if it does. There is some debate about whether this is a good idea.
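As a rough sketch of what the retrieval and license filtering could look like – assuming a local clone of JuliaRegistries/General with its usual layout (a top-level Registry.toml plus per-package Package.toml files) and a deliberately crude LICENSE-file check:

```julia
# Sketch: clone the packages listed in the General registry and keep only the
# permissively licensed ones. Assumes a local clone of JuliaRegistries/General
# with its usual layout (Registry.toml + <letter>/<Package>/Package.toml).
using TOML

const REGISTRY_DIR = "General"      # local clone of the registry repo
const OUT_DIR      = "julia_corpus"

# Crude license check: look for markers of permissive licenses in the LICENSE file.
const ALLOWED_MARKERS = ["MIT License", "Apache License", "BSD"]

function allowed_license(pkg_dir)
    for name in ("LICENSE", "LICENSE.md", "LICENSE.txt")
        path = joinpath(pkg_dir, name)
        isfile(path) || continue
        return any(occursin(m, read(path, String)) for m in ALLOWED_MARKERS)
    end
    return false    # no license file: skip for now (open question above)
end

function clone_registered_packages()
    mkpath(OUT_DIR)
    registry = TOML.parsefile(joinpath(REGISTRY_DIR, "Registry.toml"))
    for (_, entry) in registry["packages"]
        pkg  = TOML.parsefile(joinpath(REGISTRY_DIR, entry["path"], "Package.toml"))
        dest = joinpath(OUT_DIR, entry["name"])
        isdir(dest) && continue                               # already cloned
        run(`git clone --depth 1 $(pkg["repo"]) $dest`)
        allowed_license(dest) || rm(dest; recursive = true)   # drop incompatible licenses
    end
end
```

A real pass would want proper license detection (something like what licensee or scancode do) rather than keyword matching, but this is the general shape.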
EDIT: I also wonder whether simply getting more high-quality Julia code into public datasets would prompt the big providers to use it. They’re always data-hungry, and just tidying up the data could get us a long way.
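For the formatting-based augmentation bullet above, a minimal sketch with JuliaFormatter.jl; the particular style list here is just an assumption:

```julia
# Sketch: one sample per (file, style) pair, plus the original text.
using JuliaFormatter

const STYLES = (DefaultStyle(), BlueStyle(), YASStyle(), SciMLStyle())

function style_variants(path)
    original = read(path, String)
    variants = [original]                    # keep the unformatted code too
    for style in STYLES
        try
            push!(variants, format_text(original; style = style))
        catch
            # skip files the formatter cannot parse (e.g. pre-1.0 syntax)
        end
    end
    return unique(variants)                  # styles often agree; drop duplicates
end

# Usage: all variants of every .jl file under a package's src/ directory.
# samples = reduce(vcat, style_variants.(filter(endswith(".jl"), readdir("MyPkg/src"; join = true))))
```

Each file then yields up to one sample per style plus the original, which is the “num formats * num scripts” idea from the Slack thread below.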
Fine tuning
- There’s some discussion about whether we want a next-token model for code completion only (i.e. Copilot-type stuff) or an instruction-tuned model (like Claude, ChatGPT, etc.).
- Code completion is the easiest.
- Instruction tuning requires that we have explanations of what’s going on in the code, and it’s a little out of my wheelhouse. I’d love to hear some advice here! (A sketch of one possible record format is at the end of this section.)
- We can use Unsloth or HuggingFace, or probably any of the many other tools for this.
- Possibly provide an optional callback or telemetry function to AIHelpMe.jl to let users share high-quality instructions – we’d stick these in a database somewhere and use them for fine-tuning.
I believe @tylerjthomas9 has done some work here – I’d love to hear what kind of stuff you have.
Via @svilupp:
Re instruct dataset, yeah, other models are how everyone does it – you build a pipeline where you generate & curate to have a high-quality dataset (i.e., duplicative stuff won’t do as much). I think Meta even said that’s how they did Llama 3 (using Llama 2 70B for the dataset prep).
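To make the instruction-tuning side a bit more concrete, here’s a sketch of what a single curated record could look like, rendered as a ChatML-style conversation and written out as JSONL. The schema, the system prompt, and the JSON3 dependency are all assumptions – the real format would be whatever the fine-tuning tool expects:

```julia
# Sketch: write curated instruction/response pairs as JSONL, one ChatML-style
# conversation per line. The schema and system prompt are placeholders.
using JSON3

struct InstructSample
    instruction::String    # the user request
    response::String       # the curated, verified answer
end

function to_chatml(s::InstructSample)
    return """
    <|im_start|>system
    You are a helpful Julia programming assistant.<|im_end|>
    <|im_start|>user
    $(s.instruction)<|im_end|>
    <|im_start|>assistant
    $(s.response)<|im_end|>"""
end

samples = [InstructSample(
    "Write a function that normalizes the columns of a matrix.",
    "normalize_cols(A) = A ./ sqrt.(sum(abs2, A; dims = 1))",
)]

open("instruct.jsonl", "w") do io
    for s in samples
        JSON3.write(io, (text = to_chatml(s),))    # one JSON object per line
        println(io)
    end
end
```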
Tooling
- Users can serve local models to VS Code with this extension, or with LM Studio, Ollama, vLLM, etc. (A minimal sketch of querying a locally served model follows this list.)
- We could provide a cloud service, give people API keys, and maybe let them pay to use it. Depending on the model we train, this could be fairly inexpensive, but it’s a larger infrastructure discussion.
- Any other comments here?
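For the local-serving bullet, a minimal sketch of hitting an Ollama server’s generate endpoint from Julia; the endpoint and payload follow Ollama’s documented API, and the model name is just a placeholder:

```julia
# Sketch: request a completion from a locally served model via Ollama's HTTP API.
# Assumes `ollama serve` is running on the default port; the model name is a placeholder.
using HTTP, JSON3

function local_complete(prompt; model = "codellama", host = "http://localhost:11434")
    body = JSON3.write((model = model, prompt = prompt, stream = false))
    resp = HTTP.post("$host/api/generate", ["Content-Type" => "application/json"], body)
    return JSON3.read(resp.body).response    # the generated text
end

# println(local_complete("Write a Julia function that reverses each word in a string."))
```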
Next steps
- I want to hear from you all what you think!
- If we like the idea, I want to set up a community call every week or every other week to try and track who is doing what and what people need to move ahead.
Here’s the Slack thread for posterity.
Slack thread
Mateusz Kaduk 5 days ago
Are you aware of any “Copilot” that is Julia-optimized? For example, Copilot, and especially ChatGPT, sometimes seems to ignore 1-based indexing or column-major order, and often replicates the order of dimensions in an array as if it were PyTorch.
SixZero 5 days ago
Actually, I proposed an idea to fine-tune a model that learns from Discourse/Slack #helpdesk channel conversations, which could help with the 9 out of 10 repeat questions coming to the platforms, but it was not loved that much, so I guess back then the community was not really happy about LLMs.
SixZero 5 days ago
I looked into Slack’s Terms of Use and, according to that, I think saving messages for LLM training is actually allowed (I hope I am not missing any point in the terms of use).
Mateusz Kaduk 5 days ago
Can the GitHub repositories of all repos with Julia code be used? Or would the licenses not allow it?
SixZero 5 days ago
I think whatever is a public repo on GitHub with a free license could be used, I guess. But yeah, I might be too “open”-minded on this.
Mateusz Kaduk 5 days ago
Most repos should have a LICENSE file attached; that is standard.
SixZero 5 days ago
Yeah, we might separate out the ones which have no LICENSE, but I think that still leaves a super big amount of code to learn from.
SixZero 5 days ago
Imagine asking something on Slack and receiving an answer within 10 seconds in the specific channels we allow this bot to run in. (Not waiting an hour for someone to answer your question.)
SixZero 5 days ago
I wonder what accuracy we would get if it was only a RAG supported AI solution.
SixZero 5 days ago
Basically we would connect @svilupp’s AIHelpMe.jl package xD
Mateusz Kaduk 5 days ago
The major issue is column-major order and 1-based indexing for many standard algorithms, because they were implemented in another language. I am not sure if this can be completely solved with fine-tuning. I was thinking of something like JuliaCopilot, if it exists. I know ChatGPT allows creating “expert chats”, which I suppose is some sort of tuning. But they also make these mistakes.
SixZero 5 days ago
I am pretty happy with GPT-4 on this front, and some assistants can use GPT-4 too. Btw, I might be a little bit stupid somewhere here, because I don’t see why we are not utilizing LLMs more.
Mateusz Kaduk 5 days ago
I think it would be a cool project for the Julia community, putting all these APIs into practical use for Julia!
Cameron Pfiffer 5 days ago
I’m curious – does anyone have all the Julia source code? We could try fine-tuning one of the smaller models
SixZero 5 days ago
“all the Julia source code” hehe sounds funny
SixZero 5 days ago
I hope we don’t have to write some webscraper for this tho.
SixZero 5 days ago
Btw, as far as I remember, JuliaHub had some model which was just for finding packages for us, right? But probably that was RAG and not a fine-tune.
Cameron Pfiffer 5 days ago
Yeah it was RAG
Cameron Pfiffer 5 days ago
@aviks do you guys have a cache of Julia code somewhere?
Mateusz Kaduk 5 days ago
I meant the majority of Julia packages are hosted on GitHub.
Cameron Pfiffer 5 days ago
I am aware, I mean downloaded and stored somewhere.
Cameron Pfiffer 5 days ago
It’s a pain to go download them again if someone has already done so.
Avik Sengupta 5 days ago
We don’t have any dataset stashed, but I think I know someone who was working on curating something like this. I’ll try to see if they can chime in here.
Mateusz Kaduk 5 days ago
It would probably be curating that package registry (GitHub - JuliaRegistries/General: The official registry of general Julia packages) or having a script iterate over it and git clone.
SixZero 5 days ago
Correct me if I am wrong, but if we need the dataset, we just need the list of repo URLs, and then any script can download them.
- So first get the list of packages we need. (I don’t know if all Julia packages would make sense, or just the most popular ones.)
- Download them.
- We need a way to process the Julia package format, like going over all files in the src directory, and the docs/test folders too? (See the sketch below.)
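A minimal sketch of what that third step could look like, assuming the conventional package layout of src/, docs/, and test/ directories:

```julia
# Sketch of step 3: walk a downloaded package and collect text by kind.
function collect_package_files(pkg_dir)
    sources, docs, tests = String[], String[], String[]
    for (root, _, files) in walkdir(pkg_dir)
        for f in files
            path = joinpath(root, f)
            if endswith(f, ".jl")
                push!(occursin("test", root) ? tests : sources, read(path, String))
            elseif endswith(f, ".md")
                push!(docs, read(path, String))    # README, docs/src pages, etc.
            end
        end
    end
    return (; sources, docs, tests)
end
```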
Tyler Thomas 5 days ago
I have started working on this. I have a dataset that has all packages in the General registry (plus Julia Base code) and any documentation that I could pull. I’d love to collaborate with anyone else interested in this.
SixZero 5 days ago
I would ask whether we have some good idea of what the 3rd step could look like – is there a package where we can define a protocol for how to use a specific dataset, or something? I mean, what if there were a system that could automatically annotate the textual data in some formats? I wonder if it makes sense, but:
- I think the training data should benefit from any annotation that can be associated with it. So each file should get an annotation of which file it is and which package it belongs to…
- I would guess package popularity would also be a somewhat good label for the data.
- What else could we come up with?
I don’t know if I am right here, but tell me if you guys think this is not how it’s done.
Tyler Thomas 5 days ago
One side benefit of creating and maintaining a dataset would be that future foundation models would be more likely to have newer Julia data. As for your annotation data, we could do this with some chat template like ChatML to make inference easier. I have also wondered: if we formatted all the code to one style guide, would the downstream model be better at producing code that executes without syntax errors?
SixZero 5 days ago
What do you mean by formatting all the code? Which formatting did you mean?
Cameron Pfiffer 5 days ago
i.e. bluestyle or something
Tyler Thomas 5 days ago
All blue style or some other style from JuliaFormatter.jl
Cameron Pfiffer 5 days ago
I think it’s a good idea
Cameron Pfiffer 5 days ago
Then recs from the model would have consistent formatting
SixZero 5 days ago
I think it should handle all the formats people use in real life. But if we need more data, then it is a good way to augment the dataset.
Cameron Pfiffer 5 days ago
Oooo I see what you mean
Cameron Pfiffer 5 days ago
Like fork all the files and format them all
Cameron Pfiffer 5 days ago
Then num formats * num scripts = samples
Tyler Thomas 5 days ago
That’s an interesting idea. I’ll make my initial version of the dataset public this weekend in case anyone wants to experiment.
Cameron Pfiffer 5 days ago
I’ll tinker. Never done any fine tuning before but I’d love to fuck around
SixZero 5 days ago
Same.
Tyler Thomas 5 days ago
I’d be happy to jump on a call and walk you through what I’ve done too
SixZero 5 days ago
I would want to have a good solution for that 3rd step, to create a way to make the data processable for different formats: like “packages”, but if a package directory has .html files in it, then that part would be used as HTML data… and so on…
Mateusz Kaduk 5 days ago
Aren’t there standardized pipelines and tokenizers for things like code that handle most of that stuff? I have no clue, but I would imagine there are?
SixZero 5 days ago
But I am a little bit afraid that Julia would not perform that well at text-processing tasks xD
SixZero 5 days ago
I would love to know, because then we wouldn’t need to work this out xD
Tyler Thomas 5 days ago
We can tokenize any document in the GitHub repos and add it to the training data. I did this in the first pass with the markdown files like README.md
SixZero 5 days ago
This was the talk on something similar to this: Slack
Eventually not just textual but images, sound and video formats could be handled by this.
SixZero 5 days ago
Yeah, tokenizing from text is straightforward I guess, but how do we order the training data (or context)? I might be going crazy, but I would want a ready-made solution for this.
Cameron Pfiffer 5 days ago
probably this: Fine-tune a pretrained model
SixZero 5 days ago
I heard from this guy that this solution uses 40% less VRAM? And there are some efficiency improvements?
Google Colab
Tyler Thomas 5 days ago
Unsloth is great for single-GPU fine-tuning (especially with QLoRA), which is probably all we’d need for this.
Cameron Pfiffer 5 days ago
oh that’s sick
Mateusz Kaduk 5 days ago
Maybe this is useful? https://huggingface.co/bigcode/starcoderbase
Mateusz Kaduk 5 days ago
BigCode
As part of the BigCode project, we released and will maintain The Stack, a 6.4 TB dataset of permissively licensed source code in 358 programming languages, along with a collection of datasets created through the course of research during the project.
v1.0: Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.
Mateusz Kaduk 5 days ago
There is Julia there as well
Tyler Thomas 5 days ago
I went through a couple of them, and there was less Julia code than in my initial pull of the General registry code. I think it would be good to combine in the non-package code. Also, a lot of the code is older, so newer Julia packages and features will be missing.
Mateusz Kaduk 5 days ago
Yeah, I suspect it’s probably a matter of setting this up with the right data, but maybe a good starting point.
Mateusz Kaduk 5 days ago
@Tyler Thomas regarding old code, that is a valid point. I think deprecated stuff should be excluded, as it just leads to the model giving bad advice about functions that do not even exist anymore.
Jan Siml 5 days ago
I love the energy! My 2c:
- what is the goal of the finetuning project? do you have some specific workflows or tasks that you’re targeting?
- are you planning to simply train on raw code? or use the raw code to create synthetic instruction dataset?
I have very limited expertise here, but I’ve been following a few fine-tunes for multilingual models (e.g., German embeddings). In the end, the biggest lift was the curation of the instruct dataset. The tasks/goal are crucial, because they will help focus the data curation + measure the progress. A “generic” model might be death by a thousand papercuts, at least at the start. As for the data, I presume we wouldn’t want to build a foundation model by doing next-token prediction on source code (anyone have a spare $1M?). The most common (and probably the only feasible) path is to fine-tune, in which case we wouldn’t be pumping in raw script files, but rather highly curated conversations that reflect the behaviors/tasks/etc. we want.
In my understanding, forcing in all the source code would just undo a lot of the instruction tuning, and we wouldn’t have the money (or expertise) to fix it. In terms of the fine-tuning itself, that’s easy (especially at first, with things like LoRA). I have good experience with Axolotl and can share the setup and all the scripts to ultimately distil it into GGUF. The GPU compute was trivial and cheap on Jarvis.ai – I think I paid like half a dollar for a mini tuning job. But in general, once we have a good dataset, we can first test it a bit with some commercial fine-tuning service (OpenAI, Together, etc.) to see quickly how well it works without having to fiddle with loss curves and GPU utilization.
Cameron Pfiffer 5 days ago
- The goal is to have a Julia-optimized code model. Next-token is probably better here for now than instruction tuned, for exactly the reason you outlined.
- We would not do a new foundation model, that is silly – pick one of the open source code models and fine tune that.
It is a good question though – how do we get instruct data?
Cameron Pfiffer 5 days ago
Hm, if only the Slack history were not completely hidden…
Tyler Thomas 5 days ago
I think that we would want to fine-tune (with LoRA/QLoRA/…) several models. One with the fill-in-the-middle objective on raw Julia code – this would be best for a Copilot replacement that is hopefully better at Julia. The second would be the instruct model that you are talking about; I’m not sure what the best way is to get enough high-quality examples. I could also see us wanting to fine-tune a model on next-token prediction and then instruct-tune it.
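A sketch of how a fill-in-the-middle sample could be built from a raw source file, using the StarCoder-style sentinel tokens mentioned above (the tokens and the line-based split are assumptions; they would depend on the base model):

```julia
# Sketch: turn one source file into a fill-in-the-middle training sample using
# StarCoder-style sentinel tokens (an assumption; they depend on the base model).
function fim_sample(code::AbstractString)
    lines = split(code, '\n')
    length(lines) < 3 && return nothing                 # too short to split usefully
    i, j = minmax(rand(2:length(lines)), rand(2:length(lines)))
    prefix = join(lines[1:i-1], '\n')                   # visible context before the hole
    middle = join(lines[i:j], '\n')                     # the span the model must fill in
    suffix = join(lines[j+1:end], '\n')                 # visible context after the hole
    return "<fim_prefix>$prefix<fim_suffix>$suffix<fim_middle>$middle"
end
```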
Tyler Thomas 5 days ago
Maybe we could use Llama 70B or an API to automatically generate instructions. This could at least give us a baseline.
Cameron Pfiffer 5 days ago
I do actually think synthetic data here would be kind of cool
Cameron Pfiffer 5 days ago
We could also have a benchmarking/evaluation agent to filter down bad examples that are slow or don’t run
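A rough sketch of such a filter – this version only checks that a generated snippet parses and runs in a throwaway module; speed checks would need more machinery:

```julia
# Sketch: keep only generated snippets that parse and execute without error.
function runs_cleanly(code::AbstractString)
    try
        include_string(Module(), code)    # evaluate in a fresh, throwaway module
        return true
    catch
        return false                      # syntax error, MethodError, etc.
    end
end

# good = filter(runs_cleanly, candidates)   # `candidates`: a Vector{String} of generated code
```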
Jan Siml 5 days ago
I think compiling the instructions for some possible tasks is the most useful thing to start with. First, you can use them as-is for fine-tuning and get good results. Second, if you do NTP, then you’ll have to do instruction tuning anyway to still have a useful model (unless you’re building a code-completion engine). My push is to focus, because we have neither the experience nor the resources, so we should start like everyone does and follow the tried and tested recipe.
(And, arguably, we’re really just trying to tweak the style/syntax towards Julia… adding knowledge via fine-tuning is usually discouraged.) Re instruct dataset, yeah, other models are how everyone does it – you build a pipeline where you generate & curate to have a high-quality dataset (i.e., duplicative stuff won’t do as much). I think Meta even said that’s how they did Llama 3 (using Llama 2 70B for the dataset prep).
Jan Siml 5 days ago
Re evaluation and filtering, we would have some functionality and infrastructure already in place from the Leaderboard.
Cameron Pfiffer 5 days ago
Sounds like AIHelpMe.jl is probably a great place to start for generating instruction tuned samples?
Jan Siml 5 days ago
Ideally, people would just use it and SHARE them
Jan Siml 5 days ago
that’s all we need
Jan Siml 5 days ago
for diversity
Cameron Pfiffer 5 days ago
Could add a telemetry option as well
Cameron Pfiffer 5 days ago
I.e. do you want to automatically submit this
Jan Siml 5 days ago
for quantity, we can generate higher quality faster but it would be too sterile (I suspect)
Jan Siml 5 days ago
Yeah, I’ve recently added callbacks, so we could use it for that. Plus we can now serialize the full RAGResult, including context, into JSON.
Cameron Pfiffer 5 days ago
This all sounds very reasonable to me. I think my proposal for a next step would be to put up an RFC on the Discourse about what we’re thinking and how best to do it. I’d love to have some kind of Julia bot that’s a little more up-to-date and syntactically aware.
Mateusz Kaduk 4 days ago
A benchmark showing how bad Copilot is at that, versus how a fine-tuned model fixes it, would be good to have.
Mateusz Kaduk 4 days ago
Maybe people could contribute problems they have with Copilot when coding in Julia? For example, dot-product attention is QK' in row-major languages like Python, but for Julia it should be all transposed, as it is a column-major language, so K'Q instead.