Fine-tuning an LLM for Julia, updates


This is a follow-up to this post on generative AI tasks for the Julia community: namely, producing a dataset of good Julia code and fine-tuning a model specifically for Julia.

There are a few bits and pieces we’re working on now.

  1. Preparing a high-quality dataset of raw Julia code samples.
  2. Preparing a high-quality instruction dataset, which maps user requests to LLM outputs.
  3. Fine-tuning a language model specifically for Julia.
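
The instruction dataset in step 2 would typically be stored as JSONL, one request/response pair per line, which is easy to stream and to upload to HuggingFace. A minimal sketch in Python (the field names here are an assumption for illustration, not a settled schema):

```python
import json

# One instruction-tuning record: a user request paired with the desired
# model output. Field names are hypothetical, not a decided-upon schema.
record = {
    "instruction": "Write a Julia function that returns the nth Fibonacci number.",
    "input": "",  # optional extra context; empty here
    "output": "fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)",
}

# JSONL: serialize one JSON object per line.
line = json.dumps(record)

# Round-trip to confirm the record survives serialization intact.
parsed = json.loads(line)
print(parsed["output"])
```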

The JuliaGenAI org had a meeting last week (notes from @findmyway ) to discuss next steps and sketch out a plan for who will work on what. The primary goal right now is simply to construct a dataset. The code for this lives in the JuliaCode repo here, started by @tylerjthomas9.

An ideal outcome here is that (a) foundation models simply get better for the community as others include the training dataset in their training runs, and (b) the community gets access to a Julia-specific model that can be run locally or hosted as a web service.

We’re still figuring out distribution. We may be able to have the client run the model in the browser, which would use users’ own machines rather than expensive cloud inference services.

I volunteered to spend more time on the raw code and on the instruction dataset, with the goal of putting a dataset on HuggingFace and consolidating the Julia community around a single, high-quality dataset for model trainers to use.

Creating a dataset involves a few things. We currently have (mostly) all of the code in the General registry formatted, so the raw code side is largely done. I may try to figure out a way to do some pruning to focus on high-quality codebases like SciML.
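
The pruning step could start as a crude heuristic pass over registry packages, keeping only those that clear some quality proxy (GitHub stars, active maintenance, test coverage). A sketch of the shape of that filter, where the package entries and star counts are entirely made up for illustration:

```python
# Hypothetical pruning pass: keep only packages whose quality proxy
# clears a threshold. The entries and numbers below are illustrative,
# not real registry data; a real pass would pull metadata from GitHub.
packages = [
    {"name": "GoodPkgA.jl", "stars": 2800},
    {"name": "AbandonedPkg.jl", "stars": 3},
    {"name": "GoodPkgB.jl", "stars": 4400},
]

MIN_STARS = 100  # arbitrary cutoff, would need tuning against the corpus

kept = [p["name"] for p in packages if p["stars"] >= MIN_STARS]
print(kept)
```

Stars are a blunt instrument, so in practice this would probably be one signal among several rather than the whole filter.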

Part of this requires additional data that are scattered across several places. A few sources I would like to have data dumps for:

  • The Discourse (this may be in Common Crawl)
  • Zulip
  • Rendered documentation, in markdown

Additionally, we can generate synthetic instruction data by asking a language model to produce questions and then answer its own questions, with human validation and testing. We may also do some code evaluation to make sure the code is compilable, runnable, and correct on recent Julia versions.

I’m taking the fine-tuning course that everyone’s going on about, and I figured this would be a good use case for me to practice fine-tuning. I have an assload of compute credits and this seems like a great place to put them.

Anyway – comments welcome!