An LLM fine-tuned for Julia, call for comments + help

Thinking some thoughts…

As Cameron pointed out, you can find all the scripts necessary in the LLM Leaderboard repo.

I’m happy to organize a call and walk people through the process.

Now that it’s set up, it’s pretty straightforward for people to fine-tune their own models (assuming good data is available). Fine-tuning takes less than an hour end-to-end and cost me roughly $0.50 on Jarvis.ai (by far the easiest setup for the GPU-poor like me).

When it comes to data collection, I’d say there are broadly two kinds:

  • RAW: source code, documentation, etc.
  • INSTRUCTIONS: conversations/chats/tasks

The former is typically used early in training the foundation model (huge volumes); the latter is used in instruction-tuning/RLHF (small, high-quality samples).

It’s a crude distinction, but I wanted to highlight the main differences.

Assuming the goal here is to use it for tasks/code generation/etc., we can focus on compiling a smaller, high-quality task dataset. It’s also much easier and cheaper 🙂
(If we’re training an auto-completion model, the raw stuff would be more relevant, but we would need different models altogether to be fast enough and practical…)

To collect it, we could:

  • record all your conversations in PromptingTools (see save_conversation, which serializes to JSON). You can even do it automatically with the new TracerSchema, which wraps your normal schema; overload its finalizer to always save your messages automatically!
  • record all questions asked in AIHelpMe (again, we can simply set up the “TracerSchema” to auto-record)
  • use filtered samples from the LLM Leaderboard (that’s what I did for the “Cheater 7b” article, “A 7 Billion Parameter Model that Beats GPT-4 on Julia Code?” on Julia Community)

… ?
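To make the auto-logging idea concrete, here is a minimal pure-Julia sketch of the pattern. Note this is only an illustration: in practice PromptingTools’ TracerSchema does this for you, and the function names below (`log_conversation`, `with_logging`) are hypothetical, not part of any package.

```julia
# A "message" is just (role, content) in this sketch.
const Message = Tuple{String,String}

# Hypothetical finalizer: serialize one conversation to a JSON-like file.
function log_conversation(dir::String, conversation::Vector{Message})
    path = joinpath(dir, "conversation_$(time_ns()).json")
    open(path, "w") do io
        println(io, "[")
        for (i, (role, content)) in enumerate(conversation)
            sep = i < length(conversation) ? "," : ""
            println(io, "  {\"role\": \"$role\", \"content\": \"$content\"}$sep")
        end
        println(io, "]")
    end
    return path
end

# Wrap any "ai call" so every exchange is recorded automatically --
# this mimics what the TracerSchema finalizer does behind the scenes.
function with_logging(f, dir::String, prompt::String)
    reply = f(prompt)  # e.g. your call to aigenerate
    log_conversation(dir, [("user", prompt), ("assistant", reply)])
    return reply
end
```

With a wrapper like this, every call lands on disk as a side effect, so dataset collection is zero extra effort for the user.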

To filter it:

  • I have a working prototype for an observability platform (to review a lot of serialized conversations for easier filtering)
  • We could probably do some clustering etc. to understand the themes and whether we’re too biased (LLMTextAnalysis.jl can help)

To fine-tune it:

  • It requires a JSONL file in ShareGPT format, which is now a schema in PromptingTools. You can save a vector of conversations (each itself a vector of messages) with save_conversations.
  • we can use the scripts for Axolotl. It’s very easy once you have the data!
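For anyone unfamiliar with the ShareGPT layout (one JSON object per line, with messages under a "conversations" key and "from" being "system"/"human"/"gpt"), here is a hand-rolled sketch. In real use you would just call `save_conversations(ShareGPTSchema(), "train.jsonl", conversations)` from PromptingTools; the helpers below (`json_escape`, `to_sharegpt_line`, `write_sharegpt`) are purely illustrative.

```julia
# Escape the characters that matter for simple string payloads.
json_escape(s::String) = replace(s, "\\" => "\\\\", "\"" => "\\\"", "\n" => "\\n")

# One conversation = a vector of (from, value) pairs,
# e.g. [("human", "question"), ("gpt", "answer")].
function to_sharegpt_line(conversation::Vector{Tuple{String,String}})
    entries = join(
        ("{\"from\": \"$(json_escape(f))\", \"value\": \"$(json_escape(v))\"}"
         for (f, v) in conversation),
        ", ")
    return "{\"conversations\": [$entries]}"
end

# JSONL: one conversation per line.
function write_sharegpt(path::String, conversations)
    open(path, "w") do io
        for conv in conversations
            println(io, to_sharegpt_line(conv))
        end
    end
end
```

Axolotl can then point its dataset config straight at the resulting `.jsonl` file.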

In terms of next steps, is there an appetite for me to prepare the code snippets on how to auto-log your PromptingTools/AIHelpMe conversations?
