An LLM fine-tuned for Julia, call for comments + help

Great, here’s the PR to format all the code. This should give us roughly a 4x sample size, though for some of the formatters the differences are going to be quite small. I also don’t filter the data down to unique samples, which maybe we want to do here.

An additional thing we could do on the formatting side is to generate permutations of all the arguments to format_text:

format_text(
    text::AbstractString;
    style::AbstractStyle = DefaultStyle(),
    indent::Int = 4,
    margin::Int = 92,
    always_for_in::Union{Bool,Nothing} = false,
    for_in_replacement::String = "in",
    whitespace_typedefs::Bool = false,
    whitespace_ops_in_indices::Bool = false,
    remove_extra_newlines::Bool = false,
    import_to_using::Bool = false,
    pipe_to_function_call::Bool = false,
    short_to_long_function_def::Bool = false,
    long_to_short_function_def::Bool = false,
    always_use_return::Bool = false,
    whitespace_in_kwargs::Bool = true,
    annotate_untyped_fields_with_any::Bool = true,
    format_docstrings::Bool = false,
    align_struct_field::Bool = false,
    align_conditional::Bool = false,
    align_assignment::Bool = false,
    align_pair_arrow::Bool = false,
    conditional_to_if = false,
    normalize_line_endings = "auto",
    align_matrix::Bool = false,
    trailing_comma::Bool = false,
    trailing_zero::Bool = true,
    indent_submodule::Bool = false,
    separate_kwargs_with_semicolon::Bool = false,
    surround_whereop_typeparameters::Bool = true,
    variable_call_indent::Vector{String} = [],
    short_circuit_to_if::Bool = false,
)::String

though I am not sure what people think about that.
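
To make that concrete, here is a minimal sketch of what sampling those keyword arguments could look like (exhaustively enumerating ~30 mostly Boolean options is 2^30+ combinations, so random sampling seems more practical). The option subset, the indent/margin choices, and the final unique() pass are all just illustrative assumptions:

using JuliaFormatter
using Random

# Sketch only: pick a random combination of (a subset of) the Boolean options,
# plus indent and margin, and return the formatted text together with the
# options that produced it.
const BOOL_OPTIONS = (:always_use_return, :import_to_using, :pipe_to_function_call,
                      :short_to_long_function_def, :whitespace_in_kwargs)

function random_format(text::AbstractString; rng = Random.default_rng())
    kwargs = Dict(opt => rand(rng, Bool) for opt in BOOL_OPTIONS)
    formatted = format_text(text;
                            indent = rand(rng, [2, 4]),
                            margin = rand(rng, [80, 92, 120]),
                            kwargs...)
    return formatted, kwargs
end

src = "foo(x)= [i^2 for i in 1:x] |> sum"
variants = unique(first(random_format(src)) for _ in 1:20)

Collecting the outputs through unique() would also take care of the duplicate-filtering point above, since many option combinations won’t change a given snippet at all.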

Regarding fine-tuning, it would be interesting to see if we could construct “backward prompts”, i.e. ask a high-quality language model to give us the questions that would generate a particular script or part of a script. Anyone have opinions there too?

Hey everyone,

Just picking this up after seeing @cpfiffer’s post on Twitter. While it isn’t useful for any alignment training with RLHF/DPO/KTO yet, we have already released a pretraining-scale dataset which includes Julia as LLVM IR; here are the paper and the HuggingFace dataset:

You can pull just that subsection if you want to use it to seed these efforts.

We are also going to release a version of Source Code → IR relatively soonish.

5 Likes

Yeah, that’s a common practice in creating evals. You can mimic the functionality in PromptingTools for that.

function: Reference for RAGTools | PromptingTools.jl

template: PromptingTools.jl

There is a blog post on Forem on how to use it for RAG. The tweaking needed to get it to work on source code etc. would be quite minimal.
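
For the “backward prompts” idea above, a minimal sketch with PromptingTools could look like the following; the prompt wording and the model alias are assumptions, not an official template:

using PromptingTools

# Hypothetical "backward prompt": ask a strong model to invent the question
# that a given Julia snippet would plausibly answer.
snippet = raw"""
function mean_and_std(xs)
    m = sum(xs) / length(xs)
    s = sqrt(sum(abs2, xs .- m) / (length(xs) - 1))
    return m, s
end
"""

msg = aigenerate("""
You are preparing instruction-tuning data for Julia.
Write the user question that the following code most plausibly answers.
Reply with the question only.

$snippet
"""; model = "gpt4t")  # "gpt4t" is an alias for GPT-4 Turbo; swap in whatever you have configured

println(msg.content)

Pairing msg.content with the original snippet then gives one (question, answer) example for the fine-tuning set.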

btw a quick tip - JSON doesn’t have multiline strings even when pretty-formatted, so it’s painful to read and edit.
BUT if you use VSCode (Cursor, …), you can install a few extensions to make it a breeze - “multiline string editor” (Multiline String Editor - Visual Studio Marketplace) and “JSON multiline viewer” (JSON multiline viewer - Visual Studio Marketplace).
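
To illustrate why that matters, here is a tiny example (using JSON3 purely as an assumed serializer) of what a saved conversation turn looks like once code is escaped into a single JSON string:

using JSON3

# Multi-line code collapses into one "\n"-escaped string in JSON, which is
# what makes saved conversations painful to read without such extensions.
turn = Dict("role" => "user", "content" => "function f(x)\n    x + 1\nend")
println(JSON3.write(turn))
# prints something like: {"content":"function f(x)\n    x + 1\nend","role":"user"}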

1 Like

Awesome!

Btw, if people are too lazy to save their REPL conversations manually, try using this GUI for LLM questions. It saves all conversations by default when you click “New chat”.

Disclaimer: I’m the author of the tool, but I do believe it can help since it requires no extra effort.

In the next few days or so, I’ll also publish a simple observability platform for LLM conversations (saved in JSON), so we can quickly review/filter/curate the saved conversations.

1 Like

That’s a very intriguing idea and paper. I’ve so far only scanned it, and it seems to me having register numbers in IR or assembly can obscure the meaning. I’m guessing LLVM IR was chosen since LLVM is a common backend.

Should @code_lowered, @code_typed, down to @code_native be run on the Julia training code and associated with it? Which of them would likely be best, if not all? It seems problematic that Julia is generic: there isn’t just one set of types to run on, and you get different answers for each different set. I guess any one (or more) valid set could be ok.
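
For reference, a minimal sketch of how the different levels could be captured as text for one method instance; the concrete signature (Tuple{Float64} below) has to be picked by hand, which is exactly the genericity problem mentioned above:

using InteractiveUtils

# Capture each representation as a string so it could be paired with the
# original source in a dataset. The function and signature are just examples.
f(x) = 2x + 1
sig = Tuple{Float64}

lowered = string(only(code_lowered(f, sig)))
typed   = string(only(code_typed(f, sig)))
llvm    = sprint(io -> code_llvm(io, f, sig))
native  = sprint(io -> code_native(io, f, sig))

println(llvm)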

I was also thinking of another language. I thought stack-based Forth might be better (no register numbers). I’m not up to speed on LLVM use for it, and there might not be lots of Forth code as training data, or many users, since it’s very old and people may not know of or care about it… but if you could compile Julia (or another language) to it, then it might be good and help the LLM see the meaning in the (Julia) code.

I looked up what’s already being done with Forth, and at least found some interesting papers:

A Neural Forth Abstract Machine

Neural Programmer-Interpreters
https://arxiv.org/pdf/1511.06279

I don’t know if or how the following observation is actionable in this project, but: we don’t want an LLM emitting code that assumes arr[1] is correct; it should be arr[begin] and so on, and zero(T) instead of 0.

The proper generic accessors are probably underutilized in a general Julia dataset; it would be good to give some prominence to code that does use them correctly.
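
As a made-up illustration of the contrast we’d want the model to learn:

# Fragile: assumes 1-based indexing and a Float64 element type.
function runsum_fragile(arr)
    total = 0.0
    for i in 1:length(arr)
        total += arr[i]
    end
    return total
end

# Generic: works for OffsetArrays and any numeric element type.
function runsum_generic(arr)
    total = zero(eltype(arr))
    for i in eachindex(arr)
        total += arr[i]
    end
    return total
end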

3 Likes

Ah, this is a good point. We would also want to add the things the linter supports, such as using for i in axes(thing). Has anyone written anything that fixes these little things?
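
As a very naive, regex-based sketch of the kind of “little fixer” meant here (a real tool should work on the parsed syntax tree, e.g. via JuliaSyntax, rather than on raw text):

# Rewrite the most common 1-based loop pattern to eachindex.
function fix_index_loops(src::AbstractString)
    return replace(src,
        r"for\s+(\w+)\s+in\s+1:length\((\w+)\)" => s"for \1 in eachindex(\2)")
end

fix_index_loops("for i in 1:length(xs)\n    s += xs[i]\nend")
# "for i in eachindex(xs)\n    s += xs[i]\nend"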

1 Like