Great, here’s the PR to format all the code. This should give us a ~4x sample size, though for some of the formatters the differences are going to be quite small. I also don’t filter the data down to unique samples, which we may want to do here.
An additional thing we could do on the formatting side is to generate permutations of all the arguments to format_text, along the lines of the sketch below:
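(A rough sketch, assuming JuliaFormatter’s format_text; the file name and the keywords enumerated here, indent, always_for_in, and whitespace_ops_in_indices, are just a small illustrative subset of the full option list.)

```julia
using JuliaFormatter  # provides format_text

src = read("example.jl", String)  # placeholder input file

# Enumerate a few keyword combinations; format_text has many more options
# (see ?format_text), so the real cross product would be much larger.
variants = String[]
for indent in (2, 4),
    always_for_in in (true, false),
    whitespace_ops_in_indices in (true, false)

    push!(variants, format_text(src; indent, always_for_in, whitespace_ops_in_indices))
end

unique!(variants)  # many combinations collapse to identical output
```

This would also feed naturally into the dedup step mentioned above, since some argument combinations won’t change the output at all.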
Regarding fine-tuning, it would be interesting to see if we could construct “backward prompts”, i.e., ask a high-quality language model to give us the questions that would generate a particular script or part of a script. Anyone have opinions there too?
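To make the idea concrete, here’s a minimal sketch of one way to do it with PromptingTools.jl’s aigenerate (just one option; any LLM client would work, and the prompt wording, file path, and model alias are placeholders):

```julia
using PromptingTools

script = read("some_training_script.jl", String)  # placeholder path

# “Backward prompt”: ask a capable model to invent the question
# that this script would be a natural answer to.
prompt = """
Below is a Julia script.

$(script)

Write the user question or task description for which this script would be
a natural answer. Reply with the question only.
"""

msg = aigenerate(prompt; model = "gpt4t")  # any capable model alias works
question = msg.content
```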
Just picking this up from having seen @cpfiffer post on Twitter. While it isn’t useful for any alignment training with RLHF/DPO/KTO yet, we have already released a pretraining-scale dataset which includes Julia as LLVM IR; here are the paper and the HuggingFace dataset:
Btw, if people are too lazy to save their REPL conversations manually, try using this GUI for LLM questions. It saves all conversations by default when you click “New chat”.
Disclaimer: I’m the author of the tool, but I do believe it can help since it requires no extra effort.
In the next few days or so, I’ll also publish a simple observability platform for LLM conversations (saved in JSON), so we can quickly review/filter/curate the saved conversations.
That’s a very intriguing idea and paper. I’ve so far only scanned it, and it seems to me that having register numbers in IR or assembly can obscure the meaning. I’m guessing LLVM IR was chosen since LLVM is a common backend.
Should @code_lowered, @code_typed, down to @code_native be run on the Julia training code and associated with it? Which of them would be best, if not all? It seems problematic that Julia is generic: there isn’t just one set of types to run on, and you get different answers for each set of types you pick. I guess any one (or more) valid set could be ok.
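For what it’s worth, here’s a tiny illustration of the “which types?” problem; everything past the lowered stage is specific to the concrete signature (the function and the types are picked arbitrarily):

```julia
using InteractiveUtils  # @code_lowered etc. (loaded automatically in the REPL)

# One generic method, several concrete signatures.
f(x, y) = x + 2y

@code_lowered f(1, 2)       # lowered IR: the same for every signature
@code_typed   f(1, 2)       # typed IR specialized for (Int, Int)
@code_typed   f(1.0, 2.0)   # different typed IR for (Float64, Float64)
@code_llvm    f(1, 2)       # LLVM IR, again per-signature
@code_native  f(1.0, 2.0)   # native assembly for (Float64, Float64)
```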
I was also thinking of another language. I thought stack-based Forth might be better (no register numbers), though I’m not up to speed on LLVM use for it, and there might not be lots of Forth code as training data, or many users, since it’s very old and people may not know of or care about it… but if you could compile Julia (or another language) to it, then it might be good and help the LLM see meaning in the (Julia) code.
I looked up what’s already being done with Forth, and I at least found some interesting papers:
I don’t know if or how the following observation is actionable in this project, but: we don’t want an LLM emitting code which assumes arr[1] is correct; it should be arr[begin] and so on, and zero(T) instead of 0.
The proper generic accessors are probably underutilized in a general Julia data set; it would be good to give code which does use them correctly some prominence.
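A small before/after pair of the kind of thing I mean (the function itself is made up; the point is only the accessor and neutral-element style):

```julia
# Style we'd rather the model not learn: hard-coded 1-based indices and
# a literal 0 that silently assumes the element type.
function mysum_bad(arr)
    s = 0
    for i in 1:length(arr)
        s += arr[i]
    end
    return s
end

# Generic style: works for OffsetArrays, views, and any element type.
function mysum_good(arr)
    s = zero(eltype(arr))
    for i in eachindex(arr)
        s += arr[i]
    end
    return s
end
```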
Ah, this is a good point; we would also want to add the patterns that the linter supports, such as using for i in axes(thing). Has anyone written anything that fixes these little things?
This applies to any model you use. It automatically logs the necessary information (API kwargs, prompt templates and their versions, etc.) and then saves it to disk (you can use the LOG_DIR env variable or provide your own path). For docs, see ?SaverSchema and ?TracerSchema.
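In case it helps, a minimal sketch of the setup as I understand it from those docstrings (the provider schema, model alias, and prompt are just placeholders; adjust for whichever model you use):

```julia
using PromptingTools
const PT = PromptingTools

# Wrap your provider schema in TracerSchema (collects metadata) and
# SaverSchema (writes the conversation to disk).
schema = PT.SaverSchema(PT.TracerSchema(PT.OpenAISchema()))

# ENV["LOG_DIR"] = "llm_logs"  # optional: where the saved conversations land

conv = aigenerate(schema, "In Julia, how do I sum a vector?";
                  model = "gpt4t", return_all = true)
```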