I still haven’t really used LLMs to help me write code.
One of my internal arguments was that, since I code mostly in Julia, an LLM would not be half as good at assisting me as it is with more popular languages, given the smaller training dataset.
Well, this might not be the case.
The following study evaluates the capacity of ChatGPT across several tasks for 10 different programming languages.
Julia, despite admittedly having a smaller dataset, makes it to the top on some of the significant metrics.
(Even more so given that all the training data is limited to 2021 and earlier.)
Of course, the results and the metrics show only part of the truth, and no absolute statements can be made. Thus, everything is a bit speculative.
That said, I would discourage people from using it as an advertisement, as we cannot really comprehend what’s going on underneath and it could be a local maximum.
For example, it is still unclear to me whether the correctness of the generated code is properly tested, and that can make all the difference.
But I found it nonetheless quite a success for Julia, given the reduced dataset size.
Certainly, having semantics that LLMs can easily work with was not a design requirement, but I like to think that it might have been achieved by accident.
I’ve been using GitHub Copilot (based on OpenAI’s Codex model, a relative of the models behind ChatGPT) for Julia in VS Code, and I’m frequently stunned at how good it is. One thing I’ve been surprised at is how well it reproduces the unicode that I use so much throughout my Julia code, including coming up with new combinations that I would use but hadn’t yet, like ð̄²α₀. Another is how well it does with math. I recall one time I was writing a simple test involving a Hessian where the entries should just be constants, and Copilot autocompleted the explicit matrix, even though I had never entered an explicit Hessian anywhere in my code base. It was actually missing one row, but otherwise most of the entries were correct. And it even named the Hessian with a cool unicode variable name like ∂ᵢⱼf or something.
I also did an experiment with my wife, who doesn’t really code but needed to do some novel data analysis on some genetics data. I had her step through the analysis, writing comments describing what the code should do, then enter a newline and wait a few seconds until Copilot popped up the code. It wasn’t perfect, but it could get close enough that my wife could mostly fix the obvious errors. (I had to step in a few times to fix up some more subtle ones.) As part of the experiment, we did the same analysis in both Julia and Python, and Copilot did well in both cases, even with obscure manipulations in pandas, CSV.jl, and DataFrames.jl. I can’t honestly say that it was better with Julia than with Python, but it does very well. (It is even a little disconcertingly good at writing up results in LaTeX.)
It’s still very much a situation where you must always check for correctness, but it can give you some good ideas and fill in a lot of boring boilerplate.
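As an illustration of that kind of check, here is a minimal sketch of how a suggested constant Hessian could be verified against a finite-difference approximation before trusting it. It is written in plain Python purely for illustration; the function `f`, the helper names, and the candidate matrices are all hypothetical and not taken from the code base described above.

```python
def f(x):
    # Hypothetical quadratic test function; its Hessian is constant:
    # [[4, 3], [3, 1]]
    return 2.0 * x[0] ** 2 + 3.0 * x[0] * x[1] + 0.5 * x[1] ** 2


def numerical_hessian(func, x, h=1e-3):
    """Central finite-difference approximation of the Hessian of func at x."""
    n = len(x)

    def shifted(i, si, j, sj):
        # Copy x and move coordinates i and j by si*h and sj*h.
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return y

    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            H[i][j] = (
                func(shifted(i, 1, j, 1))
                - func(shifted(i, 1, j, -1))
                - func(shifted(i, -1, j, 1))
                + func(shifted(i, -1, j, -1))
            ) / (4 * h * h)
    return H


def matches(candidate, func, x, tol=1e-6):
    """Return True if candidate has the right shape and entries for func at x."""
    reference = numerical_hessian(func, x)
    if len(candidate) != len(reference):
        return False  # e.g. a suggestion that is missing a row
    for row_c, row_r in zip(candidate, reference):
        if len(row_c) != len(row_r):
            return False
        if any(abs(c - r) > tol for c, r in zip(row_c, row_r)):
            return False
    return True


if __name__ == "__main__":
    point = [0.3, -1.2]
    print(matches([[4.0, 3.0], [3.0, 1.0]], f, point))  # complete suggestion
    print(matches([[4.0, 3.0]], f, point))              # suggestion missing a row
```

A shape check like the one in `matches` catches the "missing one row" kind of error immediately, while the entry-by-entry comparison catches values that are merely plausible.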
Really not surprising. I had been using ChatGPT for translating some examples back and forth for the development of diffeqpy, along with using it as a starting point for translating differential equation models back and forth for benchmarking. The generated Julia code is noticeably better than the generated Python code in terms of probability of correctness, not having a syntax error, and not being “just wrong”. And interestingly, it made the same mistakes I always saw students making with Python. That of course is just a quick eye test (I didn’t record data and it was just in a single domain), but it’s interesting to see that there’s now research that confirms this notion. I summarized this in a quick blog post: