Is Julia Falling Behind in Relevance? (because it's not used in LLM research?)

Not sure what anybody can do about this. I can’t find extensive details on what the big LLMs are written in, but apparently ChatGPT is built on PyTorch and TensorFlow, both Python frontends over C++. That decision makes sense: you’d want proven, popular tools to get things done. But even assuming the same could be accomplished with Flux or whatever, you’d need the resources of a corporation to make an LLM good enough for consumers (who don’t need to care about the programming at all, as someone else pointed out). I don’t think any corporation can be convinced to rewrite its codebase now, especially when Julia is still improving basic features like precompilation. That doesn’t mean nobody will use Julia for machine learning (they do); it just means that the language itself isn’t the deciding factor for prominence; money is.

I think that if Julia developers fail to make a concerted effort to catch up in this area

Who are these “Julia developers” exactly, and what do you expect them to do? There’s nothing about the Julia language as such that would make it unsuitable for LLM research. The problem is that LLM research is currently driven by a handful of corporate research labs (Google, Facebook, OpenAI), and those labs have legacy codebases in C++ with public Python frontends. So if someone with comparable resources started to use Julia, you’d certainly see Julia “catch up”. Would I like to see that? Sure! But you can’t exactly force anyone to switch to Julia. Would I like Julia Computing or the core developers to advocate for Julia’s virtues in the LLM space? Definitely, but it’s not really up to them.

I’d also think that in an academic environment, for someone starting “from scratch” and with a small team, Julia could be a highly productive language for LLM research. But again, it’s up to individuals involved in LLM research to start using Julia and to create and maintain libraries in this space. Even then, though, anything coming out of academia will likely not be “competitive” with heavily funded corporate research.

So, if you like Julia and do LLM research, start contributing to / maintaining the Julia packages in this space, and advocate Julia to your colleagues.

15 Likes

Quality does matter though. See JAX vs Julia (vs PyTorch) · Patrick Kidger; he chose Python over Julia for quality reasons.

3 Likes

I think the future will see LLMs with specialist ‘knowledge’ (for lack of a better word). These will use the larger pre-trained models (trained in Python, etc.) as ‘base’ models, which will be ‘tilted’ towards specialist applications and priorities. My view is that if Julia is not capable in this area, it will slowly be relegated to being a ‘niche’ player.

The criticisms in that blogpost are all valid, but you’re missing the implication that money affects quality, too. JAX is a Google project, Flux is a community project. JAX’s listed contributors outnumber Flux’s, 505 to 213. On top of that, JAX’s contributors are paid by one corporation to focus on JAX, whereas Flux’s contributors are scattered across organizations where Flux isn’t often their main job. I’m not surprised that a user would run into fewer issues when there are more developers dedicated to finding and fixing those issues.

1 Like

I don’t really understand what you’re talking about here, and it seems like nobody else has addressed it so far either. AFAIK people just use the same tool with custom datasets to fine-tune a pre-trained model; I’ve never heard of anyone using a different tool and language to do that.
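For what it’s worth, nothing in principle stops that workflow from being done in Julia either. Here’s a minimal Flux sketch of the idea, where `pretrained` and `head` are purely illustrative placeholders (not from any real package or model); only the small head is trained on the custom data while the base stays fixed:

```julia
using Flux  # assumes Flux.jl is installed

# Placeholder "base" model; in practice you'd load real pre-trained weights.
pretrained = Chain(Dense(128 => 64, relu), Dense(64 => 32, relu))
head = Dense(32 => 2)            # new task-specific layer to be trained

# Toy custom dataset: 100 samples, 128 features, 2 classes.
X = rand(Float32, 128, 100)
Y = Flux.onehotbatch(rand(1:2, 100), 1:2)

opt = Flux.setup(Adam(1f-4), head)   # optimiser state for the head only

for epoch in 1:10
    grads = Flux.gradient(head) do h
        Flux.logitcrossentropy(Chain(pretrained, h)(X), Y)
    end
    Flux.update!(opt, head, grads[1])   # base stays frozen; only the head moves
end
```

The point is just that the base parameters stay fixed while a small head adapts to the custom data; the real bottleneck is having good pre-trained weights to start from, not the language.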

1 Like

Wait and see.

I tried this. None of the examples work now [maybe needs updating?]

1 Like

Maybe focus on making PyCall better and copy-free sharing of data more seamless.
It may not be realistic to duplicate the amount of infrastructure going into Python.
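For reference, the Julia-to-Python direction is already copy-free in PyCall for plain bits-type arrays (they get wrapped via the NumPy array interface). A rough sketch of what that looks like, assuming PyCall and a NumPy-enabled Python are set up:

```julia
using PyCall

# A Julia Array of bits types is passed to Python without copying:
# the resulting NumPy array wraps the same memory as `x`.
x = zeros(Float64, 3)
pyx = PyObject(x)     # NumPy view over the Julia buffer (assumes NumPy is installed)
pyx.fill(7.0)         # mutate on the Python side
x                     # the Julia array now reads [7.0, 7.0, 7.0]
```

The Python-to-Julia direction is the one that needs `PyArray` to avoid a copy.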

2 Likes

I think computers will find some way of communicating with us, provided they will see a need to do so.
Probably by embedding electrodes in our brains.

That is really cool, but I think Julia could begin building up infrastructure here like it has with everything else - much more efficiently and elegantly, having learned from everyone else’s mistakes (and red herrings) on the platforms that came before.

Maybe worth an issue if you can provide a reproducer. I tried the example in the package readme and it worked fine.

The error message did mention CUDA, which may have to do with the fact that I’m running on an M1 Mac.

What is it that doesn’t work?
Just tried “AttentionIsAllYouNeed/copy” from the examples folder, and it seems to train well.

Julia is on Google’s radar, including for LLMs: it shows the highest jump in capability in the benchmarks for their brand-new state-of-the-art language model, PaLM 2 (which, intriguingly, is a smaller model than its predecessor and still better; the trend had been toward ever-larger models, then smaller ones in the open-source world, with recent commercial models silent on size; is this the first to break that rule and state it’s smaller?), probably the current best model for code-related tasks, as of this week:

Through extensive evaluations on English and multilingual language [also programming languages, e.g. Julia], and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. […]
Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.
When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models.

Julia shows the largest jump in capability, i.e. PaLM 2-S* vs PaLM-540B-Coder:

Figure 6: BabelCode-HumanEval results on 12 programming languages in the pass@1 setting. The Python results are not directly comparable to standard HumanEval due to differences in the evaluation procedure. Raw numeric results are shown in Table 18

Results are shown in Table 8. PaLM 2-S* outperforms PaLM-540B-Coder on all benchmarks, often by a significant margin (e.g. ARCADE), despite being dramatically smaller, cheaper, and faster to serve
[…]
Multilingual Evaluation We also evaluate PaLM 2-S*’s multilingual coding abilities using BabelCode (Orlanski et al., 2023) which translates HumanEval into a variety of other programming languages, including high-resource languages like C++, Java, and Go and low-resource languages like Haskell and Julia.

PaLM 2-S* outperforms PaLM on all but two languages [i.e. Go and C# go down], with surprisingly little degradation on low-resource languages like Julia and Haskell; for instance PaLM 2-S* improves upon the much larger PaLM-Coder-540B by 6.3× on Haskell and on Julia by 4.7×. Remarkably, Java, JavaScript and TypeScript performance is actually higher than Python, the original language.

The report actually has those numbers swapped: Haskell has the 4.7× increase and Julia a 6.76× increase (or 3.86× vs. PaLM-540B rather than PaLM-Coder-540B), while Python gains less than 2×. Note that Julia’s absolute score is only 49% of that of the highest-scoring languages; both Python and C++ get 34.16. Since they are “high-resource”, I read that as them having more training data. Julia gets 79% of Go’s best score (actually an older score; it’s 87% of Go’s PaLM 2 score, which is what will likely go into production, and I think it already has), and surprisingly Lua scores above Go. Julia is 93% above Haskell, and yes, those two languages rank last by absolute score of the 12 ranked.

It’s an interesting model:

B.1 Multilinguality
Explaining jokes […] We show that PaLM 2 exhibits joke understanding capabilities in a multilingual context. We instruct PaLM 2 in a zero-shot setting and provide examples in Figure 12.
Explaining translation ambiguities PaLM 2 exhibits more nuanced translation capabilities and is able to explain the rationale behind translations.
[…]
B.3 Coding
We show samples of PaLM 2 coding capabilities. In Figure 26, we show an example of PaLM 2 designing a simple website. PaLM 2 demonstrates coding capabilities also in a multilingual setting. Figure 27 shows PaLM 2 fixing a bug with line-by-line comments in Korean. Figure 28 provides an example where PaLM 2 generates a function and usage examples with comments in Malayalam.
[…]
Sure, here is the same text in Badisch:
Großi Sprachmodell (LLMs) sin e Art vu künstlicher Intelligenz […] [roughly: “Large language models (LLMs) are a type of artificial intelligence […]”]
Figure 20: Example of asking for a simple explanation in German and in Badisch, a German dialect.

I’m not sure what the best metric to rank on is; I see CodeBLEU from 2020: “absorbs the strength of BLEU […] and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow.”

http://keg.cs.tsinghua.edu.cn/codegeex/

there are 23.9 million professional developers in 2019, and the population is expected to reach 28.7 million in 2024. […]
When developers want to find code written by others with the same intent, code search systems can help automatically retrieve semantically relevant code given natural language queries. When developers are confused about what to write next, code completion systems can help by automatically completing the following tokens given the context of the edits being made. When developers want to implement Java code with the same function of some existing body of Python code, code-to-code translation systems can help translate from one programming language (Python) to another (Java).

Code intelligence therefore plays a vital role in Microsoft’s mission to empower developers. […]

However, the area of code intelligence lacks a benchmark suite that covers a wide range of tasks. […]

To address this, researchers from Microsoft Research Asia, Developer Division, and Bing introduce CodeXGLUE, a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE stands for General Language Understanding Evaluation benchmark for CODE. It includes 14 datasets for 10 diversified code intelligence tasks covering the following scenarios: […]

A brief summary of CodeXGLUE is given below, including tasks, datasets, language, sizes in various states, baseline systems, providers, and short definitions of each task.

A lot of important AI research has been reimplemented in Julia, e.g. NeRFs with Nerf.jl; see this survey on them: https://arxiv.org/pdf/2210.00379.pdf

Google Research has three Julia packages, none registered, e.g. this one:

pkg> add https://github.com/google-research/FirstOrderLp.jl

It did get an update 4 months ago for 1.7, seemingly a small change, but a concerning part of the commit message: “Delete Julia version 1.8 from CI tests (the Manifest version of CSV doesn’t built for 1.8)”

And yes, it states “This is not an official Google product.” (unlike their “generic-adaptive-restarts”, which is also 30% Python; the third project is also 100% Julia, but I couldn’t add/clone it).

8 Likes

From their description for the public:

It excels at advanced reasoning tasks, including code and math

The link for “excels” (a synonym for state-of-the-art?) in that sentence points to the technical report: https://ai.google/static/documents/palm2techreport.pdf

It may (or may not) be the best for code in general, but it doesn’t say it’s the best for Julia. Relevant to that, I find this section important:

2 Scaling law experiments
Scaling Transformer language models has become a popular way to achieve state-of-the-art performance. Kaplan et al. (2020) studied the relationship between scaling the amount of training data (D) and model size (N), and reached the empirical conclusion that it follows a power law, with N needing to grow faster than D. Hoffmann et al. (2022) built upon this observation with a similar study that tuned smaller models’ hyperparameters better. […] they arrived at different results regarding the optimal ratios, showing that N and D should instead grow in equal proportions.

In this section, we independently derive scaling laws for very large models. We arrive at a similar conclusion as Hoffmann et al. (2022), i.e., D and N should grow in equal proportions. We then explore the effect of scaling laws on downstream metrics.

What does that mean for laymen? Since there is less Julia code out there that can serve as training data, you might think a smaller model would be better for it than for, say, Python. However, you also want to train it on a large English natural-language corpus, and the size of any codebase is likely dwarfed by that. Plus you want multilinguality (and not just for natural languages), likely including Python so you can translate from it, so Julia is only an addition…
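To make the “equal proportions” point concrete, here is a back-of-the-envelope sketch. It assumes the standard C ≈ 6·N·D FLOPs approximation and the roughly 20-tokens-per-parameter ratio reported by Hoffmann et al. (2022); both are outside rules of thumb, not numbers from the PaLM 2 report:

```julia
# Compute-optimal sizing under C ≈ 6*N*D and D ≈ 20*N (rules of thumb,
# not PaLM 2 numbers): both N and D scale like sqrt(C).
function compute_optimal(C; tokens_per_param = 20)
    N = sqrt(C / (6 * tokens_per_param))   # parameters
    D = tokens_per_param * N               # training tokens
    return (params = N, tokens = D)
end

compute_optimal(1e24)   # ≈ 9e10 parameters and ≈ 1.8e12 tokens
```

Doubling the compute budget multiplies both N and D by √2, which is the “grow in equal proportions” conclusion.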

1 Like

I tried PythonCall a while back; I found it less stable than PyCall, and it doesn’t have the py"" string macro that executes pure Python code. I tried mixing them but just ended up using PyCall only.
You can get copy-free data sharing with PyCall, but it’s convoluted and easy to unintentionally make copies.
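For anyone reading along, the least convoluted path I know of is `PyArray`, which wraps the NumPy buffer instead of copying it; a small sketch, assuming PyCall with NumPy available:

```julia
using PyCall

np = pyimport("numpy")

# Asking for a plain Array copies; asking for PyObject + PyArray does not.
o = pycall(np.zeros, PyObject, 5)   # keep the result unconverted
a = PyArray(o)                      # no-copy AbstractArray view of the NumPy data
a[1] = 42.0                         # visible from Python too, since memory is shared
```

The pitfall mentioned above is real, though: writing `convert(Array, o)` or letting PyCall auto-convert the return value silently gives you a copy.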

1 Like

It’s a man-power issue. There was a TensorFlow wrapper, but even the wrapper couldn’t keep up with upstream development.
Julia runs loop code well and is math-friendly. It should probably focus on libraries that augment Python core components. That’s back to two languages, but with a much faster, type-safe core language.

1 Like

Side note on PythonCall: there is an equivalent to the py"""...""" macro but it’s a bit hidden: Define Python function in Julia file with PythonCall? - #7 by cjdoris

4 Likes