Bend: a new GPU-native language

I thought this new language looked cool: GitHub - HigherOrderCO/Bend: A massively parallel, high-level programming language. It’s still in the research stages but presents some interesting ideas.

It compiles a Python-like language directly into GPU kernels – more than just array broadcasting – by exploiting parallelism wherever it can.

I wonder if Julia could eventually try to do something like this.

They have a paper on it here: https://raw.githubusercontent.com/HigherOrderCO/HVM/main/paper/PAPER.pdf

The underlying compiler is here: GitHub - HigherOrderCO/HVM: A massively parallel, optimal functional runtime in Rust

So by compiling to this “HVM” target, a language can take advantage of their GPU parallelization.


Edit: changed title so it’s clearer that HVM is GPU-native, not just CPU threads

17 Likes

I remember someone pointed out that it’s partly due to their baseline being very slow.

Do you have a link? I read through the Hacker News thread where someone commented on this number being low, but it seems like they didn’t realize that 1 GPU core << 1 CPU core.

I’m doubtful. I don’t think a language that automatically allocates a thread every time something gets executed would be optimal. Spawning a thread has an overhead. Locking and unlocking has an overhead. If your language is slow on each thread to start with, then you can naively speed it up with parallelism. The hard part of many parallel algorithms is not sprinkling in a bunch of “spawn” or “lock”; it’s designing the algorithm to be parallelizable, and perhaps lock-free, and so on.
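To make the overhead concrete, here’s a toy Julia comparison (my own sketch, nothing to do with Bend’s actual runtime) of spawning a task per operation versus a plain loop:

# One task per "operation": @spawn's bookkeeping is on the order of
# microseconds, while the per-element work here is nanoseconds.
plain(xs) = sum(x -> x + 1, xs)

function spawn_everything(xs)
    tasks = [Threads.@spawn x + 1 for x in xs]  # one task per element
    return sum(fetch, tasks)                    # pay scheduling cost per element
end

xs = 1:10_000
@time plain(xs)             # essentially free
@time spawn_everything(xs)  # orders of magnitude slower, despite "parallelism"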

4 Likes

The HN comment points out that Python single-threaded is much faster than Bend single-threaded, and that PyPy is much faster than Bend on the GPU.

I think this doesn’t invalidate the idea of something like Bend, but if the base language is fast (Julia), it’s very hard to automatically find better parallel algorithms without being slower for some problem sizes.

For example, all(predicate, array) can be parallelized, but depending on problem size and the probability of early termination, how would the language decide whether it should run in parallel or not?
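As a sketch of what such a decision heuristic even looks like (maybe_parallel_all and its cutoff are hypothetical names and numbers, invented for illustration):

function maybe_parallel_all(pred, xs; cutoff = 100_000)
    # Below the cutoff, spawn overhead alone makes the serial version win.
    length(xs) < cutoff && return all(pred, xs)
    n = Threads.nthreads()
    len = cld(length(xs), n)
    tasks = map(1:n) do c
        lo = (c - 1) * len + 1
        hi = min(c * len, length(xs))
        # Each chunk can still terminate early within itself...
        Threads.@spawn all(pred, @view xs[lo:hi])
    end
    # ...but a failing chunk cannot cancel its siblings, so a likely early
    # exit means the parallel version does wasted work.
    return all(fetch, tasks)
end

The right cutoff depends on the hardware, the cost of pred, and the distribution of the data, which is exactly what the language would somehow have to know.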

We’ve had quite a few projects on automating different forms of parallelism. The issue is not whether you can parallelize things in a compiler, it’s whether you want to. Solving the scheduling problem of whether a given form of parallelism will actually make things faster or slower is hard on shared memory, since threads have a non-trivial spin-up cost, and so you need some pretty good cost models. It’s even harder for anything multiprocessed or GPU-based because you also have to factor in memory transfer costs.

Whether parallelism is a good idea is not necessarily fully captured by the program either; in many cases you need to know the size of the data in order to know if it makes sense. For example, for a large enough matrix, sending it to the GPU, computing there, and sending it back will beat even the best CPU. Whether you should multithread at all depends on whether the problem is too small (you cannot multithread an 8x8 matmul and expect to win). So this means you need dynamic scheduling, and a dynamic scheduler itself has overhead. If you want to mask that overhead, then you need a compiler that can figure out the places where you may want a more streamlined computation (like an inner loop) and pull the scheduling out of there, but not all inner loops, because then you’d never multithread a linear algebra implementation.
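To make that concrete, here’s a toy version of the size-based dispatch (my own sketch; it assumes CUDA.jl is loaded, matmul_dispatch is a made-up name, and the 256 cutoff is a placeholder rather than a measured crossover point):

using CUDA, LinearAlgebra

function matmul_dispatch(A::Matrix{Float32}, B::Matrix{Float32}; gpu_cutoff = 256)
    if size(A, 1) >= gpu_cutoff && CUDA.functional()
        # Transfer, CUBLAS matmul on the device, transfer back: this only
        # pays off when the compute dwarfs the two copies.
        return Array(CuArray(A) * CuArray(B))
    else
        # CPU BLAS: the right call for small sizes like 8x8.
        return A * B
    end
end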

One other factor to keep in mind is that the choice of scheduling is not something with a unique solution. Most frameworks that “auto”-parallelize in some form make a fixed choice as to how to perform the scheduling. For example, vmap is an “auto parallelism” construct in many machine learning libraries which, while good for linear algebra, is quite obviously a bad idea for kernels not dominated by linear algebra, which is how we can show that it (PyTorch’s and Jax’s implementations) is 20x-100x slower than doing the right thing.

https://www.sciencedirect.com/science/article/abs/pii/S0045782523007156

Not very surprising of course when you consider how vmap chooses to schedule, but it shows that even a billion-dollar company cannot beat a normal human who takes into consideration how a computation should happen.

Finally, whenever I hear this kind of discussion, I always think back to the early NetworkX vs LightGraphs.jl discussions. People lamented for a while that it would be hard to catch up to NetworkX and all of its parallelism goodies, back before Julia had multithreading. But someone did an independent benchmark and…

Well, that’s the v2 of it, but same results. Basically, NetworkX has a bunch of papers talking about how great its parallel scaling is, but it’s about 30x slower than a single-core implementation for most graph sizes you can fit on a shared-memory machine.

Parallelism and scaling are good properties to have, but you cannot act like a reinforcement learning algorithm and just optimize scaling at all costs. Remember, the goal is to make codes faster, not more parallel.

30 Likes

I know this is kind of ironic on a Julia forum, but I wish people would stop making comparisons to Python when the language is not even similar enough to be a subset or a derivative. You don’t annotate types and the function keyword is def, but there’s no for or class (which may be reasonable high-level limitations to allow parallelism). From their incomplete paper, Python and Haskell are potential source languages for compilation to the intermediate HVM2; Bend is just the language they could demonstrate for now.

2 Likes

I think some people in this thread may be missing the key innovation/coolness here and instead projecting it onto existing libraries.

This is not vmap. This is a language that, in its entirety, can generate efficient CUDA kernels, parallelizing any operations which can be run in parallel.

The only thing I’ve ever seen close to this is JAX, which via XLA can fuse CUDA kernels, but in JAX you are severely limited in what you can do, as you are basically using Python for meta-programming C++.

But this is a full compiler that can generate massively parallel GPU kernels from high-level code. I haven’t seen anything like this before.

And don’t think of this as a competitor; the “bend” language is just an example implementation, but HVM2 is something Julia could actually use (perhaps via a macro).


Anyway, also just to point out: it’s not optimized yet; it seems like a research language that just got off the ground. I.e., you don’t want to naively compare performance numbers with other languages yet. The scaling is what matters.

7 Likes

That seems to not address anything that was mentioned above. vmap is one of many forms of parallelism. Others, like Cilk, are already built into Julia in some form. Remember, the precursor to Julia in some sense was Star-P, a parallel MATLAB compiler.

Early projects such as the ParallelAccelerator subset (made by Intel):

before it was abandoned due to Tapir.

Out-of-core scheduler projects still exist such as Dagger:

It’s not difficult to set up an overlay table so that every function call turns into a Dagger.@spawn call and is thus handled by a scheduler. That would effectively give you Bend. The reason people don’t do that is that you’d just get slower code in most scenarios because of scheduler overhead.
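As a sketch of the flavor (this uses Dagger.jl’s real @spawn API, though it is not an actual overlay-table implementation):

using Dagger

# Every "call" becomes a scheduled task; Dagger tracks that `b` depends on
# `a`. For work this small, the scheduler's bookkeeping costs far more than
# the arithmetic it could ever parallelize.
a = Dagger.@spawn 1 + 2   # a task handle, not 3
b = Dagger.@spawn a * 3   # depends on `a`; the scheduler resolves it
fetch(b)                  # == 9 once both tasks have run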

Julia has been taking a step-by-step development path to get there: starting from single-core performance, then really focusing on multithreading performance, then extending the scheduler to support distributed scheduling with an appropriate cost model. You can dig up older documents that outline exactly these plans:

But of course, we’re still in the phase of optimizing Tapir. You’ll notice some familiar names:

Probably the most recent step towards this was:

Basically APIs that let someone opt functions into may-happen parallelism.

The part of this which ended up getting deployed the soonest was ModelingToolkit, since that took a lot of the principles and applied them to a domain-specific space where more assumptions could be made to simplify the problem. This was the actual first topic of the symbolic-numeric toolset:

The non-trivial part would be the cost model, which would need to be heavily tailored to the language. I don’t know how much you’d gain by pulling the rest of the scheduler along.

9 Likes

I still don’t see anything related to GPUs though? Dagger.@spawn does not give you CUDA kernels… CUDA.jl still involves writing CUDA-like code.
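For example, even a trivial saxpy in CUDA.jl is still structured exactly like a CUDA C kernel (saxpy_kernel! here is my own toy example, but the CUDA.jl calls are the real API):

using CUDA

function saxpy_kernel!(y, a, x)
    # Compute the global thread index and bounds-check, just like in CUDA C.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
y = CUDA.rand(Float32, 1024)
@cuda threads=256 blocks=cld(length(y), 256) saxpy_kernel!(y, 2f0, x)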

The GPU part is what is so crazy about HVM. You write high-level code – which looks nothing like a CUDA kernel – and get PTX out. It is not limited to vectorized array operations either.

I think you might be underestimating things a bit here… The author says they’ve been working on this for 10 years – x.com. HVM2 is the fourth iteration of their framework. It really does not sound so easy to do this.

9 Likes

What is the line you’re drawing here, exactly? CUDA C++ is technically high-level code compiled to PTX. They have tutorials on traversing and generating trees too.

But we are on a Julia forum…

I guess maybe you are saying “high level” is a relative concept, but we are speaking relative to CUDA rather than relative to PTX.

Not to mention there are many things which are extremely difficult to express in CUDA (or on a GPU in general, I suppose) or with a library of array operations. Which is where something like HVM is proposed to help.

1 Like

According to the HVM2 paper:

In single-thread CPU evaluation, HVM2 is, at baseline, still about 5x slower than GHC, and this number can grow to 100x on programs that involve loops and mutable arrays, since HVM2 doesn’t feature these yet.

It’s a shame that so many interesting projects in the Julia ecosystem don’t get more promotion. Fortunately, Calculus with Julia was promoted on Hacker News today with great success.

5 Likes

Well, Julia v1.0 was also the 4th iteration in some sense, going back to Star-P in 2004, then Julia’s initial release, then Julia’s experimental Distributed and Multithreading, and finally Tapir-based parallelism and the constructs now going beyond that. So we’re going on 20 years :sweat_smile:. I don’t think there’s much point to such counting, but this ain’t the first rodeo around here.

Julia code can compile down to custom kernels in CUDA.jl. There are then abstractions written on top of that with tools like KernelAbstractions.jl.

For example, the matmul kernel is effectively just Julia code with an added piece for describing the global index:

@kernel function matmul_kernel!(output, a, b)
    i, j = @index(Global, NTuple)

    # creating a temporary sum variable for matrix multiplication
    tmp_sum = zero(eltype(output))
    for k in 1:size(a, 2)
        tmp_sum += a[i,k] * b[k, j]
    end

    output[i,j] = tmp_sum
end
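And launching it is just a few lines (a minimal sketch using the documented KernelAbstractions.jl API; with CUDA.jl arrays, get_backend would return CUDABackend() and the same kernel would run on the GPU):

using KernelAbstractions

a = rand(Float32, 128, 64); b = rand(Float32, 64, 32)
output = zeros(Float32, 128, 32)
backend = get_backend(output)  # CPU() for plain Arrays
matmul_kernel!(backend)(output, a, b; ndrange = size(output))
KernelAbstractions.synchronize(backend)
@assert output ≈ a * b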

With Bend, if I’m not mistaken, you still have to figure out how to represent the code using bend and fold constructs. Note this is pretty close to what I was linking to before with the Tapir extensions, where Taka’s proposed Tapir extensions were coupled with transducer-type parallelism approaches.

You can see a demonstration of it here:

https://juliafolds.github.io/data-parallelism/tutorials/quick-introduction/

And it includes FoldsCUDA.jl for a version with CUDA support. The idea was to integrate the DAG construction into the compiler (that’s the PR) and give a macro so users can opt in easily (as opposed to being fully parallel, so that the general scheduling problem does not have to be solved).

But there are some issues with the bend/fold approach. It’s not new, and Guy Steele is probably the person who has said the most on this. His early parallel programming language Fortress is one of the languages that heavily influenced Julia’s design. He has a nice talk on the limits of such an approach:

And he gave a keynote at an early JuliaCon:

This is a particular project that I would be interested in reviving if I ever find the right person and grant funding for it (though I’m not the right person to do the day-to-day reviews on it :sweat_smile:). I think integrating bend/fold with opt-in may-happen parallelism (i.e., opting in to letting the compiler choose whether and how to parallelize) would be a nice thing to have, though I’m personally skeptical about how many of my codes it could match my hand-written parallelism on, so I tend to keep this stuff off “the critical path” for now.

5 Likes

Maybe the missing piece here: Bend works. It’s a full programming language that can compile to correct CUDA kernels. It’s not a limited set of constructs that are already easy to write kernels for (FoldsCUDA.jl) or a CUDA wrapper (CUDA.jl), but rather a full GPU-native language that looks like regular code, and it currently generates correct PTX. (And as the saying goes, the last 20% takes 80% of the time.)

Of course, the ideas for this are not new; the author themselves cites a 1997 paper for the design. But the fact that they actually got it working is novel.

I’d love to be mistaken of course, but last I checked I still need to write CUDA kernels for any operation that doesn’t fit nicely into broadcasting or reduction. While CUDA.jl lets you use Julia inside the kernel, it still needs to look like and act like a regular CUDA kernel. And with CUDA kernels there are some things that are hugely complicated to do well, to the point that I don’t even bother trying.

1 Like

Ehh, marketing. Read the actual docs, not the marketing material. Bend takes its lineage from those kinds of DAG-builder approaches but mixes some of Clojure’s pieces in:

Where all data structures are immutable and recursive

Finally, let’s get straight to the fun part: how do we implement parallel algorithms with Bend? Just kidding. Before we get there, let’s talk about loops. You might have noticed we have avoided them so far. That wasn’t by accident. There is an important aspect on which Bend diverges from Python, and aligns with Haskell: variables are immutable. Not “by default”. They just are. For example, in Bend, we’re not allowed to write:

Which is immutable. If that sounds annoying, that’s because it is. Don’t let anyone tell you otherwise. We are aware of that, and we have many ideas on how to improve this, making Bend feel even more Python-like. For now, we have to live with it. But, wait… if variables are immutable… how do we even do loops?

The solution is, of course, transducers. Wait, I mean bend and fold.

Fold: consuming recursive datatypes
Bend: generating recursive datatypes

Bend and fold are just transducers. Transducers.jl comes from the same place.

Transducers.jl provides composable algorithms on “sequence” of inputs. They are called transducers, first introduced in Clojure language by Rich Hickey.

JuliaFolds then built constructs on top of those because most people find writing transducer-based programs quite hard to do:
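For a flavor of the two layers (a tiny sketch using the existing Transducers.jl and Folds.jl APIs; both lines compute the same parallelizable reduction):

using Transducers, Folds

# The explicit transducer form: a Map transducer fed to a threaded fold.
foldxt(+, Map(x -> x^2), 1:10^6)
# Folds.jl's friendlier skeleton-style wrapper over the same machinery.
Folds.sum(x^2 for x in 1:10^6)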

So then it becomes a skeleton-based parallel framework on top of a transducer-based immutable programming layer. Which, again, I mentioned before as one of the ongoing research directions (one I would like to revive). The things required to make this more performant in Julia (and thus the barriers for now) include optimizations for immutability and improved escape analysis, which are major projects right now with the new Memory type of v1.11 and the rebuilding of constructs on that piece.

So… “rather a full GPU-native language that looks like regular code”: if your regular code consists only of immutable recursive data structures, manipulated not with loops but with transducers, then yes, it’s a full language that looks like regular code! I happen to use loops from time to time, and arrays, and heaps, and some other non-recursive structures. So to me, this is not “regular code”.

This, of course, was the same major adoption barrier for Clojure.

No it doesn’t. Parallelism works when it speeds things up. It currently doesn’t speed up code.

But there are also many projects like this. DawnCC was a C compiler that attempted to automate parallel constructs:

https://homepages.dcc.ufmg.br/~fernando/publications/papers_pt/Breno16Tools.pdf

Bones was another such project:

It did well on codes that were amenable to loop analysis but had difficulty speeding up anything more general. These types of parallelism models have been called “skeleton-based programming models”. Other languages to look at in this space include:

https://skelcl.github.io/

https://muesli.uni-muenster.de

https://parallelme.github.io/

There’s a project from NIST that I’m missing too…

Most are skeleton-based because that’s usually a good way to get something working (i.e. fast enough to do some example well) before taking off to be a whole thing.

Probably the thing I would point to that worked the most in this space so far is hiCUDA, which, if you read their papers, uses very similar language:

We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, and directly to the sequential code.

Its promise was that you could write something pretty close to normal C++ and it would automatically construct the parallel code. The main difference from Bend is that it required you to control memory allocations so that you could enforce locality, as this was pretty necessary for the “compiler-guessed CUDA” to get close to the handwritten optimal CUDA. But getting the data handling right is hard, so it was an interesting choice to just opt out of that part and focus the automation only on the code, under the assumption that the user will locate the data properly.

There have been many projects that automatically construct DAGs to be alternatively compiled. Those projects have effectively been thrown away because they weren’t actually fast, just like Bend. The pieces that you actually see shared and still in use are the ones with opt-in parallelism because, again, anything that was not opt-in could not guarantee it wouldn’t slow down normal execution. Removing the opt-in nature is a choice, not a technical hurdle. It’s just that no one has found a good enough solution to the scheduling problem to justify moving to a more automated interface. And clearly Bend hasn’t solved that problem either.

15 Likes

FWIW I’m experimenting with an XLA-based Julia fuser in GitHub - EnzymeAD/Reactant.jl

11 Likes

The most exciting (to me) approach to high-level GPU programming is the one based on SOACs (second-order array combinators): https://futhark-lang.org/

I suspect that some of the Futhark compilation rules could be used in a dedicated Julia package, but I have not found the time (yet) to explore this.

2 Likes

@ChrisRackauckas thanks for all of these links. Of course Transducers.jl and Folds.jl are awesome, but they definitely constrain what sorts of code you can write. I think most things possible on the GPU via Transducers.jl and Folds.jl would be fairly easy to write as explicit CUDA kernels already (and likely faster when written as explicit CUDA).

Again, the GPU-ification of arbitrary high-level code is the cool part.

Imagine writing a compiler that could run natively on GPUs. That seems impossibly difficult in CUDA and CUDA wrappers like CUDA.jl/FoldsCUDA.jl/etc., but it seems possible in a high-level framework like this.

I would push back on this as Bend seems to already get good scaling:

It only got released two days ago though. I’m just excited it seems to scale like this at all!

I took a look through these, but (a) the GPU-compatible ones look highly constrained in terms of language features, and (b) the more flexible ones seem incompatible with GPUs. Bones looks the most related, but it seems to be missing some language features, and it has sadly been abandoned for 10 years.

Thanks, looks useful!

Thanks, that also looks cool. Although I will note that this language declares itself to be an “array language” (which means the CUDA kernels are likely easy to write by hand in many cases).

1 Like

Why don’t you say Bend also constrains the code you can write? It has the same constraints as folds and transducers, right? The difference is that Bend is a language where only fold and transducer constructs exist, while Transducers.jl is a system inside of Julia which requires that you use only folds and transducers to get the parallelism. But given that they have the same constraints (I pointed to the exact places in the documentation that say this), why would you not say Bend also constrains the way you write code?

Scaling isn’t performance. Remember, NetworkX is more scalable than LightGraphs.jl; there just happens to be no graph size that fits on a modern computer for which NetworkX is faster.

5 Likes