Bend: a new GPU-native language

Ehh, marketing. Read the actual docs, not the marketing material. Bend takes its lineage from those kinds of DAG-builder approaches but mixes in some of Clojure’s pieces:

Where all data structures are immutable and recursive

Finally, let’s get straight to the fun part: how do we implement parallel algorithms with Bend? Just kidding. Before we get there, let’s talk about loops. You might have noticed we have avoided them so far. That wasn’t by accident. There is an important aspect on which Bend diverges from Python, and aligns with Haskell: variables are immutable. Not “by default”. They just are. For example, in Bend, we’re not allowed to write:

Which is immutable. If that sounds annoying, that’s because it is. Don’t let anyone tell you otherwise. We are aware of that, and we have many ideas on how to improve this, making Bend feel even more Python-like. For now, we have to live with it. But, wait… if variables are immutable… how do we even do loops?
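To make the restriction concrete, here’s roughly the kind of rewrite it forces (a sketch in Julia rather than Bend syntax, with made-up function names): the loop version keeps rebinding its accumulator, which is exactly what an immutable-variables language forbids, so the accumulation has to be threaded through recursion instead.

```julia
# Loop style: `total` is rebound on every iteration.
# This is the pattern that immutable variables rule out.
function sum_loop(xs)
    total = 0
    for x in xs
        total += x
    end
    return total
end

# Immutable style: nothing is ever rebound; the accumulator is
# carried along through a recursive call instead.
sum_rec(xs) = isempty(xs) ? 0 : first(xs) + sum_rec(xs[2:end])
```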

The solution is, of course, transducers. Wait, I mean bend and fold.

Fold: consuming recursive datatypes
Bend: generating recursive datatypes
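Loosely, bend plays the role of a recursive generator and fold the role of a recursive consumer. Here’s a rough analogy in plain Julia recursion (not Bend’s actual constructs; the tree type and function names are made up for illustration):

```julia
# A small immutable binary tree.
abstract type Tree end
struct Leaf <: Tree
    value::Int
end
struct Node <: Tree
    left::Tree
    right::Tree
end

# "bend"-like: grow a recursive structure out from a seed,
# splitting until the stopping condition hits.
grow(depth) = depth == 0 ? Leaf(1) : Node(grow(depth - 1), grow(depth - 1))

# "fold"-like: consume the recursive structure back down to a value.
total(t::Leaf) = t.value
total(t::Node) = total(t.left) + total(t.right)

total(grow(3))  # 8 leaves, so this evaluates to 8
```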

Bend and fold are just transducers. Transducers.jl comes from the same place.

Transducers.jl provides composable algorithms on “sequence” of inputs. They are called transducers, first introduced in Clojure language by Rich Hickey.
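A small example of what that looks like in practice (assuming Transducers.jl’s exported `Map`, `foldxl`, and `foldxt`; check the package docs for the current API):

```julia
using Transducers

# The transducer describes the transformation independently of the
# input collection and of how the reduction is executed.
xf = Map(x -> x^2)

foldxl(+, xf, 1:1000)   # sequential fold
foldxt(+, xf, 1:1000)   # the same fold, executed with threads
```

The same `xf` can be driven sequentially or in parallel, which is the kind of decoupling bend and fold are going for.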

JuliaFolds then built constructs on top of those because most people find writing transducer-based programs quite hard to do:
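For instance, FLoops.jl (one of the JuliaFolds packages) lets you write what reads like an ordinary loop and lowers it onto the transducer machinery; roughly (assuming its `@floop`/`@reduce` macros):

```julia
using FLoops

# Reads like a normal loop with an accumulator, but is lowered to a
# (parallelizable) fold under the hood.
@floop for x in 1:1000
    y = x^2
    @reduce(s += y)
end
s  # the sum of squares
```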

So then it becomes a skeleton-based parallel framework on top of a transducer-based immutable programming layer, which again is one of the research directions I mentioned before as ongoing (and one I would like to revive). The things required to make this more performant in Julia (and thus the barriers for now) include optimizations for immutability and improved escape analysis, which are major projects right now with the new Memory type in v1.11 and the rebuilding of constructs on top of that piece.

So… “rather a full GPU-native language that looks like regular code”: if your regular code consists only of immutable recursive data structures, manipulated not with loops but with transducers, then yes, it’s a full language that looks like regular code! I happen to use loops from time to time, and arrays, and heaps, and some other non-recursive structures. So to me, this is not “regular code”.

This, of course, was the same major barrier to adoption that Clojure faced.

No, it doesn’t. Parallelism works when it speeds things up, and Bend currently doesn’t speed up code.

But there are also many projects like this. DawnCC was a C compiler that attempted to automate parallel constructs:

https://homepages.dcc.ufmg.br/~fernando/publications/papers_pt/Breno16Tools.pdf

Bones was another such project.

It did well on codes that were amenable to a loop analysis, but had difficulty speeding up things that were more general. These types of parallelism models have been called “skeleton-based programming models”. Other languages to look at in this space include:

https://skelcl.github.io/

https://muesli.uni-muenster.de

https://parallelme.github.io/

There’s a project from NIST that I’m missing too…

Most are skeleton-based because that’s usually a good way to get something working (i.e., fast enough to handle some example well) before taking off to be a whole thing.

Probably the thing I would point to that worked the best in this space so far is hiCUDA; if you read their papers, you’ll see very similar language:

We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, and directly to the sequential code.

Its promise was that you could write something pretty close to normal C++ and it would automatically construct the parallel code. The main difference from Bend is that it required you to control memory allocations so that you could enforce locality, since that was pretty much necessary for the “compiler-guessed CUDA” to get close to hand-written optimal CUDA. But getting the data handling right is hard, so it was an interesting choice to just opt out of that part and focus the automation only on the code, under the assumption that the user will place the data properly.

There have been many codes that automatically construct DAGs to be compiled in alternative ways. Those projects have effectively been thrown away because they weren’t actually fast, just like Bend. The pieces you actually see shared and still around are the ones with opt-in parallelism because, again, anything that was not opt-in could not guarantee it wouldn’t slow down normal execution. Removing the opt-in nature is a choice, not a technical hurdle; it’s just that no one has found a good enough solution to the scheduling problem to justify moving to a more automated interface. And clearly Bend hasn’t solved that problem either.
