Bend: a new GPU-native language

Quick comment as a Tapir author here.

So the model of parallelism there is essentially individual tasks that can be optimized quite well and run very lightweight (some folks have even built custom hardware for lightweight threads compiled using Tapir).

Much like the canonical Enzyme example, the canonical Tapir example shows why you want to run optimizations on parallel code (and get potentially asymptotic speed-ups). For more details see the latter half of https://c.wsmoses.com/presentations/defense.pdf (including how applying Tapir optimizations makes for much faster parallel gradients).

While there has been various follow-up work like Polygeist that does autoparallelization, my (admittedly limited) read of Bend is that it focuses more on identifying opportunities for parallelism via functional programming than on optimizing parallel code.

Of course, that actually means the two could work together! (Use Bend to find opportunities for parallelism and Tapir to optimize it to run/scale better.)

6 Likes

I should be clear I’m not claiming anything here about Bend being state-of-the-art performance on GPUs (I also don’t have any personal investment in this project, just sharing something I thought was cool!). I basically just haven’t seen such a high-level and flexible language that can be compiled to native GPU code. I think it’s cool that it even works!

I guess I would hope performance would come later. It would obviously never be as fast as code you could describe in normal CUDA. But for code patterns that I wouldn't have been able to (efficiently) describe in some CUDA kernels anyway, this seems like an option.

I mean, I don’t think so? Sure, you can write any program in the form of transducers/folds, if that’s what you mean, but you wouldn’t be able to access the same parallelism axes possible here unless you explicitly put it in (which, for some algorithms, is very, very difficult, to the point you just wouldn’t even try!).

Sounds cool, thanks for the slides

1 Like

Yes, Bend forces all programming into functional programming with immutable data structures and then does scheduling on it. Tapir doesn’t require immutable data structures at all. My point, though, is that what Taka was doing was building such immutable data structures and functional transducers and then setting up a Tapir-like scheduler for automating the parallelism, so exactly that combination of the two.

Can you show me what other constructs are available in Bend? My read of the docs shows exactly the same functional programming constructs as the Clojure/Transducers.jl model (immutable recursive data structures, bends, and folds).
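
For concreteness, here is a minimal sketch (plain Julia, illustrative names only) of the model I mean: build an immutable recursive tree by unfolding (a "bend") and collapse it with a fold, cf. the gen/sum functions in the Bend example below.

struct Leaf
    value::Int
end
struct Node
    left::Union{Leaf, Node}
    right::Union{Leaf, Node}
end

# "bend": unfold an immutable tree to a given depth
bend(d, x) = d == 0 ? Leaf(x) : Node(bend(d - 1, x * 2 + 1), bend(d - 1, x * 2))

# "fold": collapse the tree with an associative operation
fold(op, t::Leaf) = t.value
fold(op, t::Node) = op(fold(op, t.left), fold(op, t.right))

fold(+, bend(4, 0))  # == sum(0:15) == 120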

3 Likes

Consider the example they give here of a sorting algorithm:

# Sorting Network = just rotate trees!
def sort(d, s, tree):
  switch d:
    case 0:
      return tree
    case _:
      (x,y) = tree
      lft   = sort(d-1, 0, x)
      rgt   = sort(d-1, 1, y)
      return rots(d, s, (lft, rgt))

# Rotates sub-trees (Blue/Green Box)
def rots(d, s, tree):
  switch d:
    case 0:
      return tree
    case _:
      (x,y) = tree
      return down(d, s, warp(d-1, s, x, y))

# Swaps distant values (Red Box)
def warp(d, s, a, b):
  switch d:
    case 0:
      return swap(s + (a > b), a, b)
    case _:
      (a.a, a.b) = a
      (b.a, b.b) = b
      (A.a, A.b) = warp(d-1, s, a.a, b.a)
      (B.a, B.b) = warp(d-1, s, a.b, b.b)
      return ((A.a,B.a),(A.b,B.b))

# Propagates downwards
def down(d,s,t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return (rots(d-1, s, t.a), rots(d-1, s, t.b))

# Swaps a single pair
def swap(s, a, b):
  switch s:
    case 0:
      return (a,b)
    case _:
      return (b,a)

# Testing
# -------

# Generates a big tree
def gen(d, x):
  switch d:
    case 0:
      return x
    case _:
      return (gen(d-1, x * 2 + 1), gen(d-1, x * 2))

# Sums a big tree
def sum(d, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return sum(d-1, t.a) + sum(d-1, t.b)

# Sorts a big tree
def main:
  return sum(18, sort(18, 0, gen(18, 0)))

This entire thing gets compiled to GPU-native code. All parts that can be run in parallel are run in parallel on the GPU (so they claim).

Is it possible to do this in Transducers/Folds such that it is executed on the GPU, exploiting all the same axes of parallelism? I don’t think this is possible currently but I would be very happy to be proven wrong.


(My guess is that it can only be done by rewriting all the code to be executed on arrays. But this is sort of the main point of attraction of Bend – you don’t need to rewrite it!)
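
To make the contrast concrete, here is a rough sketch of what that array rewrite could look like with CUDA.jl (only the gen/sum part; a full tree-rotation sort over a flat array is considerably more work, and this is purely illustrative):

using CUDA

# gen(d, 0) above produces each d-bit integer exactly once, so its array
# form is just the range 0:2^d-1 uploaded to the GPU.
d = 18
xs = CuArray(0:(2^d - 1))

total = sum(xs)   # parallel reduction on the GPU
sort!(xs)         # CUDA.jl provides a GPU sort! for CuArrays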

3 Likes

That code is just a recursive function on an immutable recursive data structure, so it’s handled by exactly the same infrastructure that’s already been discussed?

1 Like

Okay but how would you actually write this in Julia such that it can run entirely on GPUs? I don’t think it’s possible (without a lot of work, and switching to array-based trees). Can FoldsCUDA do this?
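
(For reference, my possibly wrong understanding of what FoldsCUDA offers today, going by its README, is reductions over flat GPU arrays via its CUDAEx executor, not folds over recursive tree structures, so you’d still have to do the array rewrite first. Something along these lines:)

using CUDA, FLoops, FoldsCUDA

# Array-style GPU reduction using FoldsCUDA's CUDAEx executor; this runs
# over a flat CuArray, not over a recursive tree.
xs = CUDA.rand(Float32, 2^20)
@floop CUDAEx() for x in xs
    @reduce(s = 0.0f0 + x)
end
s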

1 Like

I believe there’s a dependency-resolution issue here that hampers the progress of Julia.

Without a guarantee of continuity, an entire branch of the package ecosystem might become obsolete.

Perhaps if there were a rule in place, ensuring that any package, regardless of its version, could be installed and function smoothly, this problem could be mitigated.

(@v1.10) pkg> add FoldsCUDA
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package CUDA [052768ef]:
 CUDA [052768ef] log:
 ├─possible versions are: 0.1.0-5.3.4 or uninstalled
 ├─restricted to versions * by cuPDLP [c7d20397], leaving only versions: 0.1.0-5.3.4
 │ └─cuPDLP [c7d20397] log:
 │   ├─possible versions are: 0.0.0 or uninstalled
 │   └─cuPDLP [c7d20397] is fixed to version 0.0.0
 ├─restricted by compatibility requirements with FoldsCUDA [6cd66ae4] to versions: 2.0.0-3.13.1
 │ └─FoldsCUDA [6cd66ae4] log:
 │   ├─possible versions are: 0.1.0-0.1.9 or uninstalled
 │   └─restricted to versions * by an explicit requirement, leaving only versions: 0.1.0-0.1.9
 └─restricted by compatibility requirements with cuDNN [02a925ec] to versions: 4.0.0-5.3.4 — no versions left
   └─cuDNN [02a925ec] log:
     ├─possible versions are: 1.0.0-1.3.1 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions: 1.0.0-1.3.1
(@v1.10) pkg> st
CUDA v5.3.4
cuDNN v1.3.1
cuPDLP v0.0.0 `https://github.com/jinwen-yang/cuPDLP.jl#master`
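
(A possible way to sidestep this particular clash for experimentation, though it does not fix the underlying compat bounds, would be to resolve FoldsCUDA in its own environment rather than the shared @v1.10 one, e.g.:)

using Pkg

# Workaround sketch: isolate FoldsCUDA in a temporary environment so it
# doesn't have to co-exist with cuPDLP/cuDNN's CUDA version bounds.
Pkg.activate(temp = true)
Pkg.add("FoldsCUDA")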

Chris has been talking about a lot of very interesting prior art here, but I’d like to point out that Bend is following a well-known tradition of confusing scalability for performance.

The paper above coined the term COST (the Configuration that Outperforms a Single Thread): how many cores does your system need before it achieves a speedup over a well-implemented single-threaded version?

From the first comment on Hacker News:

under pypy3 it executes in 0m4.478s, single threaded. Under python 3.12, it executed in 1m42.148s, again single threaded. I mention that because you include benchmark information:
    CPU, Apple M3 Max, 1 thread: 3.5 minutes
    CPU, Apple M3 Max, 16 threads: 10.26 seconds
    GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds

whereas my Julia version is

julia> function bsum(depth, x)
           if depth == 0
               return x
           else
               fst = bsum(depth - 1, x*2+0)
               snd = bsum(depth - 1, x*2+1)
           end
           return fst + snd
       end
bsum (generic function with 1 method)

julia> @time bsum(30,0)
  3.098500 seconds
576460751766552576

which means Bend has a COST of about 16k threads over a single-threaded Julia implementation (assuming linear scaling on the GPU; please correct me) - and we don’t even have tail-call optimization! I believe Coalton or some other Lisp that does compile to native code will smoke the benchmark.

…Frankly, I would be somewhat more optimistic about the potential research direction of Bend if the lead dev didn’t have

Reaching AGI to cure all diseases and suffering is all that matters

as part of their twitter bio or if they weren’t involved with all the super shady cryptocurrency stuff.

If anything, I’m glad other people are talking about really cool parallel languages like Futhark, which is an actual project that punches waaaay above its weight (for the cases where it’s applicable).

We should be building better interfaces to Futhark for sure. Bend I’m not too keen on.

10 Likes

I mean in the AI research community this is far from unusual. But point taken!

6 Likes

Two things here. One is that doing research on a compiler that’s part of a production language cannot mean that all other progress stops. You do have to fork the compiler to fix a version and do a study, and then, if it’s successful, merge. It’s just not possible for every Julia version to support every research project imaginable.

That said, this project did have all of the intentions and drive to be a part of the compiler. A lot of PRs were made to Julia, but some were left unfinished. The reason why it was left in an unfinished state is somewhat of a mystery. Taka (who was a member of the MIT Julia Lab) effectively vanished one day and I never heard from him again. It would be extremely non-trivial to revive everything from the state it was left in, given this was not an intentional ending state from what anyone can tell.

If someone wanted to revive this, the path would probably be to break it into smaller chunks. To make it useful, you’d probably want to start with the immutable portions:

(a) set up new infrastructure for immutable arrays on the new Memory type (a rough sketch is below),
(b) rebuild GitHub - JuliaCollections/FunctionalCollections.jl: Functional and persistent data structures for Julia on top of the new immutable infrastructure,
(c) optimize escape analysis and the effects system to make it as performant as possible,
(d) land a revived version of it all in the fork of JuliaFolds (JuliaFolds2 · GitHub),
(e) revive RFC: Experimental API for may-happen in parallel parallelism by tkf · Pull Request #39773 · JuliaLang/julia · GitHub,
(f) add the may-happen parallelism bits to the recursive data structures, and finally
(g) revive the CUDA bits.

We’d probably then want to investigate whether @floop/@reduce or some other reduction syntax would be good to add to the language to better support these workflows. Suffice to say, it would be non-trivial, but this would get it done right, and all of this probably would have been needed to really get it all merged in with the 2022 version.
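
To give a flavor of step (a), a purely hypothetical sketch (ImmutableVector is an illustrative name, not an existing package or Base API) of an immutable array type layered on the new Memory type:

# Hypothetical sketch only; requires Julia 1.11+ for Memory, and nothing
# here exists as a real package today.
struct ImmutableVector{T} <: AbstractVector{T}
    mem::Memory{T}
end

function ImmutableVector(xs::AbstractVector{T}) where {T}
    mem = Memory{T}(undef, length(xs))
    copyto!(mem, xs)
    return ImmutableVector{T}(mem)
end

Base.size(v::ImmutableVector) = size(v.mem)
Base.getindex(v::ImmutableVector, i::Int) = v.mem[i]
# No setindex! is defined, so user code cannot mutate the container through it;
# that's the property the escape analysis / effects work in step (c) would exploit.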

And even getting combinations of the pieces to run would require some historic digging to get the right Julia versions and versions of the dependencies. So no, I’m not going to do that right now. Instead I’ll write some grants to see if we can get someone to revive this :sweat_smile:.

Each step along the way would be useful in its own right though.

Futhark is one I missed and it’s a set of very promising ideas. Futhark and Chapel are probably the two that you’d point to as the most successful parallel programming languages today, if you discount Clojure for not supporting GPUs.

8 Likes

I’ve always wondered about this, thanks for explaining. I couldn’t count how many times I’ve checked his github activity to see if he had any new commits… Hope he’s okay

21 Likes

Good thread, but it’s worth pointing out that while Bend takes full advantage of its Haskell-like immutable variables, it’s seen as a limitation, not a feature (GUIDE.md):

Which is immutable. If that sounds annoying, that’s because it is. Don’t let anyone tell you otherwise. We are aware of that, and we have many ideas on how to improve this, making Bend feel even more Python-like. For now, we have to live with it.

The author doesn’t make it very clear what immutable and mutable mean exactly, and Python and Haskell attach very distinct meanings to those words; you could have immutable variables with mutable data structures, for example. Bend doesn’t seem to have a wholly positive outlook on immutable data, but it doesn’t provide mutable data so far:

In CUDA, this can be implemented by using mutable arrays and synchronization primitives. This is well known. What is less known is that it can also be implemented as a series of immutable tree rotations, with pattern-matching and recursion. Don’t bother trying to understand it, but…

Of course, you would absolutely not want to sort numbers like that, specially when mutable arrays exist. But there are many algorithms that can not be implemented easily with buffers. Evolutionary and genetic algorithms, proof checkers, compilers, interpreters. For the first time ever, you can implement these algorithms as high-level functions, in a language that runs on GPUs. That’s the magic of Bend!

Whether that part is really novel has been discussed here already.

I don’t think the “parallelize everything” part is all that bad, if only because it’s very easy to run into what Bend considers to be inherently serial programs:

Traditional loops, on the other hand, are inherently sequential. A loop like:
sum = 0; for i in range(8): sum += i
Is actually just a similar way to write:
sum = (0 + (1 + (2 + (3 + (4 + (5 + (6 + 7)))))))
Which is really bad for parallelism, because the only way to compute this is by evaluating the expressions one after the other, in order

but everyone probably prefers a higher-order function with a nice enough description of its evaluation order to wrestling with operation order by hand.
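
As a tiny illustration of the difference (plain Julia, nothing Bend-specific):

# The left-nested loop from the quote: each += depends on the previous one.
function loop_sum(n)
    s = 0
    for i in 0:n-1
        s += i
    end
    return s
end

# The same computation as a reduction over an associative operation; a
# scheduler is then free to evaluate it as a balanced tree, e.g.
# ((0+1)+(2+3)) + ((4+5)+(6+7)), which is what makes it parallelizable.
tree_sum(n) = reduce(+, 0:n-1)

loop_sum(8) == tree_sum(8) == 28  # true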

Is the proposed solution viable? Consider different versions of a package as distinct entities within the same package class, and mandate that developers define dependencies precisely, down to specific versions.

Thus, if a package, possibly the foundational one, discontinues development, it would not impede updates or further development for all packages branching from it.

To utilize a package, simply appending a version identifier to using [pkg] would suffice, with the system defaulting to the newest version.

In an extreme scenario, the most recent version of Julia could theoretically support a vast range of package versions, stretching from 0.0.0.1 up to 999.9.9.9 inclusively.

I don’t understand the comment. It needs stuff in Julia base itself.

15 posts were split to a new topic: Standardizing syntax?

How close is the Julia ecosystem to achieving what Bend can do?

You can use high-level Julia syntax within a CUDA kernel via CUDA.jl, but as of yet there is no library readily available for writing something like the recursive tree-sort example (using the same recursive algorithm and automatically generating CUDA kernels along all axes of parallelism), though you could do it manually with arrays. As @ChrisRackauckas shared, there are some research projects from various groups pushing towards this, but nothing ready yet. I don’t have a sense of their expected time of completion.
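
To illustrate the first point, a minimal standard CUDA.jl sketch (nothing Bend-like about it, just ordinary Julia syntax compiled to a GPU kernel):

using CUDA

# An elementwise kernel written in plain Julia; CUDA.jl compiles it to native GPU code.
function axpy_kernel!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
y = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=cld(length(y), 256) axpy_kernel!(y, 2f0, x)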

If interested, I would personally view the most viable near-term approach as finding a way to map from Julia constructs to XLA (e.g., see the @wsmoses link) or HVM2. Always good to save some dev resources whenever we can so we can build other cool stuff.

1 Like

Reposting the package we’re working on, since it does precisely that:

Of course, it naturally also has Enzyme(MLIR) integration.

7 Likes

I wonder if there is any website that collects this kind of long-term, high-uncertainty project in the Julia ecosystem. It might help these projects get better recognition, or at least GitHub stars.

Most are not made for long-term maintenance. Most such research projects are first demonstrations, and by the end it’s clear that’s not the right way to do it. Anything useful is then stripped out of them for parts. It’s unreasonable to expect everything anybody tries to be production-ready enough to support even 5 years later: some ideas lead to a good demonstration but ultimately show it’s not a good way to continue.

5 Likes