[ANN] BootlegCassette.jl

Cassette.jl is an incredibly cool package that lets you dictate how functions are dispatched in special execution contexts, but it’s currently not fully functional on Julia 1.6 / master, so I made a minimal package that replicates its main feature and API: contextual dispatch. I call it BootlegCassette.jl. If you don’t need Cassette’s tagging capabilities or some of the more advanced usages of certain APIs, this package may be a drop-in replacement.

I mostly made this for fun and learning, but if anyone wants to try and get more things working or improve performance, I’ll gladly collaborate.

Happy overdubbing!

using BootlegCassette
const Cassette = BootlegCassette

Cassette.@context Ctx 
Cassette.prehook(::Ctx, f, args...) = println(f, args)
Cassette.overdub(Ctx(), /, 1, 2)

sitofp(Float64, 1)
sitofp(Float64, 2)
/(1.0, 2.0)
div_float(1.0, 2.0)

nice name


Does this have any overhead? I’ve stayed away from Cassette for that reason.
If not, can it be used to replace calls to Base.ifelse with IfElse.ifelse, for example?

I’m wondering if it’d be useful for trying to run generic code with SIMD, in SPMD style.

Yes, this actually has greater overhead than Cassette in my experiments. However, there are tricks one can use to avoid that for some use cases. For instance, if I understand correctly, CUDA.jl will basically build up a whole new function by tracing through a function call and inlining everything into one big new function IR where all the dynamic behaviour and such has been excised.

I think such an approach may be feasible for LoopVectorization.jl, but I’m not sure. You’d probably want to use IRTools.jl for this directly, or if you can, wait and see what the AbstractInterpreter interface ends up being like.

Not only can approaches like this replace calls to Base.ifelse, but they can even replace if ... else if you’re devious enough.
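As a minimal sketch of how replacing one call with another looks under contextual dispatch, here is the pattern using Cassette.jl's `@context` and `overdub`. The context name `SinToCosCtx` and the function `g` are made up for illustration. Whether the same pattern carries over unchanged to BootlegCassette, or to swapping `Base.ifelse` for `IfElse.ifelse`, is an assumption that has not been checked here.

```julia
using Cassette

# Hypothetical context name, chosen for this illustration only.
Cassette.@context SinToCosCtx

# Inside SinToCosCtx, every call to `sin` with one argument is rerouted to `cos`.
Cassette.overdub(::SinToCosCtx, ::typeof(sin), x) = cos(x)

# Ordinary code that knows nothing about the context.
g(x) = sin(x) + 1

# Run `g` with the rerouting applied to everything it calls.
result = Cassette.overdub(SinToCosCtx(), g, 0.0)
```

Outside the context, `g(0.0)` still calls `sin` as usual; only calls made through `overdub` see the replacement.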

Okay, I’ll probably wait for AbstractInterpreter, but replacing if ... else ... statements is definitely something I’d like to be able to do.
It’d be tricky even if you have the infrastructure. You’d have to walk both sides and see all assignments to combine them.
Ideally, you also don’t just walk both sides, because maybe you can combine them:

if cond
    y = 3x
    z = exp(y)
else
    z = 3x
    y = exp(z)
end

This could be done in a common subexpression elimination pass, I think.

Yeah, you’re right. Relying on the naive approach:

y1 = 3*x
z1 = exp(y1)
z2 = 3*x
y2 = exp(z2)
y = ifelse(cond, y1, y2)
z = ifelse(cond, z1, z2)
y, z

and relying on CSE is probably the best way to do this. If any problematic/suboptimal examples turn up, we can worry about finding out why and how to fix them then.
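To make the comparison concrete, here is a small sketch in plain Julia with the branching version, the naive `ifelse` version, and the form one would hope to get after common subexpression elimination. The function names `branchy`, `naive`, and `combined` are made up for illustration, and the last function is written by hand; it is not the output of any actual compiler pass.

```julia
# Branching version, as in the example above.
function branchy(cond, x)
    if cond
        y = 3x
        z = exp(y)
    else
        z = 3x
        y = exp(z)
    end
    return y, z
end

# Naive branch-free version: evaluate both sides, then select.
function naive(cond, x)
    y1 = 3x
    z1 = exp(y1)
    z2 = 3x
    y2 = exp(z2)
    y = ifelse(cond, y1, y2)
    z = ifelse(cond, z1, z2)
    return y, z
end

# What CSE could reduce the naive version to: y1 == z2 and z1 == y2,
# so only one multiply and one exp remain.
function combined(cond, x)
    t = 3x
    e = exp(t)
    return ifelse(cond, t, e), ifelse(cond, e, t)
end
```

All three return the same pair for a given `cond` and `x`; the difference is only in how much work is done and whether a branch is needed.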

Have you guys seen
The goal of this package is exactly to enable SPMD for arbitrary programs. Unfortunately, it appears to have been abandoned.

There was a short discussion about if statements in

I knew of the library (and that it was abandoned), but hadn’t looked for a while or seen your issue, thanks.
Out of curiosity, have you seen this series of blog posts?
I can confirm that for a few benchmarks, ISPC was many times faster than a comparable Julia program. I’d like to fix that :wink:, but more generally, it’d be great to make packages like MonteCarloMeasurements faster, as well as use this to SIMD DifferentialEquations ensemble solves (and obviously LoopVectorization wants this too).

Still, it seems like Hydra is probably advanced enough to at least look at for a starting point.


I had not seen it before, but having read the linked entry, I find it quite interesting.

And God help you when they release a new version of the compiler with changes to the auto-vectorizer’s implementation.

I’ve felt this before, along with similar issues around @inferred and the like, so Julia is not free from it. I guess anyone relying on an auto-vectorizer or type inference should really consider contributing meaningful tests and benchmarks to Julia.

I also enjoyed the story on the PR adding ARM support in the ispc repo :slight_smile: I can imagine the awkward feelings for everyone involved!


I was trying to figure out something for this exact thing today (among other stuff) and eventually was led to this post. If/when there is more info on using all the magic in “julia/base/compiler/”, I’m probably going to disappear from life for a week while I digest it.

Just an update, Cassette.jl has been updated to work on 1.6 and nightly! https://github.com/JuliaLabs/Cassette.jl/releases/tag/v0.3.4

Bravo @simeonschaub!


This is really cool in its own right, because it is Cassette built on IRTools.

Have you done any benchmarking?


Yes, it’s pretty slow and I’m not exactly sure what’s going on. Here’s an interesting example of something that’s wrong and causing a performance problem:

using BenchmarkTools  # provides @btime
using BootlegCassette
using BootlegCassette: @context, overdub

@context Ctx

let x = Ref(1), y = Ref(2)
    @btime overdub(Ctx(), *, $x[], $y[])
end

  16.292 ns (0 allocations: 0 bytes)

Investigating the typed code doesn’t reveal anything amiss:

@code_typed overdub(Ctx(), *, 1, 2)

CodeInfo(
1 ─ %1 = Base.getfield(args, 3)::Int64
│   %2 = Base.getfield(args, 4)::Int64
│   %3 = (Core.Intrinsics.mul_int)(%1, %2)::Int64
└──      return %3
) => Int64

but the LLVM code tells a different story:

@code_llvm overdub(Ctx(), *, 1, 2)

;  @ /home/mason/.julia/packages/IRTools/aSVI5/src/reflection/dynamo.jl:114 within `overdub'
define nonnull {}* @japi3_overdub_1970({}* %0, {}** %1, i32 %2, {}** %3) #0 {
  %4 = alloca {}**, align 8
  store volatile {}** %1, {}*** %4, align 8
  %5 = icmp ugt i32 %2, 2
  br i1 %5, label %pass, label %fail

fail:                                             ; preds = %top
  %6 = sext i32 %2 to i64
  call void @jl_bounds_error_tuple_int({}** %1, i64 %6, i64 3)

pass:                                             ; preds = %top
  %.not = icmp eq i32 %2, 3
  br i1 %.not, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  call void @jl_bounds_error_tuple_int({}** %1, i64 3, i64 4)

pass2:                                            ; preds = %pass
  %7 = getelementptr inbounds {}*, {}** %1, i64 2
  %8 = bitcast {}** %7 to i64**
  %9 = load i64*, i64** %8, align 8
  %10 = getelementptr inbounds {}*, {}** %1, i64 3
  %11 = bitcast {}** %10 to i64**
  %12 = load i64*, i64** %11, align 8
;  @ /home/mason/.julia/packages/IRTools/aSVI5/src/reflection/dynamo.jl within `overdub' @ int.jl:88
; ┌ @ /home/mason/Dropbox/Julia/BootlegCassette/src/BootlegCassette.jl:49 within `overdub_pass'
; │┌ @ /home/mason/Dropbox/Julia/BootlegCassette/src/BootlegCassette.jl:55 within `overdub'
    %13 = load i64, i64* %9, align 8
    %14 = load i64, i64* %12, align 8
    %15 = mul i64 %14, %13
    %16 = call nonnull {}* @jl_box_int64(i64 signext %15)
    ret {}* %16
; └└

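Somehow, LLVM doesn’t seem to know how many arguments the function takes: the arguments arrive as an array of boxed values with a runtime count, the count is bounds-checked before each argument is loaded, and the result is boxed again on return.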

Did you try args::Vararg{Any,K} instead of args... to force specialization on the number of arguments (in the definition of overdub)?
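For readers unfamiliar with the suggestion, this is roughly what the two signatures look like in plain Julia. The function names are made up for illustration and this is not the actual definition of `overdub`. Julia's heuristics may skip specializing on a splatted `args...` that is only passed through to another call, and adding the `N` type parameter is the usual way to ask for specialization on the argument count.

```julia
# Plain splat: Julia may choose not to specialize on the number of
# arguments when `args` is only forwarded to another function.
passthrough(f, args...) = f(args...)

# Forced specialization: `N` (the argument count) and `F` (the function
# type) are method type parameters, so a specialization is compiled for
# each combination that is actually called.
passthrough_specialized(f::F, args::Vararg{Any,N}) where {F,N} = f(args...)

# The count is available as a compile-time constant inside the method.
nargs(args::Vararg{Any,N}) where {N} = N
```

Both `passthrough` variants return the same values; any difference shows up only in the generated code, which can be compared with `@code_llvm`.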

Yes and it didn’t help, however that led me to discover this:

using IRTools
IRTools.@dynamo foo(x::Int, y::String, z::Symbol, w::Int...) = nothing
methods(foo)


 # 1 method for generic function "foo":
 [1] foo(args...) in Main at /home/mason/.julia/packages/IRTools/aSVI5/src/reflection/dynamo.jl:114

So it may be an issue with how @dynamo is creating the function.


Looks like it’s time to hack on IRTools then :wink:
Hopefully changing how it generates the function solves it.


Can I ask where I can read about AbstractInterpreter?