The Prospero Challenge

I just found the “Prospero Challenge” on Hacker News and thought it’s a really nice toy example for writing a parser and analyzing compile times. Maybe it’s even possible to compile the entire document; the parent repo also looks interesting. I currently don’t have time to give it a try, but if one of you gets nerd-sniped, I’d love to see it to learn a little more :slight_smile:

7 Likes

I know this isn’t the point, but I made a straightforward translation of prospero.vm to Julia syntax

# Text of a monologue from The Tempest
function prospero(x,y)
	_0 = 2.95
	_1 = x
	_2 = 8.13008
...
	_1eb7 = min(_1eaf, _1eb6)
	_1eb8 = max(_1ea9,_1eb7)
	_1eb9 = min(_1ea3, _1eb8)
end

and got these timings with my old version of Julia on my old computer.

julia> @time @eval prospero(0.5,0.5)
  2.622024 seconds (1.92 M allocations: 86.854 MiB, 0.54% gc time, 99.75% compilation time)
-0.028191772642285284



julia> @benchmark prospero($rand(), $rand())
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  2.788 μs …  23.913 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.225 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.286 μs ± 530.315 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▂▃▇█▇▁▁
  ▂▂▂▃▄▇███████▆▄▃▂▂▂▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  2.79 μs         Histogram: frequency by time        5.55 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.



julia> @time [prospero(x,y) for x in range(-1,1,length=1024) for y in range(-1,1,length=1024)];
  3.065433 seconds (100.15 k allocations: 16.418 MiB, 0.32% gc time, 1.70% compilation time)
1 Like

You are benchmarking a loop in global scope. It should only involve a single allocation if you do it right, no?
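A minimal sketch of what I mean, assuming the prospero function from the post above (render is just a throwaway name): the loop lives inside a function, so essentially only the result matrix gets allocated.

function render(f; n = 1024)
    xs = range(-1, 1, length = n)
    ys = range(-1, 1, length = n)
    # ranges don't allocate; the comprehension allocates the n×n result matrix
    return [f(x, y) for x in xs, y in ys]
end

@time render(prospero);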

1 Like

Are we allowed to use Float16? 12 GB is better.

function process_one_op!(v,op,name,args)
    if haskey(v,name)
        broadcast!(op,v[name],v[args[1]],v[args[2]])
    else
        v[name] = broadcast(op,v[args[1]],v[args[2]])
    end
    return nothing
end
function process_single_op!(v,op,name,args)
    if haskey(v,name)
        broadcast!(op,v[name],v[args[1]])
    else
        v[name] = broadcast(op,v[args[1]])
    end
    return nothing
end
function copytoifexists!(v,name,x)
    if haskey(v,name)
        copyto!(v[name],x)
    else
        v[name] = x
    end
    return nothing
end
function const_change!(v,name,c)
    if haskey(v,name)
        v[name] .= c
    else
        v[name] = [c;;]
    end
    return nothing
end
function process_op!(v,op,name,args,T,x,y)
    if op == "var-x"
        copytoifexists!(v,name,x)
    elseif op == "var-y"
        copytoifexists!(v,name,y)
    elseif op == "const"
        const_change!(v,name,parse(T, args[1]))
    elseif op == "add"
        process_one_op!(v,+,name,args) # v[args[1]] .+ v[args[2]]
    elseif op == "sub"
        process_one_op!(v,-,name,args) #v[args[1]] .- v[args[2]]
    elseif op == "mul"
        process_one_op!(v,*,name,args) #v[args[1]] .* v[args[2]]
    elseif op == "max"
        process_one_op!(v,max,name,args) #max.(v[args[1]], v[args[2]])
    elseif op == "min"
        process_one_op!(v,min,name,args) #min.(v[args[1]], v[args[2]])
    elseif op == "neg"
        process_single_op!(v,-,name,args) #-v[args[1]]
    elseif op == "square"
        process_single_op!(v,x->x^2,name,args) #v[args[1]].^2
    elseif op == "sqrt"
        process_single_op!(v,sqrt,name,args) #sqrt.(v[args[1]])
    else
        error("unknown opcode '$op'")
    end
    return nothing
end

function main(file)
    strip_file = strip(read(file, String))
    T = Float16
    image_size = 1024
    space = range(-1.0f0, 1.0f0, image_size)
    x = T[space[j] for i in 1:image_size, j in 1:image_size]
    y = T[-space[i] for i in 1:image_size, j in 1:image_size]
    v = Dict{String, Matrix{T}}()
    final_key = ""
    @fastmath for line in split(strip_file, "\n")
        if startswith(line, "#")
            continue
        end
        tokens = split(line)
        name = tokens[1]  
        op = tokens[2]   
        @views args = tokens[3:end]
        final_key = name  
        process_op!(v,op,name,args,T,x,y)
    end
    out = v[final_key]
    img = ifelse.(out .< 0, UInt8(255), UInt8(0))
    img = permutedims(img)
    open("out.ppm", "w") do f
        write(f, "P5\n$(image_size) $(image_size)\n255\n")
        write(f, vec(img))
    end
    return nothing
end
GC.gc()
@time main("prospero.vm")

I get around 8 s, but it’s really dependent on the machine, I think. Sorry for the naming, it’s a mess :')
GPU translation is easier than ever too (just load x and y on the GPU, make the Dict accept CuMatrix, and collect before writing):

using CUDA

@fastmath function process_one_op!(v,op,name,args)
    if haskey(v,name)
        broadcast!(op,v[name],v[args[1]],v[args[2]])
    else
        v[name] = broadcast(op,v[args[1]],v[args[2]])
    end
    return nothing
end
@fastmath function process_single_op!(v,op,name,args)
    if haskey(v,name)
        broadcast!(op,v[name],v[args[1]])
    else
        v[name] = broadcast(op,v[args[1]])
    end
    return nothing
end
@fastmath function copytoifexists!(v,name,x)
    if haskey(v,name)
        copyto!(v[name],x)
    else
        v[name] = x
    end
    return nothing
end
@fastmath function const_change!(v,name,c)
    if haskey(v,name)
        v[name] .= c
    else
        v[name] = cu([c;;])
    end
    return nothing
end
@fastmath  function process_op!(v,op,name,args,T,x,y)
    if op == "var-x"
        copytoifexists!(v,name,x)
    elseif op == "var-y"
        copytoifexists!(v,name,y)
    elseif op == "const"
        const_change!(v,name,parse(T, args[1]))
    elseif op == "add"
        process_one_op!(v,+,name,args) # v[args[1]] .+ v[args[2]]
    elseif op == "sub"
        process_one_op!(v,-,name,args) #v[args[1]] .- v[args[2]]
    elseif op == "mul"
        process_one_op!(v,*,name,args) #v[args[1]] .* v[args[2]]
    elseif op == "max"
        process_one_op!(v,max,name,args) #max.(v[args[1]], v[args[2]])
    elseif op == "min"
        process_one_op!(v,min,name,args) #min.(v[args[1]], v[args[2]])
    elseif op == "neg"
        process_single_op!(v,-,name,args) #-v[args[1]]
    elseif op == "square"
        process_single_op!(v,x->x^2,name,args) #v[args[1]].^2
    elseif op == "sqrt"
        process_single_op!(v,sqrt,name,args) #sqrt.(v[args[1]])
    else
        error("unknown opcode '$op'")
    end
    return nothing
end

@fastmath  function main(file)
    strip_file = strip(read(file, String))
    T = Float16
    image_size = 1024
    space = range(-1.0f0, 1.0f0, image_size)
    x = T[space[j] for i in 1:image_size, j in 1:image_size] |> cu
    y = T[-space[i] for i in 1:image_size, j in 1:image_size] |> cu
    v = Dict{String, CuMatrix{T}}()
    final_key = ""
    @fastmath for line in split(strip_file, "\n")
        if startswith(line, "#")
            continue
        end
        tokens = split(line)
        name = tokens[1]  
        op = tokens[2]   
        @views args = tokens[3:end]
        final_key = name  
        process_op!(v,op,name,args,T,x,y)
    end
    out = v[final_key]
    img = ifelse.(out .< 0, UInt8(255), UInt8(0)) |> collect
    img = permutedims(img)
    open("out.ppm", "w") do f
        write(f, "P5\n$(image_size) $(image_size)\n255\n")
        write(f, vec(img))
    end
    return nothing
end
GC.gc()
CUDA.@time main("prospero.vm")

and this gives 1.1s (again for me)

  1.120287 seconds (873.03 k CPU allocations: 34.781 MiB, 2.62% gc time) (7.87 k GPU allocations: 12.618 GiB, 1.34% memmgmt time)

It also produces the right image, obviously.

1 Like

I think this is exactly the point. To quote the author of the challenge:

Using LLVM is exactly what you do! (emphasis mine) I think the author is working on efficient compilers for his domain, so an LLVM baseline is certainly nice to have.

Here’s another interesting article: Prospero challenge, now with more garbage collection | Max Bernstein
But I think this “garbage collection” is superfluous when you compile it anyway :slight_smile:

Oh, and I wonder how fast it becomes when you precompile the function and then run it on the GPU, i.e. fusing your and @yolhan_mannes’s approaches. @Jeff_Emanuel, could you dump the function you used to parse the file into that expression?

1 Like

I can submit once we’re sure: GitHub - yolhan83/ProsperoChal: ProsperoChal is an attempt to the Prospero Challenge in julia.

Do you know if people will have to instantiate? I may need to add that to the README.
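(If so, it would presumably just be the usual steps, something like:)

using Pkg
Pkg.activate(".")     # activate the project environment of the cloned repo
Pkg.instantiate()     # install the dependencies from the Project/Manifest (CUDA, BenchmarkTools, ...)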

Also, I’m still not sure I’m not cheating by using Float16.

2 Likes

I made a straightforward one here:

function parseit(filename)
    prog = String[]
    push!(prog, "function(x,y) ")
    for line in eachline(filename)
        (length(line) == 0 || line[1]=='#') && continue
        parts = split(line, ' ')
        var = parts[1]
        op = parts[2]

        if length(parts) > 2
            arg1 = parts[3]
        end
        if length(parts) > 3
            arg2 = parts[4]
        end

        if op=="const"
            push!(prog, string(var, "=", arg1))
        elseif op=="var-x"
            push!(prog, string(var, "=x"))
        elseif op=="var-y"
            push!(prog, string(var, "=y"))
        elseif op=="mul"
            push!(prog, string(var, "=",arg1,"*",arg2))
        elseif op=="add"
            push!(prog, string(var, "=",arg1,"+",arg2))
        elseif op=="sub"
            push!(prog, string(var, "=",arg1,"-",arg2))
        elseif op=="neg"
            push!(prog, string(var, "= -",arg1))
        elseif op=="max"
            push!(prog, string(var, "=ifelse(", arg1, ">", arg2, ",", arg1, ", ", arg2, ")"))
        elseif op=="min"
            push!(prog, string(var, "=ifelse(", arg1, ">", arg2, ",", arg2, ", ", arg1, ")"))
        elseif op=="square"
            push!(prog, string(var, "=", arg1, "*", arg1))
        elseif op=="sqrt"
            push!(prog, string(var, "=sqrt(",arg1,")"))
        else
            error("unknown op: $op /$line/")
        end

    end
    push!(prog, "end")
    pp = Meta.parse(join(prog, ';'))
    @eval $pp
end

function mkimage(f)
    x = range(-1, 1, length=1024)
    y = range(-1, 1, length=1024)
    img = Matrix{Bool}(undef, 1024, 1024)
    Threads.@threads for j in eachindex(y)
        for i in eachindex(x)
            img[i, j] = f(x[i], y[j]) < 0
        end
    end

    return img
end

@time "first parse" f = parseit("prospero.vm");
@time "warm parse" f = parseit("prospero.vm");
@time "first img" mkimage(f);
@time "warm img" mkimage(f);

It says:

first parse: 0.755518 seconds (485.25 k allocations: 23.291 MiB, 2.88% gc time)
warm parse: 0.720931 seconds (485.25 k allocations: 23.291 MiB, 1.63% gc time)
first img: 2.034268 seconds (2.55 M allocations: 114.802 MiB, 2.85% gc time, 1360.18% compilation time)
warm img: 0.160573 seconds (85 allocations: 8.010 MiB)

I haven’t checked that the right image is produced.
Edit: Changed the image to be Bool.

5 Likes

This is great, but I’m not sure I can make it work inside a package because of all the invokelatest calls I’ll need; it seems I can’t use them for GPU code.
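For reference, the issue is that calling a function freshly created by @eval, from code compiled before that function existed, hits the world-age restriction; the usual escape hatch is Base.invokelatest. A minimal sketch, assuming the parseit from above (run_once is just an illustrative wrapper):

function run_once(x, y)
    f = parseit("prospero.vm")          # @eval defines the method in a newer world
    # calling f(x, y) directly here would throw a world-age MethodError
    return Base.invokelatest(f, x, y)   # dispatch in the latest world instead
end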

edit: found a way :wink:
edit2: image is incorrect :cry: (on CPU too)

I changed the image to be Bool. I can spy the matrix to see the image. Perhaps it’s the transpose that’s needed, but then it’s upside down or something.
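If it’s just orientation, plain Base should be enough; a sketch of the kind of fix I mean (my guess, untested):

img = permutedims(img)          # swap the axes so rows become image rows
img = reverse(img; dims = 1)    # and flip vertically if it comes out upside down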

Fixed on CPU, I just need to adapt it to GPU now.
Done; full code:

module ProsperoChal
    using CUDA,BenchmarkTools
    const T = Float16
    const device = CUDA.cu # use identity for cpu

    function parseit(filename)
        prog = String[]
        push!(prog, "@fastmath function(x,y) ")
        for line in eachline(filename)
            (length(line) == 0 || line[1]=='#') && continue
            parts = split(line, ' ')
            var = parts[1]
            op = parts[2]
    
            if length(parts) > 2
                arg1 = parts[3]
            end
            if length(parts) > 3
                arg2 = parts[4]
            end
    
            if op=="const"
                push!(prog, string(var, '=', "T(",arg1,")"))
            elseif op=="var-x"
                push!(prog, string(var, "=x"))
            elseif op=="var-y"
                push!(prog, string(var, "=y"))
            elseif op=="mul"
                push!(prog, string(var, '=',arg1,'*',arg2))
            elseif op=="add"
                push!(prog, string(var, '=',arg1,'+',arg2))
            elseif op=="sub"
                push!(prog, string(var, '=',arg1,'-',arg2))
            elseif op=="neg"
                push!(prog, string(var, "= -",arg1))
            elseif op=="max"
                push!(prog, string(var, "=max(", arg1, ',', arg2, ")"))
            elseif op=="min"
                push!(prog, string(var, "=min(", arg1, ',', arg2, ")"))
            elseif op=="square"
                push!(prog, string(var, '=', arg1, '*', arg1))
            elseif op=="sqrt"
                push!(prog, string(var, "=sqrt(",arg1,")"))
            else
                error("unknown op: $op /$line/")
            end
        end
        push!(prog, "end")
        pp = Meta.parse(join(prog, ';'))
        @eval $pp
    end

    const FUN = parseit("prospero.vm")
    function (@main)(ARGS)
        image_size = ARGS[1]
        image_size = parse(Int, image_size)
        space = range(-one(T), one(T), image_size) |> collect |> device
        O = UInt8(255)
        f(si,sj) = O*(FUN(si,sj) < 0)
        out = f.(space,-space')
        out_cpu= collect(out)
        open("out.ppm", "w") do f
            write(f, "P5\n$(image_size) $(image_size)\n255\n")
            write(f, vec(out_cpu))
        end
        nothing
        return nothing
    end
    function bench_proper(ARGS)
        b = @benchmark main($ARGS)
        display(b)
        return nothing
    end
    function profile_cuda(ARGS)
        main(ARGS)
        display(CUDA.@profile main(ARGS))
        return nothing
    end
    export main, bench_proper, profile_cuda
end # module ProsperoChal

This should definitely be a kernel:

        f(si,sj) = O*(FUN(si,sj) < 0)
        out = f.(space,-space')

but whatever.
I think it’s good as it is? Note that I only time execution and not parsing, since I can’t do the latter because of world-age CUDA issues.

2 Likes

2.8 ms. Not bad.

ProsperoChal.bench_proper(["1024"])
BenchmarkTools.Trial: 1673 samples with 1 evaluation per sample.
 Range (min … max):  2.607 ms …  12.196 ms  ┊ GC (min … max): 0.00% … 69.16%
 Time  (median):     2.769 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.973 ms ± 588.800 μs  ┊ GC (mean ± σ):  4.63% ± 10.48%

  ▁▆█▇▆▅▆▅▃▁               ▂▁       ▁▁                        ▁
  ███████████▇▇▄▇▆▅▄▅▄▄▅▅▆▇██▄▁▅▅▅▆███▇▅▄▅▇▅▅▅▄▄▁▁▅▁▄▅▇▇█▇▅▅▇ █
  2.61 ms      Histogram: log(frequency) by time      4.98 ms <

 Memory estimate: 2.01 MiB, allocs estimate: 256.

You have a better GPU than me for sure; I’m at 10 ms, but yeah, it’s great!

Yes, I’m aware of the perils of benchmarking in global scope, but I was lazy. And yes, it would have been only a single extra line of code. All I wanted to show was that the total runtime for the 1024x1024 case was still about a million times the single-invocation time from @btime, and that even for my suboptimal setup, the CPU and (especially) memory usage were much lower than the author’s baseline Python.

From the author, this thread’s OP, and the discussion on Hacker News, it seems that people were more interested in implementing their own parsers and compiler optimizations. So translating back to a different high-level language that compiles using LLVM seemed like a bit of a cheat, but you are right, it does give an LLVM baseline.

I didn’t write code to translate prospero.vm. I just did a handful of search-and-replace operations in Vim. These can be replicated in Julia with match. For instance, this Vim substitution

:%s/\(_[a-f0-9]*\) add \([^ ]*\) \([^ ]*\)/\1 = \2 + \3/

could be

m = match(r"(_[a-f0-9]*) add ([^ ]*) ([^ ]*)", s)
if m!==nothing
  println("$(m.captures[1]) = $(m.captures[2]) + $(m.captures[3])")
end

Of course, these regular expressions could be more flexible or rigorous, but they work for the example file.

Quite happy with it now: GitHub - yolhan83/ProsperoChal: ProsperoChal is an attempt to the Prospero Challenge in julia. It’s easy to switch between GPU and CPU, and TTFX is better using KernelAbstractions.jl to fuse the operation (f.(space,-space') .< 0).*0xff into only one kernel. I’ll mark this as the solution, but if someone else wants to give it a try, go on!
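For reference, a fused kernel along those lines could look roughly like this with KernelAbstractions.jl (the kernel name, launch code, and buffer setup are my own sketch, not necessarily what’s in the repo):

using KernelAbstractions

@kernel function render!(out, @Const(space), f)
    i, j = @index(Global, NTuple)
    # one fused read-compute-write per pixel: 0xff where the expression is negative
    @inbounds out[i, j] = (f(space[i], -space[j]) < 0) * 0xff
end

# `out` is a 1024×1024 UInt8 buffer on whichever backend was chosen (Array or CuArray)
backend = KernelAbstractions.get_backend(out)
render!(backend)(out, space, FUN; ndrange = size(out))
KernelAbstractions.synchronize(backend)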

4 Likes

I don’t know about GPUs, but on my CPU, using ifelse instead of max and min led to significant performance improvements. ifelse produces cmov or vmax/vmin instructions, whereas min and max may produce jmp instructions, which can cause pipeline stalls. Though I’m not certain what @fastmath does with it; probably it uses ifelse.
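To see it on a scalar (a small sketch; ifmax is just an illustrative name, and the exact instructions depend on your CPU and Julia version):

# branch-free max via ifelse, as emitted by the parser above for "max"
ifmax(a, b) = ifelse(a > b, a, b)

# Base's max for floats also handles NaN and signed zeros, hence the extra branches;
# compare the two in the REPL:
# @code_native debuginfo=:none ifmax(1.0, 2.0)
# @code_native debuginfo=:none max(1.0, 2.0)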

Hey @yolhan_mannes – I tried to reply to your email but it bounced, so I figured I’d reach out here. Below is my proposed blurb for the site; I’ll wait for your confirmation / edits before posting:

Julia metaprogramming
Yolhan Mannes et al, 2025

Github repo | Forum thread (these will be links)

This implementation is a great demonstration of Julia’s flexibility and metaprogramming capabilities. Using relatively straightforward code, the expression is parsed and compiled at runtime, then evaluated on either the GPU or the CPU depending on the selected backend. GPU evaluation is about 10 ms/frame on an NVIDIA GeForce 4060, while the CPU backend takes around 492 ms. Julia uses LLVM for its JIT compiler, so the resulting functions are fast to evaluate but costly to compile; the authors see about 13 seconds of compilation time.

(I think I’m reading the compilation time correctly, from the first Measure-Command example in your repo)

2 Likes

That’s a really nice summary, thank you, it’s perfect!

Okay, it’s now published :smiley:

Thanks for the contribution, it’s always neat to see Julia in action!

1 Like