I just found the “Prospero Challenge” on Hacker News and thought it’s a really nice toy example for writing a parser and analyzing compile times. Maybe it’s even possible to compile the entire document; the parent repo also looks interesting. I currently don’t have time to give it a try, but if one of you gets nerd-sniped, I’d love to see it and learn a little more.
I know this isn’t the point, but I made a straightforward translation of prospero.vm to Julia syntax:
# Text of a monologue from The Tempest
function prospero(x,y)
_0 = 2.95
_1 = x
_2 = 8.13008
...
_1eb7 = min(_1eaf, _1eb6)
_1eb8 = max(_1ea9,_1eb7)
_1eb9 = min(_1ea3, _1eb8)
end
and got these timings with my old version of Julia on my old computer.
julia> @time @eval prospero(0.5,0.5)
2.622024 seconds (1.92 M allocations: 86.854 MiB, 0.54% gc time, 99.75% compilation time)
-0.028191772642285284
julia> @benchmark prospero($rand(), $rand())
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 2.788 μs … 23.913 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.225 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.286 μs ± 530.315 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃▇█▇▁▁
▂▂▂▃▄▇███████▆▄▃▂▂▂▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
2.79 μs Histogram: frequency by time 5.55 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @time [prospero(x,y) for x in range(-1,1,length=1024) for y in range(-1,1,length=1024)];
3.065433 seconds (100.15 k allocations: 16.418 MiB, 0.32% gc time, 1.70% compilation time)
You are benchmarking a loop in global scope. It should only involve a single allocation if you do it right, no?
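For illustration, a minimal sketch of the fix (the `render` helper here is hypothetical, not code from the thread): putting the grid evaluation inside a function lets Julia infer everything and reduces the work to a single output allocation.

```julia
# Hypothetical sketch: moving the grid evaluation into a function avoids
# global-scope overhead; `f` stands in for any compiled scalar function.
function render(f, n)
    xs = range(-1, 1, length=n)
    ys = range(-1, 1, length=n)
    out = Matrix{Float64}(undef, n, n)  # the single allocation
    for j in eachindex(ys), i in eachindex(xs)
        out[i, j] = f(xs[i], ys[j])
    end
    return out
end
```

Then `@time render(prospero, 1024)` would time just the evaluation, with one allocation for `out`.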
Are we allowed to use Float16? 12 GB is better
# Binary elementwise op: overwrite the register in place if it already exists
function process_one_op!(v,op,name,args)
if haskey(v,name)
broadcast!(op,v[name],v[args[1]],v[args[2]])
else
v[name] = broadcast(op,v[args[1]],v[args[2]])
end
return nothing
end
# Unary elementwise op, same reuse-in-place strategy
function process_single_op!(v,op,name,args)
if haskey(v,name)
broadcast!(op,v[name],v[args[1]])
else
v[name] = broadcast(op,v[args[1]])
end
return nothing
end
# var-x / var-y: copy the coordinate grid into the register
function copytoifexists!(v,name,x)
if haskey(v,name)
copyto!(v[name],x)
else
v[name] = x
end
return nothing
end
# const: fill the register with a scalar (a 1x1 matrix broadcasts as needed)
function const_change!(v,name,c)
if haskey(v,name)
v[name] .= c
else
v[name] = [c;;]
end
return nothing
end
function process_op!(v,op,name,args,T,x,y)
if op == "var-x"
copytoifexists!(v,name,x)
elseif op == "var-y"
copytoifexists!(v,name,y)
elseif op == "const"
const_change!(v,name,parse(T, args[1]))
elseif op == "add"
process_one_op!(v,+,name,args) # v[args[1]] .+ v[args[2]]
elseif op == "sub"
process_one_op!(v,-,name,args) #v[args[1]] .- v[args[2]]
elseif op == "mul"
process_one_op!(v,*,name,args) #v[args[1]] .* v[args[2]]
elseif op == "max"
process_one_op!(v,max,name,args) #max.(v[args[1]], v[args[2]])
elseif op == "min"
process_one_op!(v,min,name,args) #min.(v[args[1]], v[args[2]])
elseif op == "neg"
process_single_op!(v,-,name,args) #-v[args[1]]
elseif op == "square"
process_single_op!(v,x->x^2,name,args) #v[args[1]].^2
elseif op == "sqrt"
process_single_op!(v,sqrt,name,args) #sqrt.(v[args[1]])
else
error("unknown opcode '$op'")
end
return nothing
end
function main(file)
strip_file = strip(read(file, String))
T = Float16
image_size = 1024
space = range(-1.0f0, 1.0f0, image_size)
x = T[space[j] for i in 1:image_size, j in 1:image_size]
y = T[-space[i] for i in 1:image_size, j in 1:image_size]
v = Dict{String, Matrix{T}}()
final_key = ""
@fastmath for line in split(strip_file, "\n")
if startswith(line, "#")
continue
end
tokens = split(line)
name = tokens[1]
op = tokens[2]
@views args = tokens[3:end]
final_key = name
process_op!(v,op,name,args,T,x,y)
end
out = v[final_key]
img = ifelse.(out .< 0, UInt8(255), UInt8(0))
img = permutedims(img)
open("out.ppm", "w") do f
write(f, "P5\n$(image_size) $(image_size)\n255\n")
write(f, vec(img))
end
return nothing
end
GC.gc()
@time main("prospero.vm")
I get around 8 s, but it’s really dependent on the machine, I think. Sorry for the naming, it’s a mess :')
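To make the register-machine idea concrete, here’s a minimal scalar sketch of the same interpreter: one Float64 per register instead of one matrix. The instruction format (name, opcode, arguments) is inferred from the ops handled above; the five-line program is made up for illustration, not taken from prospero.vm.

```julia
# Tiny scalar version of the interpreter. Each line is "name op args...";
# the value of the last-assigned register is the result.
function eval_program(src, x, y)
    v = Dict{String,Float64}()
    last = ""
    for line in split(strip(src), '\n')
        startswith(line, "#") && continue
        t = split(line)
        name, op = t[1], t[2]
        if op == "var-x"      v[name] = x
        elseif op == "var-y"  v[name] = y
        elseif op == "const"  v[name] = parse(Float64, t[3])
        elseif op == "add"    v[name] = v[t[3]] + v[t[4]]
        elseif op == "sub"    v[name] = v[t[3]] - v[t[4]]
        elseif op == "mul"    v[name] = v[t[3]] * v[t[4]]
        elseif op == "min"    v[name] = min(v[t[3]], v[t[4]])
        elseif op == "max"    v[name] = max(v[t[3]], v[t[4]])
        elseif op == "neg"    v[name] = -v[t[3]]
        elseif op == "square" v[name] = v[t[3]]^2
        elseif op == "sqrt"   v[name] = sqrt(v[t[3]])
        else error("unknown opcode '$op'")
        end
        last = name
    end
    return v[last]
end

# Hypothetical five-instruction program: x*y - 0.25
prog = """
_0 var-x
_1 var-y
_2 mul _0 _1
_3 const 0.25
_4 sub _2 _3
"""
```

The matrix version above does exactly this, except every register holds the whole 1024×1024 grid and each op is a broadcast.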
A GPU translation is easier than ever too (just load x and y on the GPU, make the Dict accept CuMatrix, and collect before writing):
using CUDA
@fastmath function process_one_op!(v,op,name,args)
if haskey(v,name)
broadcast!(op,v[name],v[args[1]],v[args[2]])
else
v[name] = broadcast(op,v[args[1]],v[args[2]])
end
return nothing
end
@fastmath function process_single_op!(v,op,name,args)
if haskey(v,name)
broadcast!(op,v[name],v[args[1]])
else
v[name] = broadcast(op,v[args[1]])
end
return nothing
end
@fastmath function copytoifexists!(v,name,x)
if haskey(v,name)
copyto!(v[name],x)
else
v[name] = x
end
return nothing
end
@fastmath function const_change!(v,name,c)
if haskey(v,name)
v[name] .= c
else
v[name] = cu([c;;])
end
return nothing
end
@fastmath function process_op!(v,op,name,args,T,x,y)
if op == "var-x"
copytoifexists!(v,name,x)
elseif op == "var-y"
copytoifexists!(v,name,y)
elseif op == "const"
const_change!(v,name,parse(T, args[1]))
elseif op == "add"
process_one_op!(v,+,name,args) # v[args[1]] .+ v[args[2]]
elseif op == "sub"
process_one_op!(v,-,name,args) #v[args[1]] .- v[args[2]]
elseif op == "mul"
process_one_op!(v,*,name,args) #v[args[1]] .* v[args[2]]
elseif op == "max"
process_one_op!(v,max,name,args) #max.(v[args[1]], v[args[2]])
elseif op == "min"
process_one_op!(v,min,name,args) #min.(v[args[1]], v[args[2]])
elseif op == "neg"
process_single_op!(v,-,name,args) #-v[args[1]]
elseif op == "square"
process_single_op!(v,x->x^2,name,args) #v[args[1]].^2
elseif op == "sqrt"
process_single_op!(v,sqrt,name,args) #sqrt.(v[args[1]])
else
error("unknown opcode '$op'")
end
return nothing
end
@fastmath function main(file)
strip_file = strip(read(file, String))
T = Float16
image_size = 1024
space = range(-1.0f0, 1.0f0, image_size)
x = T[space[j] for i in 1:image_size, j in 1:image_size] |> cu
y = T[-space[i] for i in 1:image_size, j in 1:image_size] |> cu
v = Dict{String, CuMatrix{T}}()
final_key = ""
@fastmath for line in split(strip_file, "\n")
if startswith(line, "#")
continue
end
tokens = split(line)
name = tokens[1]
op = tokens[2]
@views args = tokens[3:end]
final_key = name
process_op!(v,op,name,args,T,x,y)
end
out = v[final_key]
img = ifelse.(out .< 0, UInt8(255), UInt8(0)) |> collect
img = permutedims(img)
open("out.ppm", "w") do f
write(f, "P5\n$(image_size) $(image_size)\n255\n")
write(f, vec(img))
end
return nothing
end
GC.gc()
CUDA.@time main("prospero.vm")
and this gives 1.1s (again for me)
1.120287 seconds (873.03 k CPU allocations: 34.781 MiB, 2.62% gc time) (7.87 k GPU allocations: 12.618 GiB, 1.34% memmgmt time)
It also leads to the right image obviously.
I think this is exactly the point. To quote the author of the challenge:
Using LLVM is exactly what you do! (emphasis mine) I think the author is working on efficient compilers for his domain, so an LLVM baseline is certainly nice to have.
Here’s another interesting article: Prospero challenge, now with more garbage collection | Max Bernstein
But I think this “garbage collection” is superfluous when you compile it anyway
Oh, and I wonder how fast it becomes when you precompile the function and then run it on the GPU, i.e. fusing your approach with @yolhan_mannes’s. @Jeff_Emanuel, could you dump the function you used to parse the file into that expression?
I can submit once we’re sure: GitHub - yolhan83/ProsperoChal: ProsperoChal is an attempt to the Prospero Challenge in julia.
Do you know if people will have to instantiate? I may need to add that to the readme.
Also, I’m still not sure I’m not cheating by using Float16.
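For what it’s worth, the precision gap is easy to quantify: Float16 carries only 11 significand bits, so its resolution near the ±1 coordinate range is about 1e-3, which happens to be roughly half a pixel on a 1024-wide grid.

```julia
# Machine epsilon at 1.0 for the two float types, compared to the pixel
# width of a 1024-wide [-1, 1] grid.
eps(Float16(1.0))  # 2^-10
eps(Float32(1.0))  # 2^-23
2 / 1024           # pixel width
```

So Float16 rounding is on the order of the pixel spacing itself, which is why the thresholded image can still come out right even if individual values are coarse.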
I made a straightforward one here:
function parseit(filename)
prog = String[]
push!(prog, "function(x,y) ")
for line in eachline(filename)
(length(line) == 0 || line[1]=='#') && continue
parts = split(line, ' ')
var = parts[1]
op = parts[2]
if length(parts) > 2
arg1 = parts[3]
end
if length(parts) > 3
arg2 = parts[4]
end
if op=="const"
push!(prog, string(var, "=", arg1))
elseif op=="var-x"
push!(prog, string(var, "=x"))
elseif op=="var-y"
push!(prog, string(var, "=y"))
elseif op=="mul"
push!(prog, string(var, "=",arg1,"*",arg2))
elseif op=="add"
push!(prog, string(var, "=",arg1,"+",arg2))
elseif op=="sub"
push!(prog, string(var, "=",arg1,"-",arg2))
elseif op=="neg"
push!(prog, string(var, "= -",arg1))
elseif op=="max"
push!(prog, string(var, "=ifelse(", arg1, ">", arg2, ",", arg1, ", ", arg2, ")"))
elseif op=="min"
push!(prog, string(var, "=ifelse(", arg1, ">", arg2, ",", arg2, ", ", arg1, ")"))
elseif op=="square"
push!(prog, string(var, "=", arg1, "*", arg1))
elseif op=="sqrt"
push!(prog, string(var, "=sqrt(",arg1,")"))
else
error("unknown op: $op /$line/")
end
end
push!(prog, "end")
pp = Meta.parse(join(prog, ';'))
@eval $pp
end
function mkimage(f)
x = range(-1, 1, length=1024)
y = range(-1, 1, length=1024)
img = Matrix{Bool}(undef, 1024, 1024)
Threads.@threads for j in eachindex(y)
for i in eachindex(x)
img[i, j] = f(x[i], y[j]) < 0
end
end
return img
end
@time "first parse" f = parseit("prospero.vm");
@time "warm parse" f = parseit("prospero.vm");
@time "first img" mkimage(f);
@time "warm img" mkimage(f);
It says:
first parse: 0.755518 seconds (485.25 k allocations: 23.291 MiB, 2.88% gc time)
warm parse: 0.720931 seconds (485.25 k allocations: 23.291 MiB, 1.63% gc time)
first img: 2.034268 seconds (2.55 M allocations: 114.802 MiB, 2.85% gc time, 1360.18% compilation time)
warm img: 0.160573 seconds (85 allocations: 8.010 MiB)
I haven’t checked that the right image is produced.
Edit: Changed the image to be Bool.
This is great, but I’m not sure I can make it work inside a package because of all the Base.invokelatest calls I’d need besides; it seems they can’t be used on the GPU.
edit: found a way
edit2: image is incorrect (on CPU too)
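For anyone following along, the world-age problem looks roughly like this (a toy illustration, not the package code): a function defined via eval at runtime can’t be called directly from already-compiled code, but Base.invokelatest can call it.

```julia
# Toy illustration of the world-age issue with eval'd functions.
make_f() = @eval (x, y) -> x * y   # defines a method in a newer "world"

function use_f()
    f = make_f()
    # A direct f(3, 4) here would throw a world-age MethodError, because
    # use_f was compiled before f's method existed; invokelatest avoids it.
    return Base.invokelatest(f, 3, 4)
end
```

This is why the package defines `const FUN = parseit("prospero.vm")` at top level: by the time `main` is compiled, the eval’d function already exists.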
I changed the image to be Bool. I can spy the matrix to see the image. Perhaps it’s the transpose that’s needed, but then it’s upside down or something.
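The orientation issue comes down to storage order: P5 writes pixels row by row, while Julia arrays are column-major, so vec walks down columns unless you permutedims first. A minimal check:

```julia
# PPM P5 expects row-major pixel order; Julia matrices are column-major,
# so vec(A) emits columns first unless the matrix is transposed.
A = [1 2; 3 4]
vec(A)               # [1, 3, 2, 4]  — columns first (wrong order for PPM)
vec(permutedims(A))  # [1, 2, 3, 4]  — rows first (what PPM expects)
```

That’s the role of the `permutedims(img)` before writing; the remaining flip comes from whether y increases upward or downward.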
Fixed on CPU, I just need to adapt it to the GPU now.
Done, full code:
module ProsperoChal
using CUDA,BenchmarkTools
const T = Float16
const device = CUDA.cu # use identity for cpu
function parseit(filename)
prog = String[]
push!(prog, "@fastmath function(x,y) ")
for line in eachline(filename)
(length(line) == 0 || line[1]=='#') && continue
parts = split(line, ' ')
var = parts[1]
op = parts[2]
if length(parts) > 2
arg1 = parts[3]
end
if length(parts) > 3
arg2 = parts[4]
end
if op=="const"
push!(prog, string(var, '=', "T(",arg1,")"))
elseif op=="var-x"
push!(prog, string(var, "=x"))
elseif op=="var-y"
push!(prog, string(var, "=y"))
elseif op=="mul"
push!(prog, string(var, '=',arg1,'*',arg2))
elseif op=="add"
push!(prog, string(var, '=',arg1,'+',arg2))
elseif op=="sub"
push!(prog, string(var, '=',arg1,'-',arg2))
elseif op=="neg"
push!(prog, string(var, "= -",arg1))
elseif op=="max"
push!(prog, string(var, "=max(", arg1, ',', arg2, ")"))
elseif op=="min"
push!(prog, string(var, "=min(", arg1, ',', arg2, ")"))
elseif op=="square"
push!(prog, string(var, '=', arg1, '*', arg1))
elseif op=="sqrt"
push!(prog, string(var, "=sqrt(",arg1,")"))
else
error("unknown op: $op /$line/")
end
end
push!(prog, "end")
pp = Meta.parse(join(prog, ';'))
@eval $pp
end
const FUN = parseit("prospero.vm")
function (@main)(ARGS)
image_size = ARGS[1]
image_size = parse(Int, image_size)
space = range(-one(T), one(T), image_size) |> collect |> device
O = UInt8(255)
f(si,sj) = O*(FUN(si,sj) < 0)
out = f.(space,-space')
out_cpu= collect(out)
open("out.ppm", "w") do f
write(f, "P5\n$(image_size) $(image_size)\n255\n")
write(f, vec(out_cpu))
end
return nothing
end
function bench_proper(ARGS)
b = @benchmark main($ARGS)
display(b)
return nothing
end
function profile_cuda(ARGS)
main(ARGS)
display(CUDA.@profile main(ARGS))
return nothing
end
export main, bench_proper, profile_cuda
end # module ProsperoChal
This should definitely be a kernel:
f(si,sj) = O*(FUN(si,sj) < 0)
out = f.(space,-space')
but whatever.
I think it’s good as it is? Note that I only time the execution and not the parsing, since I can’t because of world-age CUDA issues.
2.8 ms. Not bad.
ProsperoChal.bench_proper(["1024"])
BenchmarkTools.Trial: 1673 samples with 1 evaluation per sample.
Range (min … max): 2.607 ms … 12.196 ms ┊ GC (min … max): 0.00% … 69.16%
Time (median): 2.769 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.973 ms ± 588.800 μs ┊ GC (mean ± σ): 4.63% ± 10.48%
▁▆█▇▆▅▆▅▃▁ ▂▁ ▁▁ ▁
███████████▇▇▄▇▆▅▄▅▄▄▅▅▆▇██▄▁▅▅▅▆███▇▅▄▅▇▅▅▅▄▄▁▁▅▁▄▅▇▇█▇▅▅▇ █
2.61 ms Histogram: log(frequency) by time 4.98 ms <
Memory estimate: 2.01 MiB, allocs estimate: 256.
You have a better GPU than me for sure, I’m at 10 ms, but yeah, it’s great!
Yes, I’m aware of the perils of benchmarking in global scope, but I was lazy. And yes, it would have been only a single extra line of code. All I wanted to show was that the total runtime for the 1024x1024 case was still about a million times the single invocation time from @btime and that even for my suboptimal setup, the CPU and (especially memory) usage were much lower than the author’s baseline Python.
From the author, this thread’s OP, and the discussion on Hacker News, it seems more that people were interested in implementing their own parsers and compiler optimizations. So translating back to a different high-level language that compiles using LLVM seemed like a bit of a cheat, but you are right, it does give an LLVM baseline.
I didn’t write code to translate prospero.vm. I just did a handful of search-and-replace operations in vim. These can be replicated in Julia with match. For instance, this vim substitution
:%s/\(_[a-f0-9]*\) add \([^ ]*\) \([^ ]*\)/\1 = \2 + \3/
could be
m = match(r"(_[a-f0-9]*) add ([^ ]*) ([^ ]*)", s)
if m!==nothing
println("$(m.captures[1]) = $(m.captures[2]) + $(m.captures[3])")
end
Of course, these regular expressions could be more flexible or rigorous, but they work for the example file.
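If one wanted to script the whole translation rather than hand-apply the substitutions, a generalization with one capture-group pattern per arity might look like this (the `BINOPS` table and `translate` helper are hypothetical, not the vim session from the post):

```julia
# Hypothetical regex-based translator for the binary ops; lines that
# don't match (comments, unary ops) are passed through unchanged.
const BINOPS = Dict("add" => "+", "sub" => "-", "mul" => "*")

function translate(line)
    m = match(r"^(_[0-9a-f]+) (add|sub|mul) (\S+) (\S+)$", line)
    m === nothing && return line
    op = BINOPS[m.captures[2]]
    return string(m.captures[1], " = ", m.captures[3], " ", op, " ", m.captures[4])
end
```

For example, `translate("_2 add _0 _1")` gives `"_2 = _0 + _1"`, matching what the vim substitution produces.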
Quite happy with it now: GitHub - yolhan83/ProsperoChal: ProsperoChal is an attempt to the Prospero Challenge in julia. It’s easy to switch between GPU and CPU, with better TTFX, using KernelAbstractions.jl to fuse the operation (f.(space,-space') .< 0).*0xff
into only one kernel. I mark this as the solution, but if someone else wants to give it a try, go on!
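As a rough broadcast-level sketch of the same fusion (not the KernelAbstractions code in the repo): pulling the comparison and the 0xff scaling into the broadcasted function itself guarantees a single elementwise pass, since only one dotted call remains.

```julia
# Sketch: one broadcast = one kernel launch on GPU array backends.
# `pixel` folds the threshold and scaling into the same pass as `f`;
# Ref(f) makes the function broadcast as a scalar.
pixel(f, x, y) = UInt8(f(x, y) < 0) * 0xff
render(f, space) = pixel.(Ref(f), space, -space')
```

With a CuArray `space` this stays a single fused kernel, same as the `f.(space,-space')` expression in `main` but with the postprocessing included.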
I don’t know about GPUs, but on my CPU, using ifelse instead of max and min led to significant performance improvements. ifelse produces cmov or vmax/vmin instructions, whereas min and max may produce jmp instructions, which can cause pipeline stalls. Though I’m not certain what @fastmath does with it, probably using ifelse.
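The semantic difference is worth noting: branchless ifelse variants like the ones below (hypothetical helper names) agree with min/max on ordinary floats but handle NaN differently, which is part of why Base’s versions can be slower.

```julia
# Branchless min/max via ifelse: both arguments are evaluated and one is
# selected, which lowers to cmov / vector min-max rather than a branch.
bmin(a, b) = ifelse(a > b, b, a)
bmax(a, b) = ifelse(a > b, a, b)

bmin(2.0, 3.0)  # 2.0
bmax(2.0, 3.0)  # 3.0
bmin(1.0, NaN)  # 1.0 — whereas Base.min(1.0, NaN) propagates the NaN
```

For this workload the distance fields shouldn’t produce NaNs, so the relaxed semantics are harmless, which is also the kind of liberty @fastmath takes.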
Hey @yolhan_mannes – I tried to reply to your email but it bounced, so I figured I’d reach out here. Below is my proposed blurb for the site; I’ll wait for your confirmation / edits before posting:
Julia metaprogramming
Yolhan Mannes et al., 2025. Github repo | Forum thread (these will be links)
This implementation is a great demonstration of Julia’s flexibility and metaprogramming capabilities. Using relatively straightforward code, the expression is parsed and compiled at runtime, then evaluated either on the GPU or the CPU depending on the selected backend. GPU evaluation is about 10 ms/frame on an NVIDIA GeForce 4060, while the CPU backend takes around 492 ms. Julia uses LLVM for its JIT compiler, so the resulting functions are fast to evaluate but costly to compile; the authors see about 13 seconds of compilation time.
(I think I’m reading the compilation time correctly, from the first Measure-Command example in your repo)
That’s a really nice summary, thank you, it’s perfect!
Okay, it’s now published
Thanks for the contribution, it’s always neat to see Julia in action!