[ANN] MonkeyLang.jl

Just implemented Writing an Interpreter in Go in Julia.

I think I have learned quite a lot during this process, and I would like to share my project MonkeyLang.jl with the community.

22 Likes

Question: have you implemented (if not, will you implement) macros? Cf. The Lost Chapter.

And what is your feeling about how it compares with other (Go? Rust? …) implementations (e.g. ease of writing, execution speed, …)?

Just curiosity, no need to do anything. Thanks for the work and for sharing!

3 Likes

The Lost Chapter is definitely on my schedule, but I have not started yet.

I have not done any benchmarking. But there are a few points I would like to mention:

  • Julia’s multiple dispatch works like a charm when it comes to evaluating AST nodes (see the sketch after this list).
  • With Julia’s abstract type graph, I do not need to implement dummy methods (to distinguish expressions from statements) as the Go version does.
  • The implementation of Hash is greatly simplified compared to the Go version, since Julia’s native hash() function works well in all cases I have encountered so far.
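
To give an idea, here is a simplified sketch of the dispatch-based evaluation (illustrative types, not the actual MonkeyLang.jl source):

abstract type Node end
abstract type Expression <: Node end
abstract type Statement <: Node end  # the type graph itself separates the two kinds

struct IntegerLiteral <: Expression
    value::Int64
end

struct InfixExpression <: Expression
    operator::String
    left::Expression
    right::Expression
end

# One evaluate method per node type; Julia dispatches to the right one.
evaluate(node::IntegerLiteral) = node.value

function evaluate(node::InfixExpression)
    left, right = evaluate(node.left), evaluate(node.right)
    node.operator == "+" ? left + right :
    node.operator == "-" ? left - right :
    error("unknown operator: $(node.operator)")
end

evaluate(InfixExpression("+", IntegerLiteral(1), IntegerLiteral(2)))  # => 3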
3 Likes

By the way, Copilot worked surprisingly well during the whole process. I think it must have learned a lot from other implementations of “Writing an Interpreter in Go”, since it yielded the exact test cases even before I typed them in.

2 Likes

I am proud to say that I have finished the Lost Chapter. After implementing it, I feel much more confident about digging deeper into Julia’s macro system.

3 Likes

Bravo for the speed in completing the Chapter!

2 Likes

Update:

In the compiler branch, I have implemented the compiler book (Writing a Compiler in Go), but the current benchmark shows that the VM version runs 3x slower than the evaluator version. For fib(35), the evaluator takes ~27s, while the VM takes ~81s. (As a reference, the author’s Go version is ~27s vs. ~9s.)

The major difference from the book is that I used variable-length vectors (push!/pop!) as the VM stack instead of fixed-size arrays, roughly as in the sketch below. Do you have any ideas or suggestions? @jdad
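
A simplified sketch of the difference (illustrative types, not the actual vm.jl code):

abstract type Object end

struct IntegerObj <: Object
    value::Int64
end

# Book-style: a preallocated fixed-size stack plus an explicit stack pointer.
const STACK_SIZE = 2048
stack = Vector{Object}(undef, STACK_SIZE)
sp = 0
stack[sp += 1] = IntegerObj(1)  # "push"
top = stack[sp]; sp -= 1        # "pop"

# What I did instead: a variable-length Vector that manages its own size.
vstack = Object[]
push!(vstack, IntegerObj(1))
vtop = pop!(vstack)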

Thanks for the news (and bravo for completing The Book) and for the ping. At first glance, I see two points in what you report:

  1. The « compiled » (Julia) version, when executed, is 3 times slower than the interpreted (Julia) one. This is concerning, but I wonder whether the interpreted version here behaves more like AOT-compiled code, so that with Julia it becomes relatively fast, maybe faster than the VM with its associated overhead. But I would defer thinking about that until after checking the second point.

  2. I understand that on your machine (laptop), the interpreters in Go and in Julia have similar speed. However, moving to the bytecode VM, the one coded in Go is 3 times faster than its interpreter, whereas the one in Julia is 3 times slower. Do I understand correctly? If yes, then I would suggest comparing both VMs on a small fixed input, e.g. the bytecode stream for fib(35), and then trying to produce an MWE so that specialists from this forum could take a look. A first step could be to profile the code and, if possible, compare with the Go version (I do not know if/how to get a Go profile view).

Other specialists from this forum could comment on the hypothesis that mutable objects are inherently slower than immutable ones. You could check allocations as an easy first step (see the related Julia command-line option) before checking the profile.
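
For instance, something along these lines (just a sketch, using the run entry point from your compiler branch; adapt the program/size as needed):

# Quick first pass: allocations via @time, then a sampled profile.
# (The command-line route would be: julia --track-allocation=user script.jl)
using Profile
using MonkeyLang

const m = MonkeyLang

input = """
let fibonacci = fn(x) { if (x < 2) { x } else { fibonacci(x - 1) + fibonacci(x - 2) } };
fibonacci(20);
"""

m.run(input)                  # warm up, so compilation is excluded
@time m.run(input)            # prints total allocations alongside the timing
@profile m.run(input)         # sample the run ...
Profile.print(maxdepth = 15)  # ... and print the hottest call paths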

1 Like

Here are the benchmark results and the allocation data. I added a Julia-native version for comparison.

using BenchmarkTools
using MonkeyLang

const m = MonkeyLang

fib(x) = begin
    if x == 0
        0
    else
        if x == 1
            1
        else
            fib(x - 1) + fib(x - 2)
        end
    end
end

input = """
let fibonacci = fn(x) {
    if (x == 0) {
        0
    } else {
        if (x == 1) {
            return 1;
        } else {
            fibonacci(x - 1) + fibonacci(x - 2);
        }
    }
};

fibonacci(35);
"""

println("=== Julia native ===")
@btime fib(35)

println("=== Using evaluator ===")
@btime m.evaluate($input)

println("=== Using compiler ===")
@btime m.run($input)

julia --project=. benchmark/fibonacci.jl
=== Julia native ===
  39.129 ms (0 allocations: 0 bytes)
=== Using evaluator ===
  24.790 s (550245834 allocations: 36.68 GiB)
=== Using compiler ===
  76.079 s (1256843411 allocations: 38.35 GiB)

fib(15) yields similar performance gaps:

julia --project=. benchmark/fibonacci.jl
=== Julia native ===
  2.642 ΞΌs (0 allocations: 0 bytes)
=== Using evaluator ===
  1.509 ms (37494 allocations: 2.52 MiB)
=== Using compiler ===
  4.759 ms (84391 allocations: 2.65 MiB)

I had a bit of a browse through the code you’ve got there on the compiler branch in https://github.com/lucifer1004/MonkeyLang.jl/blob/compiler/src/vm.jl

I’d be surprised if using push! and pop! on Vector{Object} in vm.stack is causing any performance problems.

I’d be less surprised if all the boxing, unboxing, and dynamic dispatch on Object subtypes is quite slow. It seems the inner loop generally involves getting some Object operands from the stack, dynamically dispatching to the correct method for those, allocating a new Object subtype as output, and storing this back on the stack. And all of this is much more work than is actually done on the content of the object once it has been unboxed, so it is pretty much a worst case for stressing Julia’s dynamic dispatch and GC!

This doesn’t explain why the bytecode VM would be slower than the tree interpreter though. If both are written against the same object model I’d expect both to suffer the same problems of heavy dynamic dispatch and GC load.
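
To make that concrete, here is roughly what a single binary op looks like under such an object model (illustrative types, not the actual vm.jl code):

abstract type Object end

struct IntegerObj <: Object
    value::Int64
end

add_objects(left::IntegerObj, right::IntegerObj) = IntegerObj(left.value + right.value)

function execute_add!(stack::Vector{Object})
    right = pop!(stack)                # abstract-typed load from the stack
    left = pop!(stack)                 # abstract-typed load from the stack
    result = add_objects(left, right)  # dynamic dispatch on the runtime types
    push!(stack, result)               # the result is a freshly allocated box
end

stack = Object[IntegerObj(1), IntegerObj(2)]
execute_add!(stack)  # stack now holds [IntegerObj(3)]; each op allocates a new box

The integer addition itself is a single instruction; the two abstract-typed loads, the dispatch, and the fresh allocation around it are all overhead.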

3 Likes

Thanks, @c42f!

As you said at the end, it is not the performance gap between Julia and other languages that I am concerned about, but the gap between the interpreter and the bytecode VM, and I have been trying to figure this out. I did find that some bounds checking led to a ~10% performance loss, but obviously that is not the major problem.

Concerning the performance loss caused by dynamic dispatch, what would you suggest as an alternative to the object model? In the original Go version, the author used a long switch ... case ... statement to dispatch to the correct method. I chose the current way because I thought it would be more Julian.

This is definitely what I’d try. The dispatch is critical for performance here and there’s no great need for it to be extensible in terms of adding new methods or Object subtypes. So it makes sense to control the dispatch yourself and specialize it to the use case rather than relying on Julia’s very general dispatch mechanisms.

I definitely agree. The approach you have makes great use of Julia’s dispatch facilities.

Though, as a couple of cheeky counterpoints:

  • Reusing the host language’s (extremely!) sophisticated dispatch facilities to implement the dispatch in a toy language feels a bit like cheating. A core part of understanding a bytecode interpreter is to understand exactly how dispatch works.
  • It’s also very Julian to want the best performance and be dissatisfied until we’re within a small multiple of C speed, or beating it :smiley:
1 Like

In the VM version, the bottleneck is the actual execution of the bytecode, not the compilation. However, in vm.jl, the only use of multiple dispatch is the call!() function, which applies to both ClosureObj and Builtin. I just rewrote this part as a simple if ... else ..., but the performance showed no improvement.

So I guess the use of multiple dispatch might not be the key issue here.

I thought I spotted various cases of dynamic dispatch, for example dispatch on the Object subtype in execute_binary_operation!.

But when you’re working with Vector{Object}, most things become dynamic dispatch. For example:

julia> stack = Vector{MonkeyLang.Object}()
MonkeyLang.Object[]

julia> push!(stack, MonkeyLang.IntegerObj(10))
1-element Vector{MonkeyLang.Object}:
 10

julia> function f(stack)
           obj = stack[1]
           obj.value
       end

Because obj is only known to be of the abstract type Object inside f, everything you do with it turns into dynamic dispatch. For example, the . in obj.value becomes getproperty(obj, :value), which is a dynamic dispatch. You can see this with @code_warntype:

(screenshot of the @code_warntype output)
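
(In case the image doesn’t load: it is the output of

julia> @code_warntype f(stack)

where the result of the getproperty call is reported as Any.)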

Another way to see some of these issues is to use the amazing JET.jl:

julia> JET.report_opt(MonkeyLang.run!, (MonkeyLang.VM,))

# ... massive list of cases

β”Œ @ /home/chris/.julia/dev/MonkeyLang/src/vm.jl:208 Base.setindex!(%1396, %1395)
β”‚ runtime dispatch detected: Base.setindex!(%1396::Ref{Int64}, %1395::Int64)
└───────────────────────────────────────────────────
β”Œ @ /home/chris/.julia/dev/MonkeyLang/src/vm.jl:209 MonkeyLang.push!(vm, %1366)
β”‚ runtime dispatch detected: MonkeyLang.push!(vm::MonkeyLang.VM, %1366::Any)
└───────────────────────────────────────────────────
β”Œ @ /home/chris/.julia/dev/MonkeyLang/src/vm.jl:212 Base.setindex!(%1438, %1437)
β”‚ runtime dispatch detected: Base.setindex!(%1438::Ref{Int64}, %1437::Int64)
└───────────────────────────────────────────────────

Perhaps some of these cases can’t be helped, but if there are opcodes that expect only certain object types, adding some type assertions or branches on isa checks may help a lot. For example, if we change f as follows, the output type becomes inferrable:

julia> function f(stack)
           obj = stack[1]
           if obj isa MonkeyLang.IntegerObj
               obj.value
           else
               0
           end
       end
1 Like

I noted the very high number of allocations in your fib(35) test

so I did a track-allocation run of the compiled version:

time julia --project --track-allocation=user tfib35d.jl

After some processing of the *.mem files (warning: they are created in the directory of each source file, with the extension .pid.mem, where pid is the PID of the Julia process; I had to gather them into a single directory before processing), I get:

11066672288             const_id = read_uint16(ins[ip+1:ip+2]) + 1               |  vm.jl.408.mem
6688797472             pos = read_uint16(ins[ip+1:ip+2])                         |  vm.jl.408.mem
6415057712         vm.stack[vm.sp[]] = obj                                       |  vm.jl.408.mem
6323812608             execute_binary_operation!(vm, op, left, right)            |  vm.jl.408.mem
4742859504                 push!(vm, vm.constants[const_id])                     |  vm.jl.408.mem
4026202560                 push!(vm, vm.stack[frame.base_ptr+local_id])          |  vm.jl.408.mem
1433313696             push!(vm, return_value)                                   |  vm.jl.408.mem
477771248         frame = Frame(cl, vm.sp[] - arg_count)                         |  vm.jl.408.mem
    13440         ch, state = l.next                                             |  lexer.jl.408.mem
     5088     while isspace(read_char(l))                                        |  lexer.jl.408.mem
     3520         push!(chars, read_char(l))                                     |  lexer.jl.408.mem
...

So I would encourage you to take a look at those lines - my first idea would be to put a @view before each array access (or @views before the full line or the RHS?).

Hope this helps,

2 Likes

FWIW, I also did some profiling with PProf. Maybe there are some insights in it as well?

2 Likes

This is an excellent place to start improving things. The problem here is that ins[ip+1:ip+2] creates a temporary Instructions object (which allocates a temporary codes array) which is then passed to read_uint16. This one is easy to fix by rewriting read_uint16 slightly:

function read_uint16(ins, ip)
    # Read a big-endian UInt16 directly from the codes array, with no temporary slice.
    return (UInt16(ins.codes[ip+1]) << 8) | ins.codes[ip+2]
end
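
The call sites from the allocation report above would then change along these lines:

# Before: the slice allocates a temporary Instructions/codes array per opcode.
const_id = read_uint16(ins[ip+1:ip+2]) + 1

# After: index directly into the existing codes array, no temporary.
const_id = read_uint16(ins, ip) + 1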
3 Likes

You are right.

This change reduced allocations by ~40%, and reduced run time by ~20%.

@jdad @c42f I think I have found a key point.

In the current implementation, I used many Ref{Int} fields to avoid using mutable structs. However, after I turned the stack pointer of the VM from Ref{Int} to Int, the run time was reduced by 50%!!

After I further changed the pointer of Frame from Ref{Int} to Int, the run time was reduced by another 60%!!
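
In sketch form (simplified, with illustrative field names rather than the actual structs), the change is essentially:

# Before: immutable struct, with the stack pointer boxed behind a Ref
# (an extra heap object plus a pointer dereference on every access).
struct VMWithRef
    stack::Vector{Any}
    sp::Ref{Int}
end

bump!(vm::VMWithRef) = (vm.sp[] += 1)

# After: mutable struct with a plain Int field; sp is stored inline.
mutable struct VMWithInt
    stack::Vector{Any}
    sp::Int
end

bump!(vm::VMWithInt) = (vm.sp += 1)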

Here are the latest benchmark results of fib(35):

=== Julia native ===
  38.751 ms (0 allocations: 0 bytes)
=== Using evaluator ===
  24.024 s (550245831 allocations: 36.68 GiB)
=== Using compiler ===
  8.417 s (371081870 allocations: 7.75 GiB)
1 Like