[ANN] MonkeyLang.jl

lucifer1004 · January 15, 2022, 12:29pm

Just implemented Writing an Interpreter in GO in Julia.

I think I have learned quite a lot during this process, and I would like to share my project MonkeyLang.jl with the community.

jdad · January 16, 2022, 7:08am

Question : have you implemented (if not, will you implement) macros? cf. The Lost Chapter.

And, what is your feeling about how it compares with other (Go? Rust? …) implementations (eg. facility of writing, execution speed, …).

Just curiosity, no need to do anything. Thanks for the work and the sharing!

lucifer1004 · January 16, 2022, 8:00am

The Lost Chapter is definitely on my schedule, but I have not started yet.

I have not done any benchmarking. But there are a few points I would like to mention:

Julia’s multiple dispatches work like a charm when it comes to evaluating AST nodes.
With Julia’s abstract type graph, I do not need to implement dummy methods (to distinguish expressions with statements) as the GO version does.
The implementation of Hash is greatly simplified compared to the GO version, since Julia’s native hash() function works well in all cases I have met till now.

lucifer1004 · January 16, 2022, 10:46am

By the way, CoPilot worked surprisingly during the whole process. I think it must have learned a lot from other implementations of “Writing an Interpreter in GO”, since it yielded the exact test cases even before I typed them in.

lucifer1004 · January 16, 2022, 2:57pm

I am proud to say that I have finished the Lost Chapter. After implementing it, I am much more confident to dig deeper into Julia’s macro system now.

jdad · January 16, 2022, 3:37pm

Bravo for the speed for completing the Chapter!

lucifer1004 · January 27, 2022, 6:15am

Update:

In the compiler branch, I have implemented the Compiler book, but the current benchmark shows that the VM version runs 3x slower than the evaluator version. For fib(35), the evaluator takes ~27s, while the VM takes ~81s. (As a reference, the author’s GO version is ~27s v.s. ~9s.)

The major difference from the book is that I used mutable-length vectors as VM stacks. Do you have any ideas or suggestions? @jdad

jdad · January 27, 2022, 8:08am

Thanks for the news (bravo for completing The Book) and the pinging. At first glance I see two points in what you report

The « compiled » (Julia version) when executed is 3 times slower than the interpreted (Julia) one. This is concerning, but I wonder if in fact interpreted is here similar to AOT executed, so with Julia it becomes relatively fast, maybe faster that the VM with associated overhead. But I would defer thinking about that after checking second point,
I understand that on your machine (laptop), interpreter in Go and in Julia have similar speed. However, going to the byte code VM, the one coded in Go in 3 times faster, whereas the one in Julia is 3 times slower. Do I understand correctly? If yes, then I would suggest to try to compare both VM on a small fixed input, e.g. byte code stream for fib(35), and then try to have an MWE so that specialists from this forum could take a look. First step could be to profile the code, and why not if possible compare to Go version.( I do not know if/how to get Go profile view).

Other specialists from this forum could comment about the hypothesis that mutable objects are inherently slower that unmutables ones. You could check allocations as an easy first step (see Julia command line related option), before checking the profile ?

lucifer1004 · January 27, 2022, 8:21am

Attached is a benchmarking result and the allocation data. Added a Julia-native version for comparison.

using BenchmarkTools
using MonkeyLang

const m = MonkeyLang

fib(x) = begin
    if x == 0
        0
    else
        if x == 1
            1
        else
            fib(x - 1) + fib(x - 2)
        end
    end
end

input = """
let fibonacci = fn(x) {
    if (x == 0) {
        0
    } else {
        if (x == 1) {
            return 1;
        } else {
            fibonacci(x - 1) + fibonacci(x - 2);
        }
    }
};

fibonacci(35);
"""

println("=== Julia native ===")
@btime fib(35)

println("=== Using evaluator ===")
@btime m.evaluate($input)

println("=== Using compiler ===")
@btime m.run($input)

julia --project=. benchmark/fibonacci.jl
=== Julia native ===
  39.129 ms (0 allocations: 0 bytes)
=== Using evaluator ===
  24.790 s (550245834 allocations: 36.68 GiB)
=== Using compiler ===
  76.079 s (1256843411 allocations: 38.35 GiB)

lucifer1004 · January 27, 2022, 8:25am

fib(15) yields similar performance gaps:

julia --project=. benchmark/fibonacci.jl
=== Julia native ===
  2.642 μs (0 allocations: 0 bytes)
=== Using evaluator ===
  1.509 ms (37494 allocations: 2.52 MiB)
=== Using compiler ===
  4.759 ms (84391 allocations: 2.65 MiB)

c42f · January 27, 2022, 10:20am

I had a bit of a browse through the code you’ve got there on the compiler branch in https://github.com/lucifer1004/MonkeyLang.jl/blob/compiler/src/vm.jl

I’d be surprised if using push! and pop! on Vector{Object} in vm.stack is causing any performance problems.

I’d be less surprised if all the boxing, unboxing and dynamic dispatch on Object subtypes might be quite slow. It seems the inner loop generally involves getting some Object operands from the stack, dynamic dispatch to the correct method for those, allocation of a new Object subtype as output and storing this back on the stack. And all of this is much more work than is actually done on the content of the object once it’s been unboxed, so pretty much a worst case for stressing Julia’s dynamic dispatch and GC!

This doesn’t explain why the bytecode VM would be slower than the tree interpreter though. If both are written against the same object model I’d expect both to suffer the same problems of heavy dynamic dispatch and GC load.

lucifer1004 · January 27, 2022, 10:37am

Thanks, @c42f!

As you said lastly, it is not the performance gap between Julia and other languages am I concerned about, but the gap between the interpreter and the bytecode VM. And I have been trying to figure this out. I did find that some bound-checking led to ~10% performance loss, but obviously, that is not the major problem.

Concerning the performance loss caused by the dynamic dispatch, what would you suggest as an alternative to the object model? In the original GO version, the author used a long switch ... case ... condition to dispatch to the correct method. I chose the current way because I thought it would be more Julian.

c42f · January 27, 2022, 11:54am

This is definitely what I’d try. The dispatch is critical for performance here and there’s no great need for it to be extensible in terms of adding new methods or Object subtypes. So it makes sense to control the dispatch yourself and specialize it to the use case rather than relying on Julia’s very general dispatch mechanisms.

I definitely agree. The approach you have makes great use of Julia’s dispatch facilities.

Though as a couple of cheeky counterpoints

Reusing the host language’s (extremely!) sophisticated dispatch facilities to implement the dispatch in a toy language feels a bit like cheating. A core part of understanding a bytecode interpreter is to understand exactly how dispatch works.
It’s also very Julian to want the best performance and be dissatisfied until we’re within a small multiple of C speed, or beating it

lucifer1004 · January 27, 2022, 12:13pm

In the VM version, the bottleneck is the actual execution of the bytecode, not the compilation. However, in vm.jl, the only use of multiple dispatches is the call!() function which applies to both ClosureObj and Builtin. I just rewrote this part as a simple if ... else ... clause, however, the performance showed no improvements.

So I guess the use of multiple dispatches might not be the key issue here.

c42f · January 27, 2022, 8:28pm

I thought I spotted various of cases of dynamic dispatch, for example dispatch on the Object subtype in execute_binary_operation!.

But when you’re working with Vector{Object} most things become dynamic dispatch. For example

julia> stack = Vector{MonkeyLang.Object}()
MonkeyLang.Object[]

julia> push!(stack, MonkeyLang.IntegerObj(10))
1-element Vector{MonkeyLang.Object}:
 10

julia> function f(stack)
           obj = stack[1]
           obj.value
       end

Because obj::Object is an abstract type in f, everything you do with it turns into dynamic dispatch. For example the . in obj.value becomes getproperty(obj, :value) which is a dynamic dispatch. You can see this with @code_warntype:

2022-01-28-061706_595x254_scrot

Another way to see some of these issues is to use the amazing JET.jl:

julia> JET.report_opt(MonkeyLang.run!, (MonkeyLang.VM,))

# ... massive list of cases

┌ @ /home/chris/.julia/dev/MonkeyLang/src/vm.jl:208 Base.setindex!(%1396, %1395)
│ runtime dispatch detected: Base.setindex!(%1396::Ref{Int64}, %1395::Int64)
└───────────────────────────────────────────────────
┌ @ /home/chris/.julia/dev/MonkeyLang/src/vm.jl:209 MonkeyLang.push!(vm, %1366)
│ runtime dispatch detected: MonkeyLang.push!(vm::MonkeyLang.VM, %1366::Any)
└───────────────────────────────────────────────────
┌ @ /home/chris/.julia/dev/MonkeyLang/src/vm.jl:212 Base.setindex!(%1438, %1437)
│ runtime dispatch detected: Base.setindex!(%1438::Ref{Int64}, %1437::Int64)
└───────────────────────────────────────────────────

Perhaps some of these cases can’t be helped, but if there’s opcodes which expect only certain object types, adding some type assertions or branches on isa checks may help a lot. For example, if we change f as follows, the output type is inferrable:

julia> function f(stack)
           obj = stack[1]
           if obj isa MonkeyLang.IntegerObj
               obj.value
           else
               0
           end
       end

jdad · January 28, 2022, 12:31pm

I noted the very high number of allocations in your fib(35) test

so I did a track-allocation run for the compile run
time julia --project --track-allocation=user tfib35d.jl

After some processing of the *.mem files (warning, they are created in the directory of each source file, with extension .pid.mem where pid is the PID of the julia process; I had to repatriate them to a single directory before processing), I get

11066672288             const_id = read_uint16(ins[ip+1:ip+2]) + 1               |  vm.jl.408.mem
6688797472             pos = read_uint16(ins[ip+1:ip+2])                         |  vm.jl.408.mem
6415057712         vm.stack[vm.sp[]] = obj                                       |  vm.jl.408.mem
6323812608             execute_binary_operation!(vm, op, left, right)            |  vm.jl.408.mem
4742859504                 push!(vm, vm.constants[const_id])                     |  vm.jl.408.mem
4026202560                 push!(vm, vm.stack[frame.base_ptr+local_id])          |  vm.jl.408.mem
1433313696             push!(vm, return_value)                                   |  vm.jl.408.mem
477771248         frame = Frame(cl, vm.sp[] - arg_count)                         |  vm.jl.408.mem
    13440         ch, state = l.next                                             |  lexer.jl.408.mem
     5088     while isspace(read_char(l))                                        |  lexer.jl.408.mem
     3520         push!(chars, read_char(l))                                     |  lexer.jl.408.mem
...

So I would encourage you to take a look to those lines - my first idea would be to put a @view before each access to array (or @views before the full line or rhs?)

Hope this help,

jdad · January 28, 2022, 1:34pm

FWIW, I also did a profiling with PProf. Maybe there are also insights in it ?

c42f · January 28, 2022, 10:47pm

This is an excellent place to start improving things. The problem here is that ins[ip+1:ip+2] creates a temporary Instructions object (which allocates a temporary codes array) which is then passed to read_uint16. This one is easy to fix by rewriting read_uint16 slightly:

function read_uint16(ins, ip)
    return (UInt16(ins.codes[ip+1]) << 8) | ins.codes[ip+2]
end

lucifer1004 · January 28, 2022, 11:24pm

You are right.

This change reduced allocations by ~40%, and reduced run time by ~20%.

lucifer1004 · January 29, 2022, 3:13pm

@jdad @c42f I think I have found a key point.

In the current implementation, I used many Ref{Int} to avoid using mutable struct. However, after I turned the stack pointer of VM from Ref{Int} to Int, the run time was reduced by 50%!!

After I further changed the pointer of Frame from Ref{Int} to Int, the run time was reduced by another 60%!!

Here are the latest benchmark results of fib(35):

=== Julia native ===
  38.751 ms (0 allocations: 0 bytes)
=== Using evaluator ===
  24.024 s (550245831 allocations: 36.68 GiB)
=== Using compiler ===
  8.417 s (371081870 allocations: 7.75 GiB)

Topic		Replies	Views
[ANN] MonkeyLang.jl v0.2 -- Now with CLI + Mutable variables + While loop Package Announcements	8	795	February 3, 2022
Does Julia Create a "1.5" Language Problem? Teaching & Outreach question , discussion	133	5804	January 1, 2024
Programming Language Benchmark 2 Performance	25	3409	April 8, 2024
Yet another language benchmark Performance benchmark	9	918	June 15, 2025
Julia programs now shown on benchmarks game website Community announcement	144	13729	December 3, 2019

[ANN] MonkeyLang.jl

Related topics