Very large compilation time (and space)

In a previous post, I mentioned a small interpreter that I wrote both in direct style and in continuation-passing style. In the first case, Julia very quickly compiles the interpreter methods as they are being used. In the second case, however, this process takes inordinate amounts of time even for trivial inputs and, for simple but non-trivial inputs, it does not even finish because it exhausts the memory (32 GB).

Given the size of the interpreter, it was not practical to show it here, so I tried to create an MWE. This was not easy, because even small changes in the interpreter code dramatically changed the compilation time. After quite a bit of effort, I managed to produce a tiny interpreter that already demonstrates the problem, albeit at a smaller scale. Here it is:

eval_expr(expr, env, cont) =
  if expr isa Number || expr isa LineNumberNode
    cont(expr)
  elseif expr isa Symbol
    eval_name(expr, env, cont)
  elseif expr isa Expr && expr.head === :if
    eval_if(expr, env, cont)
  elseif expr isa Expr && expr.head === :block
    eval_begin(expr, env, cont)
  else
    eval_call(expr, env, cont)
  end

eval_name(expr, env, cont) = cont(eval(expr))

eval_if(expr, env, cont) =
  eval_expr(expr.args[1], env, r ->
    r ?
      eval_expr(expr.args[2], env, cont) :
      eval_expr(expr.args[3], env, cont))

eval_begin(expr, env, cont) =
  let exprs = expr.args,
      eval_sequentially(i) =
        i == length(exprs) ?
          eval_expr(exprs[i], env, cont) :
          eval_expr(exprs[i], env, r -> eval_sequentially(i + 1))
    eval_sequentially(1)
  end

eval_call(expr, env, cont) =
  eval_expr(expr.args[1], env, func ->
    eval_exprs(expr.args[2:end], env, args ->
        cont(func(args...))))

eval_exprs(exprs, env, cont) =
  length(exprs) == 0 ?
    cont([]) :
    eval_expr(exprs[1], env, r ->
      eval_exprs(exprs[2:end], env, rs -> cont([r, rs...])))

Small as it is, this continuation-passing-style version already seems to demand too much effort from the Julia compiler, even for trivial inputs. Here are a few examples:

@time eval_expr(Meta.parse("1+2"), false, identity)
@time eval_expr(Meta.parse("1+2*3"), false, identity)
@time eval_expr(Meta.parse("1+2*3/4"), false, identity)
@time eval_expr(Meta.parse("1 > 2 ? 1 : 0"), false, identity)
@time eval_expr(Meta.parse("1+2 < 2+3 ? (3*4; 4*5) : 5*6"), false, identity)

Despite the relatively simple inputs, I get the following times (Julia 1.2.0, Intel i7-6700K, 32 GB RAM):

  1.545639 seconds (5.65 M allocations: 364.698 MiB, 4.91% gc time)
  2.273712 seconds (8.76 M allocations: 573.199 MiB, 5.33% gc time)
  4.096212 seconds (15.80 M allocations: 1.007 GiB, 6.11% gc time)
  6.431456 seconds (23.77 M allocations: 1.533 GiB, 6.70% gc time)
  8.575267 seconds (31.03 M allocations: 1.976 GiB, 5.30% gc time)

This is already getting painful. However, if we redefine the code and then run just the final test, we get:

 19.221149 seconds (71.77 M allocations: 4.613 GiB, 6.80% gc time)

This does not look like an acceptable time for interactive experimentation. And this is just an MWE: on the actual interpreter, compilation does not even finish.

BTW, in my experiments I noticed that the number of branches in the eval_expr function matters a lot: remove just one and compilation times drop significantly. The number of parameters also matters: you might have noticed the dummy env parameter that is never used; remove it and compilation times again drop significantly.
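For concreteness, the second experiment is just the same MWE with the dummy parameter dropped from every signature and call site; here are the first two methods, with the rest changing in exactly the same way:

eval_expr(expr, cont) =
  if expr isa Number || expr isa LineNumberNode
    cont(expr)
  elseif expr isa Symbol
    eval_name(expr, cont)
  elseif expr isa Expr && expr.head === :if
    eval_if(expr, cont)
  elseif expr isa Expr && expr.head === :block
    eval_begin(expr, cont)
  else
    eval_call(expr, cont)
  end

eval_name(expr, cont) = cont(eval(expr))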

I know that I could start injecting type declarations, promoting type stability, and so on, in the hope of improving performance, but I do not want to do that: the interpreter was designed for pedagogical purposes. I want to show my students a direct-style interpreter to which I incrementally add features, such as different scope regimes, short-circuit evaluation, macros, lazy evaluation, and so on. To this end, I want to focus on the evaluation process and not bother my students with details that are only relevant to the Julia compiler. Moreover, I want to do this incrementally, by redefining the relevant parts of the interpreter to accommodate changes in the semantics of the interpreted language.

This approach was working beautifully until I got to the point where I needed to convert the interpreter to continuation-passing style and found the performance problems that I mentioned.

Do you have any suggestions for addressing these problems?
Thanks.


Maybe sprinkle some @nospecialize annotations around?

Instead of sprinkling you could:

module Eval

@nospecialize # applies to all non-typed arguments of the methods that follow

eval code....

end
julia> @time Eval.eval_expr(Meta.parse("1 > 2 ? 1 : 0"), false, identity)
  0.257247 seconds (816.17 k allocations: 48.106 MiB, 2.62% gc time)
0

julia> @time Eval.eval_expr(Meta.parse("1+2 < 2+3 ? (3*4; 4*5) : 5*6"), false, identity)
  0.391338 seconds (1.25 M allocations: 69.237 MiB, 2.11% gc time)
20

The code here serves as a pretty good compiler performance benchmark.

The @nospecialize macro sounds like an excellent idea. However, in my (admittedly limited) testing, it does not seem to have any effect on incrementally defined or redefined methods, an ability I would like to preserve so that students can experiment with changes to the interpreter without having to reload the entire module.

Is there a way of making it work without having to load the entire module in one go?

Thanks.

Maybe I don’t understand the problem you’re having, but to me it seems that you can redefine methods just fine; no different from when there’s no @nospecialize. The code also doesn’t have to live in a module if you prefer not to do that. Also, @nospecialize fundamentally applies to methods, and another way to use it is to just put @nospecialize as the first statement in a method definition (see ? @nospecialize for more options). You don’t have to use the @nospecialize-@specialize block form that @kristoffer.carlsson showed.
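For example, either of these is enough for a single method (using eval_name purely as an illustration; pick one form, not both):

# argument form: mark the individual arguments
eval_name(@nospecialize(expr), @nospecialize(env), @nospecialize(cont)) = cont(eval(expr))

# statement form: first statement in the body, applies to all arguments of this method
function eval_name(expr, env, cont)
    @nospecialize
    cont(eval(expr))
end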

You are right. @nospecialize (almost) works as expected in the MWE I provided.

The reason for my previous wrong assumption is that I started by testing it on the full interpreter, where it did not seem to solve the problem. After further testing, I can confirm that it does not have the redefinition problems I mentioned and that it improves performance, though unfortunately not enough.

As a case in point, these are the timings for evaluating the expression (2 + 3)*(4 + 5) in the full interpreter, without using @nospecialize:

1601.936483 seconds (2.82 G allocations: 187.556 GiB, 15.18% gc time)

When the experiment is repeated with @nospecialize, the performance improves considerably:

  9.314756 seconds (31.40 M allocations: 2.028 GiB, 7.54% gc time)

From 1601.9 s to 9.3 s is a huge improvement, but 9.3 s is still too slow for interactive use, and for more complex examples the times grow considerably. There is also a second problem: with the interpreter in continuation-passing style, Julia repeats the specialization process much more often than it does with the direct-style version. I presume this is caused by the continuation being passed as an argument.
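One quick way to check that presumption: every anonymous function literal gets its own type, so the cont argument arrives with a fresh type at each call site, and with full specialization each of those types triggers its own round of compilation of eval_expr and friends:

f = r -> r + 1
g = r -> r + 1
typeof(f) == typeof(g)   # false: even textually identical closures have distinct types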

To make it easier to test this, I rewrote the MWE in direct style (but preserving the continuation as a dummy parameter just to avoid effects related to a different number of parameters):

eval_expr(expr, env, cont) =
  if expr isa Number || expr isa LineNumberNode
    expr
  elseif expr isa Symbol
    eval_name(expr, env, cont)
  elseif expr isa Expr && expr.head === :if
    eval_if(expr, env, cont)
  elseif expr isa Expr && expr.head === :block
    eval_begin(expr, env, cont)
  else
    eval_call(expr, env, cont)
  end

eval_name(expr, env, cont) = eval(expr)

eval_if(expr, env, cont) =
  eval_expr(expr.args[1], env, cont) ?
    eval_expr(expr.args[2], env, cont) :
    eval_expr(expr.args[3], env, cont)

eval_begin(expr, env, cont) =
  let exprs = expr.args,
      eval_sequentially(i) =
        i == length(exprs) ?
          eval_expr(exprs[i], env, cont) :
          (eval_expr(exprs[i], env, cont); eval_sequentially(i + 1))
    eval_sequentially(1)
  end

eval_call(expr, env, cont) =
  let func = eval_expr(expr.args[1], env, cont),
      args = eval_exprs(expr.args[2:end], env, cont)
    func(args...)
  end

eval_exprs(exprs, env, cont) =
  length(exprs) == 0 ?
    [] :
    [eval_expr(exprs[1], env, cont), eval_exprs(exprs[2:end], env, cont)...]

First, let’s look at the timings using the same trivial tests. To that end, for each case, we restart Julia and evaluate the following expressions:

@time eval_expr(Meta.parse("1+2"), false, identity)
@time eval_expr(Meta.parse("1+2*3"), false, identity)
@time eval_expr(Meta.parse("1+2*3/4"), false, identity)
@time eval_expr(Meta.parse("1 > 2 ? 1 : 0"), false, identity)
@time eval_expr(Meta.parse("1+2 < 2+3 ? (3*4; 4*5) : 5*6"), false, identity)

Continuation-passing style, without @nospecialize:

  1.610372 seconds (5.65 M allocations: 364.698 MiB, 5.11% gc time)
  2.532585 seconds (8.76 M allocations: 573.198 MiB, 6.40% gc time)
  4.251689 seconds (15.80 M allocations: 1.007 GiB, 5.19% gc time)
  6.671415 seconds (23.77 M allocations: 1.533 GiB, 6.70% gc time)
  8.684811 seconds (31.03 M allocations: 1.976 GiB, 5.56% gc time)

Continuation-passing style, with @nospecialize:

  0.112641 seconds (417.58 k allocations: 22.052 MiB, 4.29% gc time)
  0.056031 seconds (209.10 k allocations: 11.338 MiB)
  0.144246 seconds (493.07 k allocations: 27.221 MiB, 2.91% gc time)
  0.173015 seconds (603.53 k allocations: 37.128 MiB, 4.92% gc time)
  0.305922 seconds (1.18 M allocations: 65.529 MiB, 3.82% gc time)

Finally, direct-style, without @nospecialize:

  0.052441 seconds (132.04 k allocations: 6.764 MiB)
  0.000121 seconds (34 allocations: 1.844 KiB)
  0.017284 seconds (41.94 k allocations: 2.191 MiB)
  0.000099 seconds (25 allocations: 1.234 KiB)
  0.010037 seconds (19.96 k allocations: 1.010 MiB)

And, just for completeness, direct-style, with @nospecialize:

  0.033083 seconds (60.24 k allocations: 3.147 MiB)
  0.000120 seconds (40 allocations: 2.031 KiB)
  0.019334 seconds (41.95 k allocations: 2.192 MiB)
  0.000127 seconds (28 allocations: 1.328 KiB)
  0.013354 seconds (23.48 k allocations: 1.129 MiB)

Now, let’s look at the extra specializations done. To that end, I used SnoopCompile again and tested as follows:

println(length(@snoopi eval_expr(Meta.parse("(2 + 3)*(4 + 5)"), false, identity)))
println(length(@snoopi eval_expr(Meta.parse("(2 * 3)/(4 * 6)"), false, identity)))

Note that @snoopi returns an array with the timings related to method compilation, so its length counts the new specializations triggered by each call. The results are:

Continuation-passing style, without @nospecialize:

34
30

Continuation-passing style, with @nospecialize:

16
11

Direct-style, without @nospecialize:

4
0

Direct-style, with @nospecialize:

2
0

Note that, unlike with continuation-passing style, in direct style the second expression is evaluated without any additional specializations.
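(To see which method instances those counts correspond to, the entries returned by @snoopi can also be printed; if I am reading the SnoopCompile docs correctly, each entry is an (inference time, method instance) pair:)

using SnoopCompile
timings = @snoopi eval_expr(Meta.parse("(2 + 3)*(4 + 5)"), false, identity)
for (t, mi) in sort(timings, by = first)   # sort by inference time
    println(round(t, digits = 4), "  ", mi)
end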

I expected the use of CPS in Julia to cause a performance penalty, but I did not expect it to be so large as to make the interpreter unusable. Using @nospecialize reduces the penalty but does not make the problem disappear: the CPS interpreter remains unusable, in particular because it keeps specializing methods over and over.

Suggestions?

Thanks.

You can always interpret your interpreter :wink:

julia> using JuliaInterpreter

julia> @time @interpret eval_expr(Meta.parse("1+2*3"), false, identity)
  0.007653 seconds (25.68 k allocations: 1.172 MiB)
7

julia> @time @interpret eval_expr(Meta.parse("1+2*3/4"), false, identity)
  0.021513 seconds (37.51 k allocations: 1.710 MiB)
2.5

julia> @time @interpret eval_expr(Meta.parse("1 > 2 ? 1 : 0"), false, identity)
  0.006699 seconds (20.38 k allocations: 1.014 MiB)
0

julia> @time @interpret eval_expr(Meta.parse("1+2 < 2+3 ? (3*4; 4*5) : 5*6"), false, identity)
  0.018553 seconds (45.14 k allocations: 1.830 MiB, 23.22% gc time)
20

That is an interesting idea. On my full-scale interpreter it still needs to be combined with @nospecialize, but it is getting better: the expression (2 + 3)*(4 + 5) takes

  3.386258 seconds (7.29 M allocations: 362.361 MiB, 3.10% gc time)

to be evaluated. Still slow, though. But, at least, much better for subsequent evaluations: (2 * 3)/(4 * 6) requires

  0.011368 seconds (62.78 k allocations: 2.427 MiB)

Completely different expressions are also evaluated quickly: ((x,y)->x+y)(1,2), which would not even finish before, now takes:

  0.062913 seconds (152.72 k allocations: 7.497 MiB)

At this point, I might as well extend the direct-style interpreter to support the language features used by the continuation-passing-style interpreter (arrays, tuples, structs, etc.) and use the former to run the latter. It might not run as fast, but at least it would be more pedagogical than having to explain that we need @nospecialize and JuliaInterpreter just to work around the huge computational cost of Julia's method-specialization strategy. In any case, I would love to know the reason for this behavior.

Thanks a lot for all the help.

Yeah, that’s JuliaInterpreter compiling itself a bit :slight_smile:

Check out

using MappedArrays
a = mappedarray(identity, [1])
b = mappedarray(x->x^2, [1])
getfirst(x) = @inbounds x[1]
@code_native debuginfo=:none getfirst(a)
@code_native debuginfo=:none getfirst(b)

The fact that this is optimized is absolutely essential for certain kinds of work. And for it to be optimized means it has to be specialized on the specific function being passed.


Agreed. But that does not explain the time and space complexity of the specialization process.

Is there any characterization of the time/space behavior of the Julia compiler? I know it may depend on several dimensions, such as the number of methods, the number of parameters of a method, the number of different methods, the number of control paths (which is itself exponential in the number of branches), and so on, but it might be the case that the time/space complexity we are seeing with the tiny CPS interpreter I posted can be fixed by replacing some of the algorithms and data structures used by the compiler.

There's always room for work on compiler performance. Your best bet is to start profiling (see ProfileView or StatProfilerHTML). As far as compiler constants go, the main ones are in https://github.com/JuliaLang/julia/blob/master/base/compiler/typelimits.jl.
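For example, something along these lines should show where the compiler spends its time on the first call after (re)defining the methods:

using Profile, ProfileView
Profile.clear()
# the first call triggers inference and codegen, so profiling it profiles the compiler
@profile eval_expr(Meta.parse("1+2 < 2+3 ? (3*4; 4*5) : 5*6"), false, identity)
ProfileView.view()            # flame graph; or Profile.print(maxdepth = 15) for a textual view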
