Same performance for type-stable and type-unstable code

I’m writing some text explaining the importance of having type-stable code in Julia. I made the simple example of global variables:

y = 2.4

function bar(x)
    res = zero(x)
    for i in 1:1000
        res += y * x
    end
    return res
end

and the benchmark returned

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.333 μs …  22.394 ms  ┊ GC (min … max): 0.00% … 99.79%
 Time  (median):     28.462 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.535 μs ± 224.046 μs  ┊ GC (mean ± σ):  8.26% ±  2.05%

         ▂█▇▅▃                               ▄▅▅▅▂              
  ▂▂▃▃▂▃▄██████▄▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▂▂▂▃▄▆██████▅▃▃▂▂▂▂▂▂▂▂▂ ▃
  18.3 μs         Histogram: frequency by time           34 μs <

 Memory estimate: 46.88 KiB, allocs estimate: 3000.

While for the type stable case (using const)

const z = 2.4

function bar(x)
    res = zero(x)
    for i in 1:1000
        res += z * x
    end
    return res
end

And the benchmark

BenchmarkTools.Trial: 10000 samples with 235 evaluations.
 Range (min … max):  317.043 ns … 408.153 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     318.221 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   319.706 ns ±   5.002 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▂▁█▅          ▃▂▁                                       ▂▁ ▂ ▁
  █████▅▁▃▃▄▃▃▆▇█████▇▇▆▆▆▆▅▅▄▁▅▃▄▄▄▃▄▁▄▄▄▁▃▃▅▄▄▄▅▃▃▁▄▄▄▅▅▅██▇█ █
  317 ns        Histogram: log(frequency) by time        337 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Which is a good result.
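For reference, the two variants can be compared side by side in one session (a sketch: the functions and globals are renamed here so both can coexist, and `Base.return_types` shows what inference concludes for each):

```julia
# Sketch of the two variants above, renamed so they coexist in one session.
y_global = 2.4          # plain global: its type could change at any time
const z_const = 2.4     # const global: the compiler can rely on Float64

function bar_unstable(x)
    res = zero(x)
    for i in 1:1000
        res += y_global * x   # y_global is inferred as Any: boxing + dispatch
    end
    return res
end

function bar_stable(x)
    res = zero(x)
    for i in 1:1000
        res += z_const * x    # fully inferred, allocation-free loop
    end
    return res
end

# Inference sees the difference directly: the unstable version returns Any,
# the stable one a concrete Float64.
println(Base.return_types(bar_unstable, (Float64,)))
println(Base.return_types(bar_stable, (Float64,)))
```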

Now, I want to extend the demonstration with a more complicated function involving structs and LinearAlgebra operations on them. But I noticed that the difference in performance shrinks so much that the two approaches are basically equivalent. Here is a very simple example where I simply invert a global matrix.

test_bad = rand(100, 100);

function test_inv_bad()
    return inv(test_bad)
end

@benchmark test_inv_bad()


BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  172.951 μs …  12.548 ms  ┊ GC (min … max): 0.00% … 98.00%
 Time  (median):     244.284 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   259.085 μs ± 147.885 μs  ┊ GC (mean ± σ):  1.13% ±  2.99%

                  ▆▂                           ▁█▃               
  ▃▄▂▂▂▁▁▁▁▁▁▁▁▂▂▃██▅▃▂▂▂▂▄▇▅▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃███▄▂▂▂▂▂▃▂▂▂▂▂▂ ▃
  173 μs           Histogram: frequency by time          330 μs <

 Memory estimate: 129.17 KiB, allocs estimate: 5.

and

const test = rand(100, 100);

function test_inv()
    return inv(test)
end

@benchmark test_inv()

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  198.517 μs …  12.616 ms  ┊ GC (min … max): 0.00% … 97.69%
 Time  (median):     213.271 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   228.217 μs ± 142.781 μs  ┊ GC (mean ± σ):  1.12% ±  2.82%

          ▅█▃                                                    
  ▁▁▁▂▂▂▁▄███▇▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▃▃▂▂▃▃▂▂▁▁▁▁▁▁▁ ▂
  199 μs           Histogram: frequency by time          286 μs <

 Memory estimate: 129.17 KiB, allocs estimate: 5.

I guess that there is a runtime dispatch before calling inv, after which the type of the matrix to invert is known, so the code can be compiled (and run fast) from that point onwards.

But then, how to show the loss of performance using LinearAlgebra routines when using type-unstable functions?

Runtime dispatch itself takes extra time, but as you’re showing it can be outweighed by the time taken to do other things you have to do. That’s why easy runtime dispatch is a feature.

Something your test doesn’t measure is the downstream effect of type instability (the compiler failing to infer the concrete runtime types, which is often impossible to infer from how it is written). inv(test_bad) will have an abstract inferred type, so the following code inside the method that relies on its output will also be type-unstable and have to do runtime dispatches. Sometimes it’s possible for type conversions or separately compiled function calls (aka function barriers) to isolate the runtime dispatches and recover type stability in the following code.
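The function-barrier idea mentioned above can be sketched like this (hypothetical names; `no_barrier` leaves the instability inline, while `with_barrier` pushes the work behind a separately compiled call):

```julia
using LinearAlgebra            # inv for matrices

data = rand(100, 100)          # untyped global: inferred as Any inside functions

# Inner kernel: once called, its argument has a concrete runtime type,
# so a specialized, type-stable method is compiled for Matrix{Float64}.
sum_of_inverse(A) = sum(inv(A))

function no_barrier()
    # Everything downstream of inv(data) is inferred as Any,
    # so each subsequent operation needs a runtime dispatch.
    B = inv(data)
    return sum(B)
end

function with_barrier()
    # One runtime dispatch at the call site; inside sum_of_inverse,
    # the code is fully type stable.
    return sum_of_inverse(data)
end
```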

Thanks for your reply.

But I don’t understand how to show the loss of performance in the type-unstable case. The first example worked, but not the cases involving LinearAlgebra operations. In the example I want to show, I have a more complicated function with repeated multiplications of matrices and vectors. I get the same results as the type-stable code, which surprised me.

I guess the reason for the difference in this case is that the inversion is a single operation (inv is a function barrier), while the other example involves a loop that uses a global variable; intuitively, the second case creates much more instability because the operation is repeated:

function test_inv_bad(N)
    res = zeros(size(test_bad))
    for i in 1:N
        res += inv(res + test_bad)
    end
    return res
end

@benchmark test_inv_bad(2000)

This case also gives the same performance.

Actually, I don’t see the same performance on this one using test and test_bad, but yes, the difference is not that big, because there are only 2000 iterations and, most importantly, the heavy inv operations dominate overall. But let’s wait for @benny’s response :smiley:

It’s really just the one (or two after it’s lowered) unstable expressions being looped, so I’d rephrase this as runtime dispatches being repeated.

That case still has 1-2 runtime dispatches per inv. I can’t say that every runtime dispatch takes around the same time, but your first example has 2000 taking an extra 28us. 1-2 isn’t going to make a dent in the 213us inv is taking on your 100x100 matrix, and it won’t even be measurable in the runtime noise.
Type instability is much more of a concern when runtime dispatches balloon to greater numbers in the following code and your actual work takes a shorter time. Say a runtime dispatch takes 14ns, it’ll matter when it’s in a loop with a 3ns addition.
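That "dispatch in a loop around a cheap operation" scenario can be sketched like this (hypothetical names; the timings printed are illustrative of the shape of the experiment, not figures from this thread):

```julia
# Sketch: dispatch overhead only dominates when the per-iteration work is tiny.
a_untyped = 1.0         # plain global: every read is type-unstable
const a_const = 1.0     # const global: fully inferred

function sum_unstable(n)
    s = 0.0
    for i in 1:n
        s += a_untyped  # runtime dispatch (and boxing) on every iteration
    end
    return s
end

function sum_stable(n)
    s = 0.0
    for i in 1:n
        s += a_const    # tight, allocation-free loop
    end
    return s
end

sum_unstable(10^6); sum_stable(10^6)            # warm up / compile both
println("unstable: ", @elapsed(sum_unstable(10^6)), " s")
println("stable:   ", @elapsed(sum_stable(10^6)), " s")
```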

Ok,

so the problem of performance can be seen when there are many runtime dispatches around cheap operations, right? In the LinearAlgebra case, to see a noticeable difference, one would need a number of dispatches roughly equal to (time of the inner operation) / (runtime dispatch time). As you mentioned, a runtime dispatch takes around 14 ns while the inv takes 213 μs, so I would need to repeat it approximately 213000/14 ≈ 15000 times to see some difference. Have I understood correctly?

Moreover, just a final question. Does the runtime dispatch actually fix the type instability after its execution? What I mean is: once the runtime dispatch is executed, is the inner code after that point type-stable? My guess is yes.

I don’t know if there’s a ratio to aim for, and in my experience there’s no decent estimate of how long a runtime dispatch takes because the processor tries a few things to run faster. However, you are right that you would need a lot of runtime dispatches to outweigh something that inherently takes longer to run.

Maybe. The called function’s code can be type stable given the runtime types of the arguments; separate methods being compiled separately offers a chance that one could be type stable despite being called by a type-unstable method, hence “function barrier.” The following code in the same method as the runtime dispatched call can be type stable if it never uses the unstable return value of the runtime dispatched call; otherwise, the instability usually propagates.

Note that inferring an abstract type doesn’t always cause runtime dispatch or type instability. A very small (<=3 elements) inferred Union type usually lowers to branched compile-time dispatches that can recover stability. The compiler might also infer a concrete return type that doesn’t depend on the unstable input type for functions with few methods (I think <=3), but that optimization is brittle if the function is public and subject to method addition. Always worth using @code_warntype, JET, or Cthulhu to check.
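The small-Union case can be sketched like this (hypothetical names; `Base.return_types` shows that inference keeps a small Union rather than widening to Any, which is what enables union splitting):

```julia
# Sketch: a small inferred Union is usually union-split into
# compile-time branches instead of a runtime dispatch.
flip(flag) = flag ? 1 : 2.0     # inferred as Union{Float64, Int64}

function add_one(flag)
    x = flip(flag)   # small Union: the compiler can branch on the two
    return x + 1     # concrete types rather than dispatching at runtime
end

# Both still infer to a small Union, not Any:
println(Base.return_types(flip, (Bool,)))
println(Base.return_types(add_one, (Bool,)))
```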

Ok, but in my case the return type of inv may depend on the input type (in this case unstable). Am I wrong?

Here, instead, it seems that there is a runtime dispatch before calling inv, so inv already knows the input type and dispatches to LAPACK or whatever (and so the code inside inv is already type stable).

Correct.

There’s no contradiction here.

Sorry, but then I don’t understand :laughing:

We said that runtime dispatch runs before calling inv, meaning that the compiler has, at that point, lost track of the array type (am I right?). So, as far as I understand, we don’t see a huge difference in performance because the runtime dispatch figures out that it is dealing with a Matrix and picks the method that calls LAPACK or whatever. That operation is quite heavy compared to the runtime dispatch itself, so we see negligible differences.

You seem to be confusing runtime dispatch and compilation, but the two are not directly related. Compilation happens just fine even for unknown types. It is just that the emitted code is very generic and cannot assume any types, which makes it slow. So your test_inv_bad is compiled to something like:

* Load global value of global variable :test_bad
* Call function :inv with that value

Where that second step involves runtime dispatch, which means that the Julia runtime looks at the concrete value of test_bad (every value at runtime has a concrete type) and at the method table of inv and then figures out which method to call. Inside that method the type is perfectly known and thus the specific method of inv runs fast! This is the principle behind a function barrier.
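Assuming the definitions from earlier in the thread, those two lowered steps are visible with @code_warntype (in a script it needs InteractiveUtils; in the REPL it is available by default):

```julia
using LinearAlgebra           # inv for matrices
using InteractiveUtils        # @code_warntype

test_bad = rand(100, 100)     # untyped global
test_inv_bad() = inv(test_bad)

# The global load is reported as ::Any and the inv call as a dynamic
# dispatch (highlighted in red in the REPL):
@code_warntype test_inv_bad()
```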

So how could you demonstrate the effect of type instability then? You’d need to have more code at the place where the instability is, i.e. inside test_inv_bad. For example, by using the inverted matrix a bunch of times, because that triggers a dynamic dispatch every time. Note though that each of these dispatched calls needs to be a very cheap operation, or otherwise the overhead of dynamic dispatch won’t show.
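One way to set that up (a sketch with hypothetical names; the cheap indexing inside the loop is what exposes the dispatch overhead, since the result of inv on the untyped global is inferred as Any):

```julia
using LinearAlgebra

mat_untyped = rand(100, 100)          # plain global
const mat_const = copy(mat_untyped)   # same data, but type-stable access

function downstream_unstable(N)
    B = inv(mat_untyped)   # B is inferred as Any
    s = 0.0
    for i in 1:N
        s += B[1, 1]       # every index + add goes through runtime dispatch
    end
    return s
end

function downstream_stable(N)
    B = inv(mat_const)     # B is inferred as Matrix{Float64}
    s = 0.0
    for i in 1:N
        s += B[1, 1]       # tight, dispatch-free loop
    end
    return s
end
```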

A takeaway here is that dynamic dispatch/type instability really only matters in tight inner loops. As soon as the type-unstable function call takes more than a couple hundred microseconds, the dynamic dispatch becomes negligible. Not all Julia code needs to be type stable - it is sufficient if the important parts are :slight_smile:

Everything clear. Thank you all.