Does Julia Create a "1.5" Language Problem?

It appears that many people see a different language in(side) Julia than I do …
To me, Julia feels a lot like Common Lisp, albeit with a couple of advantages (as well as some disadvantages). First and foremost, the nicest thing about Julia is that multiple dispatch is pervasive throughout the language. Extensibility is therefore not an add-on or afterthought but unrestricted, leading to very modular and composable code (as well as to some issues with informal interfaces and the like).
Secondly, compiling methods as late as possible, i.e., at the time of first call, makes it possible to optimize most (or some, depending on your code) uses of dynamic dispatch and enables global optimizations almost as in statically typed languages. Those optimizations are very hard in most dynamically typed languages: in Common Lisp, for instance, all functions/methods are compiled at definition time, so their use within other (non-generic) functions cannot be devirtualized. Similarly, C++ will not de-virtualize single dispatch via vtables, but mainly instantiates templates at compile time. On the other hand, this approach can require compiling large chunks of code in order to specialize and devirtualize all method calls, as observed in TTFX issues.
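To make the specialization point concrete, here is a minimal sketch (illustrative names only, not from any particular package): the same generic call is devirtualized once the container's element type is concrete, and can no longer be resolved to a single method when it is not.

abstract type Shape end
struct Circle <: Shape; r::Float64; end
struct Square <: Shape; s::Float64; end
area(c::Circle) = pi * c.r^2
area(s::Square) = s.s^2

total(shapes) = sum(area, shapes)

total([Circle(1.0), Circle(2.0)])        # specialized for Vector{Circle}: `area` is devirtualized and inlined
total(Shape[Circle(1.0), Square(2.0)])   # abstract element type: `area` cannot be resolved to one method at compile time
# Compare @code_typed total([Circle(1.0)]) with @code_typed total(Shape[Circle(1.0)])
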
Third, having the runtime available at all times is a big plus to me and much better for interactive use and exploration of an existing code base. In the end, I mostly get annoyed by type systems telling me that I cannot do this or that. They might catch some stupid mistakes, but so does a very interactive workflow of writing small functions and immediately trying them out from the REPL.
Overall, coming from Lisp, Julia feels almost at home and man is it fast. I can understand though that coming from Haskell, C++, Rust etc. you might get a different impression.

5 Likes

There’s actually a PR from 4 days ago to add a JIT compiler to CPython:

See also JIT Compiler for CPython

2 Likes

Oh no, please don’t do this.

2 Likes

I don’t think this is true at all

If you don’t need it, all that means is that dynamic dispatch is not a bottleneck in your program, not that it doesn’t exist.

3 Likes

I’m wondering if this topic needs splitting to separate the original topic of naive vs optimized Julia code from the (extremely interesting) topic of static type checking. Maybe around post #60 @mbauman or other mod?

4 Likes

If you add a JIT to Common Lisp you can achieve the same thing. See the LLVM dev’s talk “Lessons Learned Implementing Common Lisp with LLVM”…

But C++ can de-virtualize single dispatch via vtables if the compiler can prove something is a leaf type; even Java can do this. This is also why C++ introduced the final keyword.

C++ also has a lot of specialization and optimization, so why is that not a problem there (at least it shows up as slow compilation, not in the form of TTFX issues)? (Hint: it is indeed a problem, see the ClangJIT paper for more details.)

A static subset can still optionally have a runtime… See how TypeScript transpiles to JS.

1 Like

Is this part of the faster-cpython project? I am really surprised that they already have a finished implementation. Anyway, even without this JIT patch, untyped Python is already faster than untyped Julia…

In the spirit of Cunningham’s Law [1], I cobbled together a basic dynamic dispatch benchmark in Python, Julia and Rust [2].

Python Code
import itertools
import statistics
import timeit

class Letter:
    __slots__ = "i"
    def __init__(self, i):
        self.i = i

    def dostuff(self, x): ...

class A(Letter):
    def dostuff(self, x):
        return x + self.i

class B(Letter):
    def dostuff(self, x):
        return x + self.i

class C(Letter):
    def dostuff(self, x):
        return x + self.i

class D(Letter):
    def dostuff(self, x):
        return x + self.i

def bench(xs, i):
    res = i
    for x in xs:
        res = x.dostuff(res)
    return res

def show_results(results):
    print(f"min ... max: {min(results):.3f} ns ... {max(results):.3f} ns")
    print(f"median: {statistics.median(results):.3f} ns")
    print(f"mean +- stddev: {statistics.mean(results):.3f} ns +- {statistics.stdev(results):.3f} ns")

if __name__ == "__main__":
    items = [T(1) for T in itertools.islice(itertools.cycle([A, B, C, D]), 20)]
    nsamp, neval = 10000, 1000
    ns_multiplier = 1e9

    results = timeit.repeat("bench(items, 1)", repeat=nsamp, number=neval, globals=globals())
    show_results([result * ns_multiplier / neval for result in results])
    print()
    # Avoid small int cache
    results = timeit.repeat("bench(items, 100000)", repeat=nsamp, number=neval, globals=globals())
    show_results([result * ns_multiplier / neval for result in results])
Julia Code
using BenchmarkTools

abstract type Letter end

struct A <: Letter
    i::Int
end
struct B <: Letter
    i::Int
end
struct C <: Letter
    i::Int
end
struct D <: Letter
    i::Int
end

# > 3 methods to disable world-splitting heuristic (max_methods)
dostuff(a::A, x) = x + a.i
dostuff(b::B, x) = x + b.i
dostuff(c::C, x) = x + c.i
dostuff(d::D, x) = x + d.i

function bench(xs, i)
    res = i
    for x in xs
        res = dostuff(x, res)
    end
    return res
end

let items = [T(1) for T in repeat([A, B, C, D], 5)]
    display(@benchmark bench($items, i) setup=(i=1))
    println()
    # Avoid small Int cache
    display(@benchmark bench($items, i) setup=(i=100000))
    println()
end
Rust Code
use criterion::{black_box, criterion_group, criterion_main, Criterion};

trait Letter {
    fn dostuff(&self, x: isize) -> isize;
}
struct A {
    i: isize
}
struct B {
    i: isize
}
struct C {
    i: isize
}
struct D {
    i: isize
}
impl Letter for A {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}
impl Letter for B {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}
impl Letter for C {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}
impl Letter for D {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}

fn bench(xs: &Vec<Box<dyn Letter>>, i: isize) -> isize {
    let mut res = i;
    for x in xs.iter() {
        res = x.dostuff(res)
    }
    return res
}

fn criterion_benchmark(c: &mut Criterion) {
    let mut xs: Vec<Box<dyn Letter>> = Vec::new();
    for i in 0..20 {
        match i % 4 {
            0 => xs.push(Box::new(A { i: 1 })),
            1 => xs.push(Box::new(B { i: 1 })),
            2 => xs.push(Box::new(C { i: 1 })),
            3 => xs.push(Box::new(D { i: 1 })),
            _ => unreachable!(),
        }
    }
    c.bench_function("dynamic dispatch with virtual methods", |b| b.iter(|| {
        let i = black_box(100000);
        bench(black_box(&xs), i)
    }));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Results:

> python bench.py 
min ... max: 1156.847 ns ... 2555.920 ns
median: 1161.976 ns
mean +- stddev: 1165.322 ns +- 36.903 ns

min ... max: 1304.791 ns ... 2294.531 ns
median: 1360.398 ns
mean +- stddev: 1364.048 ns +- 19.811 ns
> julia bench.jl
BenchmarkTools.Trial: 10000 samples with 251 evaluations.
 Range (min … max):  299.191 ns … 504.641 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     300.422 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   302.278 ns ±  10.312 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▇█                                                           ▂
  ███▇▃▅▄▇█▆▆▅▅▅▅▅▅▆▆▅▆▅▆▇▄▅▅▅▅▃▆▅▅▅▅▅▅▅▄▅▄▆▅▄▄▅▃▅▅▄▄▅▄▃▄▄▄▁▄▄▅ █
  299 ns        Histogram: log(frequency) by time        359 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (min … max):  386.480 ns …  2.449 μs  ┊ GC (min … max): 0.00% … 77.01%
 Time  (median):     398.871 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   408.175 ns ± 72.037 ns  ┊ GC (mean ± σ):  0.58% ±  2.81%

  ▁    ██▆▄▂▁                                       ▁▁         ▁
  █▇▆▅███████▆▇▆█▇▆▇▆▆█▇▇▅▅▄▅▄▅▅▅▅▅▆▇█▇▇▆▅▅▄▄▂▅▅▅▇█████▇▅▄▂▆▅▇ █
  386 ns        Histogram: log(frequency) by time       506 ns <

 Memory estimate: 336 bytes, allocs estimate: 21.
> cargo bench
...
     Running benches/bench_dyn_dispatch.rs (target/release/deps/bench_dyn_dispatch-baccd544aa3dd6a1)
Benchmarking dynamic dispatch with virtual methods: Collecting 100 samples in estimated 5.0 s
dynamic dispatch with virtual methods
                        time:   [27.381 ns 27.393 ns 27.409 ns]
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) high mild
  14 (14.00%) high severe

Note that there are two benchmark runs for Python and Julia because both languages employ small integer caching. This lets us disentangle some of the overhead related to boxing arguments for dynamic dispatch from everything else.
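
For the curious, here is a small hedged illustration of that cache on the Julia side (the exact cached range is an internal implementation detail, so treat the numbers as typical rather than guaranteed):

boxit(r, v) = (r[] = v; nothing)   # storing an Int into a Ref{Any} forces boxing
r = Ref{Any}(0)
boxit(r, 1); boxit(r, 100_000)     # warm up so compilation is not measured
@allocated boxit(r, 1)             # typically 0 bytes: the small-Int box comes from a cache
@allocated boxit(r, 100_000)       # typically 16 bytes: a fresh box is allocated each call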

So at least on this example, untyped Julia is quite a bit faster. I genuinely tried multiple permutations of the code to reduce Python-side overhead and make things less advantageous for Julia, but the gap remained. I’m sure there are real world examples showing the opposite, but I couldn’t think of where to start there. If anyone has experience with this, here’s your incentive to prove me wrong!

More interesting IMO is the comparison between Julia and Rust. Even when allocation overhead is eliminated by sticking to small integers, single dispatch with vtables in Rust is an order of magnitude faster. What’s surprising is that a perf gap exists despite the optimizations Julia has for method lookup. Quoting from Julia Functions · The Julia Language :

The next issue concerns the structure of method cache hash tables. Empirical studies show that the vast majority of dynamically-dispatched calls involve one or two arguments. In turn, many of these cases can be resolved by considering only the first argument. (Aside: proponents of single dispatch would not be surprised by this at all. However, this argument means “multiple dispatch is easy to optimize in practice”, and that we should therefore use it, not “we should use single dispatch”!) So the method cache uses the type of the first argument as its primary key. Note, however, that this corresponds to the second element of the tuple type for a function call (the first element being the type of the function itself). Typically, type variation in head position is extremely low – indeed, the majority of functions belong to singleton types with no parameters. However, this is not the case for constructors, where a single method table holds constructors for every type. Therefore the Type method table is special-cased to use the first tuple type element instead of the second.

One can see the exact implementation of this hash table and the rest of the method lookup logic in julia/src/gf.c at v1.10.0 · JuliaLang/julia · GitHub.

Thus my follow-up question is why we’re so much slower. Is it the speed of method lookup or something else? Is there any low-hanging fruit which could be optimized here?


  1. “The best way to get the right answer on the Internet is not to ask a question; it’s to post the wrong answer.” ↩︎

  2. Why Rust? Unlike Java, it’s easy to verify that the compiler is not devirtualizing the calls being made here. And unlike C, I can actually write correct Rust code :stuck_out_tongue:. ↩︎

11 Likes

You need to test sufficiently complicated programs. This toy case is specially handled by Julia’s method table lookup procedure, which is why it’s deceptive. Consider some classical leetcode problems with complicated control flow.

Internally, when performing a dynamic dispatch (jl_apply_generic), the runtime first gets the address of the current stackframe, then uses this address to look up a method table cache. In your case, all the dynamic calls occur at the same place, so they perfectly hit the cache…

function bench(xs, i)
    res = i
    for x in xs
        # all the dynamic calls have the same stackframe address
        res = dostuff(x, res)
    end
    return res
end

There’s nothing we can do here… What’s worse, these caches turn into deoptimizations when you have real runtime dispatch rather than mere type instability, because the chance of hitting the cache becomes much lower.
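
To make this testable, a rough sketch (assuming the per-call-site cache behaves as described above) is to spread the same dynamic calls across several distinct call sites and compare against the single-site loop:

@noinline site1(x, res) = dostuff(x, res)   # each wrapper gives the dynamic call its own call site
@noinline site2(x, res) = dostuff(x, res)
@noinline site3(x, res) = dostuff(x, res)
@noinline site4(x, res) = dostuff(x, res)

function bench_multisite(xs, i)
    res = i
    for (k, x) in enumerate(xs)
        m = k % 4
        if m == 0
            res = site1(x, res)
        elseif m == 1
            res = site2(x, res)
        elseif m == 2
            res = site3(x, res)
        else
            res = site4(x, res)
        end
    end
    return res
end
# @benchmark bench_multisite($items, i) setup=(i=100000)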

There are many pathological cases. For example, I once saw Plots.jl generate a method instance with 200 arguments and CSV.jl generate a function with 10000 lines of IR code (with all the CSV options as inputs). Dispatch on keyword arguments also works pretty badly. There is no way to systematically control them.

Can you provide some runnable examples we can use, then? The only reason I bothered writing something up is that nobody was willing to write or link to benchmarks showing pathological behaviour of dynamic dispatch in Julia, especially cases where an equivalent Python program was faster.

If there isn’t a benchmark to test against, nobody has an incentive to try to improve the situation because they can’t know if their changes are doing anything! And we already have a precedent for such benchmarks in GitHub - JuliaCI/BaseBenchmarks.jl: A collection of Julia benchmarks available for CI tracking from the JuliaLang/julia repository. It just doesn’t have anything which stresses dynamic dispatch in isolation, to my knowledge.

8 Likes

Consider the quicksort example. Suppose a Julia newbie writes this code by directly porting Python code to Julia. I saw a lot of code like this when I tested my type checker.

def partition(array, low, high):
 
    # choose the rightmost element as pivot
    pivot = array[high]
 
    # pointer for greater element
    i = low - 1
 
    # traverse through all elements
    # compare each element with pivot
    for j in range(low, high):
        if array[j] <= pivot:
 
            # If element smaller than pivot is found
            # swap it with the greater element pointed by i
            i = i + 1
 
            # Swapping element at i with element at j
            (array[i], array[j]) = (array[j], array[i])
 
    # Swap the pivot element with the greater element specified by i
    (array[i + 1], array[high]) = (array[high], array[i + 1])
 
    # Return the position from where partition is done
    return i + 1
 
# function to perform quicksort
 
 
def quickSort(array, low, high):
    if low < high:
 
        # Find pivot element such that
        # element smaller than pivot are on the left
        # element greater than pivot are on the right
        pi = partition(array, low, high)
 
        # Recursive call on the left of pivot
        quickSort(array, low, pi - 1)
 
        # Recursive call on the right of pivot
        quickSort(array, pi + 1, high)

The ported Julia code. It’s carefully designed to introduce dynamic dispatch.


struct ZeroArray
    arr
end

Base.getindex(arr::ZeroArray, i) = arr.arr[i+1]
Base.setindex!(arr::ZeroArray, v, i) = Base.setindex!(arr.arr, v, i+1)

global xxx = nothing
const ref = Ref{Any}()
# Function to find the partition position
function partition(array, low, high)
    low = xxx === nothing ? low : ref[]
    high = xxx === nothing ? high : ref[]
    # choose the rightmost element as pivot
    pivot = array[high]
 
    # pointer for greater element
    i = low - 1
 
    # traverse through all elements
    # compare each element with pivot
    for j in low:high-1
        if array[j] <= pivot
 
            # If element smaller than pivot is found
            # swap it with the greater element pointed by i
            i = i + 1
 
            # Swapping element at i with element at j
            (array[i], array[j]) = (array[j], array[i])
        end
    end
 
    # Swap the pivot element with the greater element specified by i
    (array[i + 1], array[high]) = (array[high], array[i + 1])
 
    # Return the position from where partition is done
    return i + 1
end
# function to perform quicksort
 
 
function quickSort(array, low, high)
    if low < high
 
        # Find pivot element such that
        # element smaller than pivot are on the left
        # element greater than pivot are on the right
        pi = partition(array, low, high)
 
        # Recursive call on the left of pivot
        quickSort(array, low, pi - 1)
 
        # Recursive call on the right of pivot
        quickSort(array, pi + 1, high)
    end
end

Test code (the Python timing script, followed by the Julia equivalent):

import random
x = [i for i in range(100000)]
random.shuffle(x)
import time

t0 = time.time()
quickSort(x, 0, len(x)-1)
t1 = time.time()

total = t1-t0
print(total)

import Random
arr = Vector{Any}(1:100000)
Random.shuffle!(arr)
x = ZeroArray(arr)
# trigger compilation
quickSort(x, 0, length(arr)-1)

arr = Vector{Any}(1:100000)
Random.shuffle!(arr)
x = ZeroArray(arr)

print(@time quickSort(x, 0, length(arr)-1))

On my computer, Julia runs 5 times slower than Python (1 s vs 0.2 s). I used Julia 1.9.2, so we may assume this is not caused by the slowness of an old Julia version.

What’s worse, I just noticed that if you print the arr and try to interrupt it, this is what you get. Maybe this is already fixed in newer Julia, I don’t know.

6 Likes

Interesting example, thanks. I took the liberty to make some changes for ease of reporting:

  • Used a version of the same results printing code as above
  • Used a reversed range as input instead of a shuffled one. This is because the amount of time quicksort runs is heavily dependent on how well sorted the array is already.
  • Reduced the array length to 1000 because the O(N^2) scaling on a worst-case (reversed) input would be too slow otherwise. This also necessitated bumping up the default Python recursion limit. (A sketch of the resulting Julia setup follows this list.)
  • Added an equivalent ZeroArray type to the Python code for a better apples-to-apples comparison. This just forwards __getitem__ and __setitem__.
  • Added a second benchmark using just the inner Vector{Any} without any wrappers for the same reason.
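
For concreteness, the Julia side of that setup looks roughly like this (an assumed sketch of the inputs described above, not the exact script):

using BenchmarkTools
# worst-case (reversed) Vector{Any} input of length 1000, rebuilt fresh for every sample
display(@benchmark quickSort(x, 0, 999) setup=(x = ZeroArray(Vector{Any}(1000:-1:1))) evals=1)
display(@benchmark quickSort(x, 1, 1000) setup=(x = Vector{Any}(1000:-1:1)) evals=1)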

With that out of the way, here are the results:

> python quicksort.py
With ZeroArray wrapper:
min ... max: 136.786 ms ... 162.389 ms
median: 139.562 ms
mean +- stddev: 140.089 ms +- 3.208 ms

Without ZeroArray wrapper:
min ... max: 22.793 ms ... 26.501 ms
median: 22.913 ms
mean +- stddev: 23.313 ms +- 0.938 ms
> julia quicksort.jl
With type unstable ZeroArray wrapper:
BenchmarkTools.Trial: 37 samples with 1 evaluation.
 Range (min … max):  128.496 ms … 149.270 ms  ┊ GC (min … max): 0.79% … 3.09%
 Time  (median):     135.176 ms               ┊ GC (median):    3.01%
 Time  (mean ± σ):   136.266 ms ±   4.866 ms  ┊ GC (mean ± σ):  3.53% ± 1.59%

              █                                                  
  ▅▁▁▅▁▅▁▁▁█▁██▅▁█▁▅▅█▅▅▅▁▅█▅▅▅▅▅▁▁▁▁▅▁▁▁▁▅▅▁▁▁▁▁▅▅▁▁▁▁▁▁▅▁▁▁▁▅ ▁
  128 ms           Histogram: frequency by time          149 ms <

 Memory estimate: 62.03 MiB, allocs estimate: 3564405.

Without ZeroArray wrapper:
BenchmarkTools.Trial: 53 samples with 1 evaluation.
 Range (min … max):  90.904 ms … 106.155 ms  ┊ GC (min … max): 0.00% … 1.57%
 Time  (median):     94.294 ms               ┊ GC (median):    0.70%
 Time  (mean ± σ):   94.476 ms ±   2.508 ms  ┊ GC (mean ± σ):  0.69% ± 0.68%

  ▃     █ ▃    █  ▃▃ ▃ ▃  ▃█▃   ▃▃▃    ▃                        
  █▁▁▇▁▁█▇█▁▇▇▁█▇▁██▁█▇█▁▇███▇▁▁███▇▁▇▇█▇▁▁▇▇▁▁▇▁▁▇▁▁▇▁▇▁▁▁▁▁▇ ▁
  90.9 ms         Histogram: frequency by time         99.1 ms <

 Memory estimate: 24.46 MiB, allocs estimate: 1102456.

So untyped wrapper performance is basically the same, which is surprising. Using the array/list directly, Python is 4-5x faster, as you mentioned.

7 Likes

And for completeness, here’s how the Julia times change if xxx is made const:

 > julia quicksort.jl
With type unstable ZeroArray wrapper:
BenchmarkTools.Trial: 58 samples with 1 evaluation.
 Range (min … max):  85.002 ms … 98.931 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     85.927 ms              ┊ GC (median):    0.31%
 Time  (mean ± σ):   87.055 ms ±  2.816 ms  ┊ GC (mean ± σ):  0.33% ± 0.32%

  █▄ ▂▅                                                        
  ██▃███▅▁▁▅▁▅▃█▁▁▃▃▃▁▃▃▁▃▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  85 ms           Histogram: frequency by time        97.6 ms <

 Memory estimate: 25.94 MiB, allocs estimate: 1700025.

Without ZeroArray wrapper:
BenchmarkTools.Trial: 746 samples with 1 evaluation.
 Range (min … max):  6.560 ms …  11.054 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.585 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.695 ms ± 357.016 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▂▂▄  ▂▁                                                    
  █████████▅▇▇▅▅▁▅▆▆▆▆▅▄▄▅▅▁▄▁▁▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▄▄ ▇
  6.56 ms      Histogram: log(frequency) by time      8.56 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Likes

Yeah…

I think Python has special optimizations for its builtin list, and Python programmers seldom define their own getter and setter functions, which makes Python a really good tool for scripting. In contrast, Julia has many wrapper functions and specializations, so people have to pay attention to use these wrappers correctly.

And this is only one of the issues. We also have keyword arguments, closures, generators, ntuple and varargs. The dispatch algorithms for these are more complicated since they involve many arguments, and it’s easy to introduce type instabilities through them, let alone the compilation latency problem.
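
As a small sketch of how easily this happens with closures (illustrative example):

function sum_closure(xs)
    acc = 0
    foreach(x -> (acc += x), xs)   # `acc` is captured and reassigned, so it becomes a Core.Box
    return acc                     # inferred as Any; every use of `acc` goes through runtime dispatch
end
# @code_warntype sum_closure([1, 2, 3]) shows the Core.Box / Any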

Here is one interesting fact: although my SimpleTypeChecker.jl uses a simple type inference algorithm, it still suffers from the compilation latency problem, because to handle closures and recursive functions it relies on the ground-truth result from Julia’s builtin type inferencer. If you have many recursively defined functions and you make a small mistake somewhere, Julia will only gradually converge to the bottom type, and a lot of effort is wasted in this process. The slowdown is really evident, around 5x.

Changing the constness of xxx suddenly makes Julia type stable. The performance gap is huge here, and it also greatly improves the compilation latency.
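
For reference, the change being discussed is just:

const xxx = nothing   # instead of `global xxx = nothing`
# With a constant, `xxx === nothing` folds at compile time, `ref[]` is never read,
# and `low`/`high` stay inferred as Int, so partition/quickSort compile type-stable code.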

I disagree; to make fair comparisons of specific mechanisms like runtime dispatch, you need microbenchmarks, likely of pieces of a full mechanism. Your quicksort benchmark does many things in addition to the numerous runtime dispatches, so the comparison to Python becomes too imprecise. It’s only suitable for profiling quicksorts of heterogeneous vectors, not for precise blanket comparisons like “untyped Julia is 5x slower than pure Python”. Not that a precise blanket comparison is even possible for two whole languages that do several things, but the Julia quicksort isn’t even entirely untyped. Then again, I’m not sure whether CPython’s own compiler optimizes away the runtime dispatch of integer addition; it does trim the bytecode a little.

From what I can tell, I agree that the multiple dispatch mechanism is likely more complicated than what ToucheSir’s first benchmark measures, so it might not have a speed advantage in other cases, like more arguments or absence of existing compiled methods. That latter case would be hard to benchmark because you’d need to separate it from the method compilation time, hence “pieces of a full mechanism”.

If the concern is that the method table might be fixed at compile-time, you could vary the callable’s type to force looking up different method tables at runtime. After all, dispatch occurs over the types of the callable and the arguments.
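
A tiny illustrative sketch of what I mean:

callables = Any[sin, cos, abs, x -> x + 1]    # distinct function types in an untyped container
args      = Any[1.0, 2.0, -3, 4]
sum(f(v) for (f, v) in zip(callables, args))  # both the callable and the argument are resolved at runtime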

Not if you really need it. Small-Union-splitting is also technically runtime dispatch, only via a cheap branch to a few compiled methods fixed at compile-time, and it’s very necessary for the idiomatic practice of returning nothing or missing. It’s not a stretch to want faster, even return type-stable, runtime dispatch in more cases, possibly allowed by semantic limitations.
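
A minimal sketch of that idiom:

function firstbig(xs::Vector{Int})
    i = findfirst(>(10), xs)          # inferred as Union{Nothing, Int}
    return i === nothing ? 0 : xs[i]  # compiled to a cheap branch, not a full dynamic dispatch
end
# @code_warntype firstbig([1, 2, 30]) shows the small Union rather than Any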

2 Likes

I’m not sure the difference is huge. In practice people love to use plain arrays in Julia too because the [] syntax is convenient, and most wrappers are typed instead of untyped. Agreed with the rest of your comment though.

Yes, I included it more out of curiosity, but folks can ignore it since it doesn’t really factor into this discussion. Either way, one can recreate the same timings without globals by @nospecialize-ing low and high, which is basically what Python does.
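
For reference, the effect of @nospecialize in a toy form (an illustrative sketch, not the quicksort code itself):

addone(@nospecialize(x)) = x + 1   # one compiled instance with an untyped argument
addone(1), addone(1.5)             # the `+` inside goes through runtime dispatch, much like CPython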

On that note, I did test on an older Python version (3.10) which doesn’t have many of the recent perf improvements around simplifying, specializing or optimizing bytecode. That said, the easiest way to judge whether the quicksort benchmark is precise enough would be to benchmark it using whatever tools are most appropriate for this. It may be that non-dispatch related stuff only accounts for 50% of the perf difference, for example. I would volunteer but I wouldn’t know where to start; this seems like a task for someone with proper knowledge of internals and working with Julia’s C(++) code.

2 Likes

I don’t need to do this, because my intent is not to compare Python and Julia fairly.

I just want to point out what happens if some intuitively ported code is not well typed: it can lead to extremely bad performance and even fall behind Python. If untyped Julia were always faster than Python, then we could use traditional tiered JIT techniques to hide compilation latency. The immediate consequence of this simple fact is that you have to write well-typed code, otherwise long compilation latency is a problem.

Note, I 100% agree that carefully written, type-stable Julia code is generally much faster than Python. But there is a high skill threshold for doing so without a type checker, and when you try to refactor your code it’s easy to introduce unstable code. That’s why I want to emphasize the importance of the performance of dynamic dispatch and to avoid artificial microbenchmarks: the code patterns in these microbenchmarks are different from what you see in the wild, and deceptive.

We don’t have many options here. If people don’t want to write typed codes then we can only do tiered compilation and do interpretation when appropriate.

No, it’s because we need recursion and branching here, to create unstable stackframe addresses so that the small cache is missed. That’s why I picked this classical sorting algorithm.

2 Likes

That’s fine, but it’s starting to veer away from the topic of runtime dispatch performance specifically. It can’t be “untyped” in a formal sense, either; there’s no such thing as an untyped instance, the types just exist at runtime. Even ignoring the integers, the quicksort benchmark is not fully untyped in the inference sense either, Vector{Any} is a concrete type so you won’t be dispatching its methods like getindex at runtime. Barring some bytecode optimization I’m not aware of, that’s not fair to CPython.
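
To illustrate (a toy sketch):

xs = Any[1, 2.0, "three"]
first_plus_one(v) = v[1] + 1   # for v::Vector{Any}, getindex resolves statically; only `+` on the Any element dispatches at runtime
first_plus_one(xs)
# @code_warntype first_plus_one(xs)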

Not really, whether you’re compiling too much is an orthogonal issue. You can write poorly inferred code that compiles the minimum, that’s an intentionally leveraged advantage.

People who stuck around after hearing about annotations, type parameters, and zero(eltype(A)) definitely do. People who write Python do too, that’s why NumPy/Numba/Cython and Mypy/Pyre exist. People who avoid thinking about types likely don’t want to write much code anyway.

Could you explain the purpose of this? My understanding is that to properly compare with Python’s searching an instance’s type for its encapsulated methods, guaranteeing that Julia looks up a callable’s method table at runtime would be enough. What would unstable stackframe addresses and cache misses accomplish? It’s not even apparent that either must occur when Python does runtime dispatch.

:thinking:

I think it requires a high threshold of skill to write 100% type stable julia

but I don’t think 100% type stable julia is required to beat Python in speed. In fact, when I implement my code in both, I have never failed to beat my Python implementation with my Julia implementation, and it usually doesn’t take much effort

5 Likes