Optimized Python is as good as Julia

Comparison of Python / Julia benchmarks for a maths problem.

I am considering switching to Julia, but would like to know if it is worth it for such tasks.

Here is the Python code I used:

import random
import numpy as np
import matplotlib.pyplot as plt
import time

def benchmark_it(simul_func: callable, *args, **kwargs):
    def wrapper(*args, **kwargs):
        start = time.time()
        simul_func(*args, **kwargs)
        end = time.time()
        return end - start
    return wrapper

def run_simulations(n: int, k: int, n_iters: int) -> float:
    hats = np.arange(n)
    ideal = np.arange(n)

    def did_k_matches_occur_in_the_simulation():
        matches = np.sum(hats == ideal)
        return matches == k

    return sum(did_k_matches_occur_in_the_simulation() for _ in range(n_iters)) / n_iters

n_people_to_try = [250, 1000, 5000, 10000]

times = [run_simulations(n_people, 0, 100_000) for n_people in n_people_to_try]

plt.scatter(n_people_to_try, times)
plt.xlabel("'n' in the simulation")
plt.ylabel('Time (s)')

I also attach the used Julia file:

using Random
using BenchmarkTools
using Plots

function run_simulations(n::Int, k::Int, n_simulations::Int)::Float64
    # n: number of people
    # k: number of simulations
    # n_simulations: number of simulations to run

    hats = collect(1:n)
    ideal = collect(1:n)

    function did_k_matches_occur_in_the_simulation()::Bool
        sum(hats .== ideal) == k

    sum((did_k_matches_occur_in_the_simulation() for _ in 1:n_simulations)) / n_simulations

function run_rigorous_simulations()
    n_people_to_try::Array{Int, 1} = [250, 1000, 5000, 10000]

    mean_times::Array{Float64, 1} = []

    for n_people in n_people_to_try
        b = @benchmark run_simulations($n_people, 0, 100_000)
        push!(mean_times, mean(b).time / 1e9)  # Convert to seconds

    plot(n_people_to_try, mean_times,
        xlabel="'n' in the simulation", ylabel="Execution Time (s)",
        title="Performance of run_simulations", marker=:circle,


The plots I got. Julia | Python

The plots show that Julia is only \approx [1.5,2] times faster than Python version. Is this an expected behavior or I used a slow Julia implementation?

If latter, please suggest the ways to optimize the code.

Thanks in advance!

1 Like

I didn’t look at your code too deeply but in general well optimized numpy code has similar performance to julia code, so it’s not surprising. The thing about numpy is that you are always limited to what numpy supports.


Optimized assembly language is likely faster than Julia. Is that a good reason to program in assembly?


Hmm… That makes sense, however, I have never experienced these limitations, since scientific libraries like numpy have a lot to offer.

I wouldn’t say writing an efficient Python code is as hard and time-demanding as writing in Assembly. But I agree that Julia may be a bit easier when you get used to it.

1 Like

If everything you do is covered by numpy, there may be little reason for you to switch. There are however many cases which cannot be easily vectorized for numpy use (differential equations being a prominent example), which is the reason for the existance of cython, numba & co - but then slowly it is not a Python anymore.


Just saying that there are many more considerations for picking a PL than raw speed…


That’s a good point! Thanks!

You code does almost the exact same operations in Julia and in Python, so unsurprisingly its performance is very close for both.
Here, I’m pretty sure the runtime is dominated by random number generation and shuffling – and presumably both languages do a reasonably good job at it, so there’s not much room to optimize the code. If that wasn’t the case however, you could easily make this Julia code more efficient: not materializing the ideal array (why put collect there?), not materializing hats .== ideal array (use sum or count with function argument, or lazy arrays), … . There changes are easy and common in Julia, but hard or impossible in numpy.


At the risk of contributing to a thread that’s already a lot of people piling on, I think it’s helpful to explain when you can expect Python to have the same performance as Julia.

Broadly speaking, NumPy is implemented in C, which has roughly the same performance as Julia. Therefore, the performance of NumPy functions that are called from Python can be expected to be as fast as Julia code.
The run time of the code in you OP is dominated by a few function calls: shuffle, pairwise equality of two arrays, and a sum over a boolean array. That’s why you see little difference in speed.

However, there are two reasons why Julia usually is faster than Python anyway:

First, often you don’t have the luxury of needing to solve a problem that can be expressed using only a few function calls into NumPy. Often, you need to write loops in Python to solve your problems. Try, for example, to find the most common sub-sequence of length 4 in a long array. Then suddenly you are no longer simply calling into C code where all the computation happens - now suddenly a large part of the computation happens in Python, and then you begin to see Julia being 10 or 100 times faster.

Second, for most problems, if it’s truly performance sensitive, there are many small optimisations than can be sought to improve performance of your code. Since Python relies on calling a small set of fixed NumPy functions for speed, this is not generally possible in Python. In Julia however, since the code is “at your fingertips” - i.e. the heavy lifting happens right at the level of the code your write yourself - you can do whatever optimisations you want.

For example, in your OP example, the expression np.sum(hats == ideal) first allocates an array of bools and then sums it, in two passes. This could be more efficient. In Julia, you could for example write count(splat(==), zip(hats, ideal)) to do this in a single pass without allocations.
Also in Julia, there is no need to collect the ideal, since 1:n is a perfectly fine vector that can be used in broadcasting operations - unlike NumPy, which lacks the genericness to allow a Python range object to interact with NumPy.

That leaves about 90% of the time spent in shuffle! making it harder to optimise further - and for me, the Julia code now runs 7x faster than Python.


Just a side note on benchmark comparisons, the mean time a benchmark takes is often a much more noisy and much less helpful metric than the median or minimum time, because the mean is going to be the most affected by random system noise and outliers, which will basically always bias your statistics in only one direction.


or maybe a little easier to understand/read:

sum(a == b for (a,b) in zip(hats,ideal))

A few comments. First, Numpy is multithreaded by default, which is not the case in Julia. So you’re comparing single threaded vs multithreaded.
Second, for-loops are always the way to go in Julia if you need speed.

I gave two options. One with a minor modification to your code. The other is deviating more by using a for-loop + multithread. You also need to be careful here, as multithreading code can introduce bugs. I used @tturbo because I know you can reduce operations with it, but if you use @Threads.threads in the way I wrote it, it’ll give incorrect results.

I haven’t tested the functions, because it’s hard to do it when you’re embedding functions into functions. This is known as closures and it creates problems, not only in terms of testability but also in terms of performance (it can easily introduce type instabilities). I’d avoid it for sure. Having separate functions will help you know where the bottlenecks, make sure that the results are correct, and test several approaches to improve speed.

Hope it helps!

using LazyArrays
function run_simulations2(n::Int, k::Int, n_simulations::Int)::Float64
    # n: number of people
    # k: number of simulations
    # n_simulations: number of simulations to run

    hats = collect(1:n)
    ideal = 1:n

    function did_k_matches_occur_in_the_simulation()::Bool        
        sum(@~ hats .== ideal) == k

    count(did_k_matches_occur_in_the_simulation() for _ in 1:n_simulations) / n_simulations
using LazyArrays, LoopVectorization

function run_simulations3(n::Int, k::Int, n_simulations::Int)::Float64
    # n: number of people
    # k: number of simulations
    # n_simulations: number of simulations to run

    ideal = 1:n
    hats = collect(1:n)
    function did_k_matches_occur_in_the_simulation()::Bool        

        sum(@~ hats .== ideal) == k

    output = 0.
    @tturbo for i in 1:n_simulations 
        output += did_k_matches_occur_in_the_simulation() 

    output / n_simulations

Can you take an example where this matters? I mean NumPy calls BLAS (MKL?), and Julia calls OpenBLAS which IS threaded, both I think only for matrix multiply, even if Julia isn’t started multi-threaded. And you can do that, then at least some code is multithreaded, at least your own code, unlike in Python.

I can’t see that anything in the stdlib/Base is threaded, but that may be the wrong viewpoint, everything is potentially threaded, at least the outer loop calling the stdlib.

For the new user here @tturbo (and more is in the package (as opposed to the built-in threads in Julia):

If you work with matrices, it’s true. But here you’d need to use for instance ThreadsX to get a parallelized version of sum.

I indicated the option with a for-loop to convey a more general message: translating Python code into Julia won’t exploit the advantages of using Julia. When I read OP’s code proposed in Julia, I got the impression that it’s necessary a more Julian approach to achieve speed and readable code. For instance, why there’s so much type annotation? Why allocate intermediate results? Why not use a for-loop? Why use closures? Why not write short functions so you can test them?

In any case (and this is my opinion), if you’re comfortable with Numpy and know that the type of operations you perform are covered by that, think twice about whether learning Julia is wise. You’ll have to learn how to improve performance, change how you think about the code, etc. There’s a trade off, and only OP knows whether this is worthy.

Overall, my message is that Julia CAN be fast, which doesn’t mean that Julia IS fast (in the sense of writing code and automatically getting great results). Moreover, as you have more options in Julia to improve your code, you’ll need to benchmark the time-consuming operations with different approaches/packages, until you get the most performant code for your specific example. In Python you’re bound by what others have implemented in C and write the code accordingly to exploit that. At least, that’s my experience with Python.


Yes this is expected. Python has tons of libraries wrapping performant code, usually compiled from other common languages like C/C++. If you as a user are satisfied with them, I wouldn’t recommend that you switch to another language. If you only want to go as far as writing fast loops with common numeric types, I would suggest you use Numba for easy integration into your Python skillset; similar to Julia’s implementation, Numba is an LLVM-based JIT compiler of a subset of Python and NumPy. Python-NumPy compliance does limit Numba; you can do far more in Julia.

With regards to performance, the advantage of Julia is the performant code can and is written more in Julia directly, so if you’re a developer who wants to build something in a dynamic, highly polymorphic language, Julia provides a greater level of control over the performance than Python. That said, you do need to spend the time to learn a new language and its ecosystem, so if you already wrap C code in Python and it serves your purposes, you have no real need to switch to Julia.


Except that Julia is much more fun :blush:


Many opinions above are valid. I would just add that how productive one is in a given PL can be so
much more important than how fast the programs run. Julia is great in this respect. Python: not so much (for me).


I don’t really get why everyone here is saying there’s not much of a difference. IMO the 1.5-2x difference seen in this microbenchmark is pretty good! Who doesn’t want a free 2x speedup? Especially when there’s ways to go even faster if one learns more.


Because 1.5-2x is a small enough difference that seemingly “unrelated factors” begin to make a difference: What computer you’re using, what version of LLVM, even what version of Python given Python’s performance improvements the latest releases.

Microbenchmarks difference of 1.5x is often not reliably reproduced, and IMO certainly not enough to be a reason to switch programming languages.