I heard some Python users say that Julia’s niche will get narrower over time. The argument is that Python’s basic built-in functions are written in C, so they are faster than their Julia counterparts. For large packages, people can use C++ and its black magic to get extreme performance. Julia is only preferable for mid-sized projects. Over time this niche will get narrower, because Python’s built-in functions and large packages both keep becoming more comprehensive. Is this true?
The key problem with this argument is that “basic built-in functions of Python are written in C so they are faster than the Julia counterparts” is false.
You will also find plenty of examples here, especially from @ChrisRackauckas, where he shows that with the approach where Python is a glue language over C libraries, you cannot easily mix different libraries: for example, automatic differentiation through an ODE solver.
You can see this in the difference between the ecosystems: Julia tends to have relatively small, general packages which you can compose, whereas Python packages tend to be large and tightly coupled.
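To make that concrete, here is a minimal sketch of the kind of composition meant here (not from the thread; it assumes the OrdinaryDiffEq and ForwardDiff packages, and the decay!/final_state names are made up for illustration): forward-mode automatic differentiation flowing straight through a generic ODE solver.
using OrdinaryDiffEq, ForwardDiff
# Exponential decay du/dt = -p*u as an in-place, type-generic ODE function
decay!(du, u, p, t) = (du[1] = -p[1] * u[1])
# Solve for a scalar parameter p and return the state at the final time
function final_state(p)
    u0 = [one(p)]                      # make the state the same (possibly dual) type as p
    prob = ODEProblem(decay!, u0, (0.0, 1.0), [p])
    return solve(prob, Tsit5())[end][1]
end
# ForwardDiff's dual numbers pass through the solver because it is generic Julia code
ForwardDiff.derivative(final_state, 0.5)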
If the built-in functions were fast, why did we spend 10 years making the NumPy ecosystem?
And that C++ “black magic” is so extreme that people have to rewrite NumPy functions in C++ and still end up 2x slower than Julia; see Bottleneck · PyPI.
Also, please don’t make a post just because “some Python guys” said something that is likely false or ill-informed. I’m sure you’ve seen Julia functions that are “not slower” than Python functions that are “fast just because they were written in C”?
We recently added Julia to our codebase, where we’d previously used Python only for 3 years. This was motivated by optimizing a major process that was a key part of our product and cost $15k/month to run. As you can imagine, we had already heavily optimized it in Python. Everything was using numpy and calling C libraries under the hood, but beyond that we employed many performance tricks that resulted in very ugly and unpythonic code. We had several very talented engineers look at this, including one who specializes in optimizing numeric Python code.
In a few days of work, using Julia for the first time, I was able to produce a prototype that was 14 times faster than the heavily optimized Python code. There are a lot of reasons for that, but a few of the main ones were Julia’s light-weight threading model and the fact that profiling in Julia is tremendously easier, since it’s “Julia all the way down.” A few weeks later that code is in production and has seen another ~2x speedup, for a total of ~30x. As far as I can tell, the improvements I made wouldn’t be possible to replicate in Python right now. The fact that we spent 200+ expert hours on the Python version vs. 10 hours of a beginner’s time on the Julia version also tells you a lot about the developer experience.
So I don’t think Python is narrowing Julia’s niche – if anything, Julia’s niche is currently expanding as more and more libraries are developed.
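(Not the author’s production code, of course, but a toy sketch of the “light-weight threading” point: task-based parallelism is ordinary Julia code. The psum name is made up for illustration; run with e.g. julia -t 4.)
using Base.Threads
# Split the indices into one chunk per thread, spawn a task per chunk, sum the partial results
function psum(v)
    chunks = collect(Iterators.partition(eachindex(v), cld(length(v), nthreads())))
    tasks = map(r -> Threads.@spawn(sum(view(v, r))), chunks)
    return sum(fetch, tasks)
end
psum(rand(10^7))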
This kind of thing has been discussed a lot. I think this particular argument uses an incorrect assumption that a problem domain’s methods and challenges are fixed, so in the infinite-time limit, somebody will generate the optimal machine code with a Python wrapper. What if that frontier’s moving, because people are making research progress? If you’re trying to instead develop new methods and ideas, it’s hard to beat Julia.
Basic built-in functions in Julia are likely to be at least as fast as their generic C counterparts.
For example, many of Julia’s special functions are implemented in pure Julia instead of C or Fortran.
Let’s run a quick benchmark:
using BenchmarkTools

# uses the system C library
clog(x) = ccall(:log, Float64, (Float64,), x)
# uses LLVM's log
llvmlog(x) = ccall(Symbol("llvm.log.f64"), llvmcall, Float64, (Float64,), x)
@btime log($(Ref(1.2))[])
@btime clog($(Ref(1.2))[])
@btime llvmlog($(Ref(1.2))[])
I get:
julia> @btime log($(Ref(1.2))[])
5.252 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime clog($(Ref(1.2))[])
6.787 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime llvmlog($(Ref(1.2))[])
6.359 ns (0 allocations: 0 bytes)
0.1823215567939546
Or exp:
# uses the system C library
cexp(x) = ccall(:exp, Float64, (Float64,), x)
# uses LLVM's exp
llvmexp(x) = ccall(Symbol("llvm.exp.f64"), llvmcall, Float64, (Float64,), x)
@btime exp($(Ref(1.2))[])
@btime cexp($(Ref(1.2))[])
@btime llvmexp($(Ref(1.2))[])
Producing:
julia> @btime exp($(Ref(1.2))[])
6.266 ns (0 allocations: 0 bytes)
3.3201169227365472
julia> @btime cexp($(Ref(1.2))[])
11.079 ns (0 allocations: 0 bytes)
3.3201169227365472
julia> @btime llvmexp($(Ref(1.2))[])
10.734 ns (0 allocations: 0 bytes)
3.3201169227365472
Julia is also really good at magic performance tricks for larger projects, should you need them.
Here is a discussion I had with someone who did not believe that Julia functions are sometimes faster than NumPy.
In my tests, element-wise exp in Julia is ~20% faster, and summing a vector is as much as 4 times faster.
Only memory allocation is much faster in NumPy than in Julia (10 µs vs. 1 ms for 1M Float64).
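For reference, the Julia side of that kind of comparison is a one-liner per case with BenchmarkTools (the NumPy side would use %timeit; the exact numbers are machine-dependent):
using BenchmarkTools
x = rand(10^6)
@btime exp.($x);                       # element-wise exp
@btime sum($x);                        # reduction over the vector
@btime Vector{Float64}(undef, 10^6);   # raw (uninitialized) allocation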
Hmm, one person is posting code showing Julia is faster, which I double-checked locally, and the other is saying “I don’t buy reverse-FUD” and quoting performance numbers without any code. Who to believe? This is a hard one. Wait a second: if you search JuliaLang/julia for pull requests, you find:
Which is exactly the change, named exactly how you’d expect, merged at exactly the time you’d expect. But nah, “reverse-FUD”. What kind of thinking is this, where an easily searchable fact is being “argued”? This is bizarre.
julia> using BenchmarkTools
julia> @btime Vector{Float64}(undef, 10^6);
8.377 μs (2 allocations: 7.63 MiB)
Are you by any chance comparing initialized vs. uninitialized memory?
julia> @btime zeros(10^6);
452.453 μs (2 allocations: 7.63 MiB)
Yes, you are right, I compared the creation of an initialized array (all elements set to 0) in both NumPy and Julia.
It looks like it is not the allocation but the initialization that costs most of the time in Julia:
Python:
%timeit a = np.empty(1_000_000)
9.55 µs ± 107 ns per loop
%timeit a = np.zeros(1_000_000)
9.94 µs ± 164 ns per loop
%timeit a = np.ones(1_000_000)
1.4 ms ± 3.02 µs per loop
Julia:
@btime Vector{Float64}(undef, 1_000_000)
12.200 μs (2 allocations: 7.63 MiB)
@btime x = zeros(1_000_000)
1.199 ms (2 allocations: 7.63 MiB)
@btime x = ones(1_000_000)
1.255 ms (2 allocations: 7.63 MiB)
versioninfo()
Julia Version 1.6.0-rc1
Commit a58bdd9010 (2021-02-06 15:49 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)
Edit: interestingly, in NumPy initializing the array with ones is much slower than with zeros; in Julia both take the same time.
Should zeroing the memory take this long? Something feels off here.
NumPy is probably using calloc; see e.g. here and the comments below/above: https://github.com/JuliaLang/julia/issues/9147#issuecomment-312722941
That’s on the order of a cycle or two per element (1.2 ms for 10^6 elements is about 1.2 ns each), so it’s not insane, but clearly it should be possible to do better.
The Stack Overflow question “Why malloc+memset is slower than calloc?” seems to suggest that filling with zeros is a special case: in particular, until you start storing nonzeros, the kernel might lie to you and mmap a page of all zeros. Once you actually need the memory, though, it has to pause and replace the buffer with real memory, initializing it properly. So the fast performance of Python’s zeros probably doesn’t really matter much in practice.
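If one really wanted calloc-style zeros in Julia, a rough sketch (illustrative only; calloc_zeros is a made-up name, built on Libc.calloc and unsafe_wrap) would be:
# Ask the C runtime for zeroed memory and wrap it in a Julia array that owns (and will free) it
calloc_zeros(n) = unsafe_wrap(Array, convert(Ptr{Float64}, Libc.calloc(n, sizeof(Float64))), n; own = true)
calloc_zeros(10^6)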
Looking into the Julia source code, both zeros and ones first allocate an uninitialized array (whose memory may or may not already happen to be zero) and then fill every element with 0 or 1:
for (fname, felt) in ((:zeros, :zero), (:ones, :one))
    @eval begin
        $fname(dims::DimOrInd...) = $fname(dims)
        $fname(::Type{T}, dims::DimOrInd...) where {T} = $fname(T, dims)
        $fname(dims::Tuple{Vararg{DimOrInd}}) = $fname(Float64, dims)
        $fname(::Type{T}, dims::NTuple{N, Union{Integer, OneTo}}) where {T,N} = $fname(T, map(to_dim, dims))
        function $fname(::Type{T}, dims::NTuple{N, Integer}) where {T,N}
            a = Array{T,N}(undef, dims)
            fill!(a, $felt(T))
            return a
        end
        function $fname(::Type{T}, dims::Tuple{}) where {T}
            a = Array{T}(undef)
            fill!(a, $felt(T))
            return a
        end
    end
end
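A quick way to check that reading (illustrative; numbers depend on the machine) is to time just the fill! on a freshly allocated, uninitialized buffer:
using BenchmarkTools
# Isolate the fill! cost from the allocation itself
@btime fill!(a, 0.0) setup = (a = Vector{Float64}(undef, 10^6));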
If the initialized values are actually used, the timings are roughly similar in Numpy and Julia:
%%timeit
...: a = np.zeros(1_000_000)
...: a[:] += 1
...:
1.78 ms ± 10.6 µs per loop
vs.
@btime begin
x = zeros(1_000_000)
x .+= 1
end
1.649 ms (2 allocations: 7.63 MiB)
Therefore I think @tim.holy is right - when the allocated zeros are actually used, the performance benefit of Numpy is gone.
I have spent some time reading Reddit posts in the past month. My impression is that most R users don’t believe Julia is faster than Rcpp and most Python users don’t believe Julia is faster than Numpy.
In a Reddit thread which I can no longer find, a Julia user said that if Julia code is not as fast as C, that should be considered a bug. Is this claim an exaggeration?
Julia provides more memory protections than C/C++, so I think Julia can come close when LLVM can optimize what you are doing. The @inbounds macro helps a lot here, but there are probably some unsafe, fast things you could do in C/C++ that you can’t duplicate in Julia.
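As a small illustration of the @inbounds point (a sketch, not a recommendation: you give up the bounds check, so the indices must be known to be valid; fast_sum is a made-up name):
# Summation with bounds checks skipped and SIMD enabled in the hot loop
function fast_sum(v)
    s = zero(eltype(v))
    @inbounds @simd for i in eachindex(v)
        s += v[i]
    end
    return s
end
fast_sum(rand(10^6))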
Try writing a matrix-multiplication algorithm in C/C++ that doesn’t get blown out of the water by the performance of
using LoopVectorization

function A_mul_B!(𝐂, 𝐀, 𝐁)
    @avx for m ∈ axes(𝐀,1), n ∈ axes(𝐁,2)
        𝐂mn = zero(eltype(𝐂))
        for k ∈ axes(𝐀,2)
            𝐂mn += 𝐀[m,k] * 𝐁[k,n]
        end
        𝐂[m,n] = 𝐂mn
    end
end
And then also compare how much code you had to write. See the benchmarks at https://chriselrod.github.io/LoopVectorization.jl/stable/examples/matrix_multiplication/#Matrix-Multiplication.
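For completeness, a hypothetical way to call and sanity-check the function above (assumes LoopVectorization and BenchmarkTools are installed):
using BenchmarkTools
A, B = rand(200, 200), rand(200, 200)
C = similar(A)
@btime A_mul_B!($C, $A, $B)
C ≈ A * B   # agrees with the built-in BLAS-backed multiply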
In my opinion yes, but not by all that much. Well-optimized C code that uses all the tricks in the book is still difficult to match. For fun I’ve tried porting libdeflate to Julia as literally as possible, and I still haven’t done better than 2-3 times slower. Probably the best Julia wizards can improve on that, though. For more pedestrian C code you can basically expect the same performance, and you will have a much easier time adding SIMD optimizations (if applicable) or multithreading.
When I port Matlab code with mixed-in pieces of C implementations to Julia, I never even consider keeping the C code as-is (I would if it were large, but that hasn’t been the case). In fact, porting the C parts tends to be easier than the Matlab code, even though you have to take care of 1-based versus 0-based indexing.