Hey, looks like I got pinged for a reply! I do fit into the strange cadre of people who started developing a library in Julia whose core functionality was later rewritten in C. So here is some insight into that decision:
- I had some interest from folks in the Python community, and currently C interoperability is more universal (that is, the original motivation was not performance);
- Until the integration with BinaryBuilder, the Julia package was maintained in parallel with the C development;
- For increased performance, the C library uses SIMD and OpenMP. For most applications, Julia’s vectorization and shared-memory parallelism capabilities are going to be sufficient, easy, and excellent. However, I found that the necessary vectorization paradigm is not one that can be turned on by a macro such as @avx; see the sketch after this list. (For reference, the algorithm is essentially multiple sweeps of Givens rotations between contiguous or nearly contiguous entries in a vector. Each sweep is nonlocal in memory, though, as it touches every entry in the vector. So increasing vectorization also requires increasing memory locality, which means collating multiple sweeps into larger blocked sweeps.) Looking at the source code, it is templated (using C macros) to encourage the compiler to use just the right vectorization intrinsics.
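To make the access pattern concrete, here is a minimal sketch of a single sweep (my illustration with a hypothetical function name, not the actual FastTransforms kernel):

```julia
# One sweep of Givens rotations over adjacent pairs of entries in x.
# The write to x[i+1] in iteration i is read in iteration i+1, so the
# loop carries a dependency from one rotation to the next; a loop-level
# macro cannot vectorize this without restructuring the sweeps into
# memory-local blocks.
function givens_sweep!(x::AbstractVector, c::AbstractVector, s::AbstractVector)
    @inbounds for i in 1:length(x)-1
        xi, xj = x[i], x[i+1]
        x[i]   =  c[i]*xi + s[i]*xj
        x[i+1] = -s[i]*xi + c[i]*xj
    end
    return x
end

θ = rand(7)
givens_sweep!(randn(8), cos.(θ), sin.(θ))
```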
Is rewriting a Julia package in C the solution for everyone? I think not.
Could this be a counter-example to the general consensus in the community that “Julia’s applicable context is getting narrower over time”? No. Rewriting in C, combined with the BinaryBuilder infrastructure, did not diminish the scope of the Julia package FastTransforms.jl.
Writing even a simple function in Julia is actually a meta-experience if the arguments are not type-annotated. This provides great ease of programming, and this abstraction often makes things work that were never conceived of by the programmer, let alone the user, which is sometimes really, really awesome! In numerics, however, it is often desirable, or even simply sufficient and acceptable, to have a strongly typed solution for IEEE floating-point types, so this generality is not always necessary.
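As a tiny illustration (my example, not from FastTransforms.jl), an un-annotated function is generic over every type that supports the operations it uses:

```julia
# Horner evaluation of a polynomial with coefficients c at the point x.
# No type annotations: the same source compiles specialized methods for
# Float64, BigFloat, Complex, dual numbers, ... anything with * and +.
function horner(x, c)
    s = c[end]
    for k in length(c)-1:-1:1
        s = muladd(x, s, c[k])
    end
    return s
end

horner(0.5, [1.0, 2.0, 3.0])            # Float64 method
horner(big"0.5", [big"1.0", big"2.0"])  # BigFloat method, same source
```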
There is still room for improvement in Julia, though. Improving `BigFloat` is a long-standing issue because safe memory handling implies that Julia controls `BigFloat` memory while C controls `mpfr_t` memory, and they cannot be mixed. This means that `sizeof(BigFloat) != sizeof(FastTransforms.mpfr_t)`.
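To see why the sizes differ, here is a rough sketch of the two layouts (the struct below is my illustration following the standard MPFR headers, not the actual definition in FastTransforms.jl):

```julia
# The C-side mpfr_t is four words: precision, sign, exponent, and a raw
# pointer to the limb (significand) data that MPFR itself allocates.
struct MPFRLayout
    prec::Clong
    sign::Cint
    exp::Clong
    d::Ptr{Culong}
end

# Julia’s BigFloat carries the same four fields plus a gc-managed buffer
# that owns the limb memory, so the two layouts cannot alias one another.
sizeof(MPFRLayout), sizeof(BigFloat)  # e.g. (32, 40) on a 64-bit build
```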
I did find that implementing an upper-triangular matrix * dense matrix multiplication for MPFR types was significantly faster than the generic linear algebra routines in Julia. So when folks mention that the performance of C and Julia is nearly the same because the LLVM compiler is the same, that is a reasonable general statement that nonetheless comes with exceptions.
For reference, the C source code is the cause of this triangular matrix * dense matrix multiplication speedup; here is a quick demonstration (the times and the disparity are so large that I don’t think there’s a need for BenchmarkTools.jl):
```julia
julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Xeon(R) W-2191B CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)

julia> using FastTransforms, LinearAlgebra

julia> n = 1024
1024

julia> p = plan_leg2cheb(BigFloat, n);

julia> Id = Matrix{BigFloat}(I, n, n);

julia> @time p*Id; # C library FastTransforms version, ccall'ed by the Julia wrapper
  3.583803 seconds (4.19 M allocations: 488.000 MiB, 3.58% gc time)

julia> P = UpperTriangular(p*Id);

julia> @time P*Id; # generic linear algebra in Julia
123.838721 seconds (2.15 G allocations: 112.008 GiB, 20.71% gc time)
```
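Much of the gap is visible in the allocation counts: the generic path appears to create a fresh BigFloat for essentially every intermediate product and sum, while the C library can work in place on preallocated mpfr_t data.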
Hope this helps,
Mikael