Making Julia as Fast as C++

This is an old post but it seems it hasn’t been discussed yet.

An interesting observation from the author is that non-integer powers in Julia were fairly slow at the time compared to C++. I am not sure whether that is still the case. He mentions that removing those operations sped up the Julia code by more than 10 times. That seems like a lot, but maybe non-integer powers really did have problems back then.
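One quick way to sanity-check the claim today (a rough sketch of mine using only Base; for serious measurements use BenchmarkTools, and note that `@elapsed` here includes compile time on the first call):

```julia
xs = rand(10^6)
t_int   = @elapsed sum(x -> x^3,   xs)   # integer literal power: lowered to multiplications
t_float = @elapsed sum(x -> x^2.5, xs)   # non-integer power: calls the generic pow routine
println("integer power: ", t_int, " s, non-integer power: ", t_float, " s")

# The algebraic identity behind the pow-to-sqrt rewrites discussed later in the thread:
@assert 1.2345^2.5 ≈ 1.2345^2 * sqrt(1.2345)
```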

Note that the post is from 2019, so it is fairly old, but Julia 1.0 had already been released by then.

A lot has changed since then (really, too much to enumerate), but the basic strategies remain relevant.

That said, it is unclear what you want to discuss about this blog post. Asking a specific question would focus the discussion.

Cf

which was closed 5 years ago.

Really interesting. Indeed it's a bit old, and I think they wanted to keep things standard Julia, but with a few packages you can beat the pythonic version by 97x without changing much:

using ConcreteStructs, LinearAlgebra, StaticArrays, BenchmarkTools
function generate_particles(PType, n, lambda; l=1, Gamma=SA[0.0,0.0,0.0])
    sigma = l/(n-1)*lambda
    particles = fill(zero(PType), n^3)
    xs = range(0, stop=l, length=n)
    # Adds particles in a regular lattice
    ind = 1
    for k in 1:n
        for j in 1:n
            for i in 1:n
                X = SA[xs[i], xs[j], xs[k]]
                particles[ind] = PType(X, Gamma, sigma)
                ind += 1
            end
        end
    end

    return particles
end
"""
This is a particle struct made up of ambiguous (unspecified) types
"""
@concrete struct ParticleAmbiguous
    # User inputs
    X                               # Position
    Gamma                           # Vectorial circulation
    sigma                           # Smoothing radius
    # Properties
    U                               # Velocity at particle
    J                               # Jacobian at particle J[i,j]=dUi/dxj
end
ParticleAmbiguous(X, Gamma, sigma; U=zeros(3), J=zeros(3,3)) = ParticleAmbiguous(X, Gamma, sigma, U, J)
# Empty initializer
Base.zero(::Type{<:ParticleAmbiguous}) = ParticleAmbiguous(SA[0.0,0.0,0.0], SA[0.0,0.0,0.0], 0.0)
# Winckelmans regularizing kernel
g_wnk(r) = r^3 * (r^2 + 2.5) / (r^2 + 1)^2.5
dgdr_wnk(r) = 7.5 * r^2 / (r^2 + 1)^3.5
const const4 = 1/(4*pi)
"""
    Pythonic programming approach
"""
@views @fastmath function P2P_pythonic(particles, g, dgdr)
    for Pi in particles
        @inbounds for Pj in particles    
            dX = Pi.X - Pj.X
            r = norm(dX)
            if r != 0
                gsgm = g(r / Pj.sigma)
                dgsgmdr = dgdr(r / Pj.sigma)
                crss = cross(-const4 * (dX/r^3), Pj.Gamma)
                Pi.U .+= gsgm * crss
                @inbounds @simd for j in 1:3
                    Pi.J[:, j] += ( dX[j] / (Pj.sigma*r) * (dgsgmdr*crss) -
                        gsgm * (3*dX[j]/r^2) * crss -
                        gsgm * (const4/r^3) * cross(SVector{3}(i==j for i in 1:3), Pj.Gamma) )
                end
            end
        end
    end
end
particles_amb = generate_particles(ParticleAmbiguous, 6, 1.0)
@btime P2P_pythonic($particles_amb, $g_wnk, $dgdr_wnk) # 2ms

I don’t know how it compares with the C++; if you want to try, go ahead.

Thanks. It is useful to know that Julia can now easily generate code with performance similar to C++.

And I didn’t know the package ConcreteStructs. Thanks a lot.

It is a nice convenience package, but it just makes it easier to write idiomatic Julia code; it is not the source of any speed improvement per se. The key is StaticArrays, which does the unrolling etc.
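A minimal illustration of what StaticArrays buys you (my example, not from the post): the array length is part of the type, so operations on small vectors unroll into straight-line code with no heap allocation.

```julia
using StaticArrays, LinearAlgebra

a = SA[1.0, 2.0, 3.0]   # SVector{3,Float64}: the size 3 lives in the type
b = SA[4.0, 5.0, 6.0]
c = cross(a, b)         # unrolls to a fixed sequence of mul/sub, no allocation
@assert c == SA[-3.0, 6.0, -3.0]
@assert norm(a) ≈ sqrt(14.0)
```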

Actually, without it all the fields would be inferred as Any, but I think you meant that you could just parametrize everything explicitly, which is true.

Yes. It is of course tedious.
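For the record, the tedious hand-written version looks roughly like this (my sketch of what ConcreteStructs generates; the struct and parameter names are mine):

```julia
# One type parameter per field, so every field is concretely typed
# per instance instead of being inferred as Any.
struct ParticleParametric{TX,TG,TS,TU,TJ}
    X::TX          # Position
    Gamma::TG      # Vectorial circulation
    sigma::TS      # Smoothing radius
    U::TU          # Velocity at particle
    J::TJ          # Jacobian at particle
end

p = ParticleParametric([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0.1, zeros(3), zeros(3, 3))
@assert fieldtype(typeof(p), :sigma) === Float64          # concrete, not Any
@assert fieldtype(typeof(p), :X) === Vector{Float64}
```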

Building on Tamas_Papp's point: the basic optimization strategies remain relevant regardless of the language. Tools like ConcreteStructs and StaticArrays are fantastic because they make it much easier to write idiomatic code that is fast, by giving the compiler the type stability and memory layout it needs to do its job.

I think the original post also highlights how hard it is to do these cross-language comparisons well: it’s easy to measure the difference between specific implementations of algorithms, but that is somewhat separate from the question of how close to language / hardware limits those implementations actually are.

For instance, even from a quick glance at the C++ code used in that post, there are obvious low-hanging optimizations to be made. With only a small amount of work one can get benchmarking results like:

Activating project at `~/test/post-juliavscpp/code`
Benchmarking Original Julia...
Benchmarking Better Julia...
Benchmarking Original C++...
Benchmarking Better C++...

=================================================================
Version                        | Time (ms)  | Speedup
-----------------------------------------------------------------
Julia (final post version)     |      2.961 |      1.00x
Julia (yolhan_mannes version)  |      2.354 |      1.26x
C++ (post version)             |      1.210 |      2.45x
C++ (better version)           |      0.400 |      7.40x

(All the usual caveats: precise profiling is hard, this is on my hardware, etc etc.)

I’m not trying to declare a “winner”: I’ve been tremendously enjoying my experience with Julia, but I’ve been using it for almost an order of magnitude fewer years than I’ve been writing code in C/C++. I suspect that an expert could look at the Julia implementations and produce a yet-better version. I guess I just want to reinforce that the benefits of these posts probably lie in the broad principles for optimization within and across languages, while a lot of the specific details of the “X is slower/faster/as-fast-as Y!” discourse are often more about developer optimization effort.

I agree, it’s always a skill issue. What I like about Julia is that you can go from straightforward but insanely slow code to very well optimized code in the same language. It also means the Julia learning curve can feel overwhelming, but still, I don’t know many languages where you can do that without leaving the language.
Also, if you’re not happy, just copy-paste the LLVM IR from clang and there’s your performance (as if it were that easy) :joy:

The benchmark you show is interesting. Your better C++ code is not only faster than the Julia version by at least 5 times, it is also faster than the old C++ version by 3 times. What did you do to achieve this?

Just a grab bag of what I think are pretty standard techniques (which work across languages): hoist invariants out of loops (and other applications of the “don’t compute the same thing twice” principle), use an SoA layout of data instead of AoS, etc.
For this particular code, it’s also worth noticing that the combination of std::pow(..., 2.5) and the compiler flag -O3 is a pessimization (even if you’re also using the -ffast-math flag, depending on the compiler).
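Toy sketches of both techniques in Julia (illustrative examples of mine, not the actual benchmark code):

```julia
# Hoisting an invariant out of a loop: the sqrt depends only on
# sigma, so compute it once instead of once per element.
function scale_hoisted(xs, sigma)
    s = 1 / sqrt(sigma^2 + 1)     # loop-invariant, computed once
    return [x * s for x in xs]
end
@assert scale_hoisted([3.0, 4.0], 0.0) == [3.0, 4.0]

# AoS vs SoA: keeping each field in its own contiguous array is
# friendlier to SIMD and cache than an array of structs.
struct PointAoS
    x::Float64
    y::Float64
end
aos   = [PointAoS(i, 2i) for i in 1.0:4.0]   # array of structs
soa_x = [p.x for p in aos]                   # struct of arrays: x column
soa_y = [p.y for p in aos]                   # y column
# Summing one field now touches only that field's memory:
@assert sum(soa_x) == sum(p.x for p in aos)
```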

Awesome, so maybe the same optimization can be applied to the Julia code.

What do you mean by “the combination of std::pow(...,2.5) and compiler flag -O3 is a pessimization”?

Yes, I thought I could keep it simple:

using StructArrays  # provides StructArray; not loaded in the snippet above
AA = StructArray(particles_amb)
@btime P2P_pythonic($AA, $g_wnk, $dgdr_wnk)

but it’s not that easy; you would actually need to change the layout a lot, because this produces a vector of vectors and we don’t gain anything.
A way to cheat (though it becomes unfair) is:

@views @fastmath function P2P_pythonic(particles, g, dgdr)
    Threads.@threads for Pi in particles
        @inbounds for Pj in particles    
            dX = Pi.X - Pj.X
            r = norm(dX)
            if r != 0
                gsgm = g(r / Pj.sigma)
                dgsgmdr = dgdr(r / Pj.sigma)
                crss = cross(-const4 * (dX/r^3), Pj.Gamma)
                Pi.U .+= gsgm * crss
                @inbounds @simd for j in 1:3
                    Pi.J[:, j] += ( dX[j] / (Pj.sigma*r) * (dgsgmdr*crss) -
                        gsgm * (3*dX[j]/r^2) * crss -
                        gsgm * (const4/r^3) * cross(SVector{3}(i==j for i in 1:3), Pj.Gamma) )
                end
            end
        end
    end
end

and I get the 0.4 ms of the C++, but I guess the C++ is single-threaded

There are some small tweaks you can do to make the thing even faster

  • Use mutable MArrays instead of the ordinary heap-allocated arrays:
ParticleAmbiguous(X, Gamma, sigma; U=@MArray(zeros(3)), J=@MArray(zeros(3,3))) = ParticleAmbiguous(X, Gamma, sigma, U, J)
  • Remove @views: you do not need them anymore, since MArrays are stack-allocated
  • You can also remove @inbounds and @simd on the inner loop for the sake of cleanliness: it is too short to be properly vectorized, and the @inbounds is redundant because the @inbounds on the outer loop already covers it

On my machine this reduced the runtime from 2.13 ms to 1.96 ms (Apple M2 pro).
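For reference, a minimal sketch of the MArray idea (my example, assuming StaticArrays is loaded):

```julia
using StaticArrays

U = @MArray zeros(3)       # fixed-size AND mutable, unlike an SVector
U .+= SA[1.0, 2.0, 3.0]    # in-place update, no reallocation of U
@assert U == SA[1.0, 2.0, 3.0]

J = @MArray zeros(3, 3)    # same idea for the 3x3 Jacobian
J[1, 1] = 1.0
@assert J[1, 1] == 1.0
```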

As the author of the post stated, there is not much else we can do, since the power function is the heaviest remaining workload. However, we can do this:

_pow2d5(x) = x^2 * sqrt(x) # x^2.5
_pow3d5(x) = x^3 * sqrt(x) # x^3.5
g_wnk_fast(r) = r^3 * (r^2 + 2.5) / _pow2d5(r^2 + 1)
dgdr_wnk_fast(r) = 7.5 * r^2 / _pow3d5(r^2 + 1)

This yields 0.67 ms on my machine, close enough. Seriously though, this is generally not a very good idea; you may encounter precision loss.

The std::pow function, in general, is required to do extra work (handle subnormals, deal with zero and negatives, set errno), and here we can get away with using the optimized std::sqrt function.

Using only -O3, compilers are not allowed to replace pow(x, 2.5) with x*x*sqrt(x). Modern gcc and clang are pretty good about making this optimization if you also use the -ffast-math flag, but the way this interacts with vectorized versions of functions is tricky.

(Also, yes: I wrote a single-threaded C++ version, to keep things easy.)

Is the concern about precision correct? I’m always happy to be corrected, but I was under the impression that sqrt is guaranteed to be rounded correctly to 0.5ULP, whereas exponentiation in general isn’t… In this case I would expect the replacement of the pow with sqrt to be a change that is both safe and fast.

They will be roughly equally accurate.

To be honest, the code just smelled slightly off to me; I was not completely sure. However, we can run a test using BigFloats:

using Statistics
A = rand(10000) .+ 4
Ab = big.(A)
res = @. sqrt(Ab ^ 7) # let's consider this ideally precise

std(@. res - A^3.5)              # 9.5e-15
std(@. res - A^3 * sqrt(A))      # 2.0e-14

2X precision difference :grinning_face:

So which of these fixes and suggestions are still valid these days?

We’re outside of my area of expertise, so (a) happy to have an excuse to play around and learn a bit and (b) would love to hear from someone on this forum who knows more! My understanding was that sqrt itself is required by IEEE to be accurate to 0.5 ULP, and that as a transcendental function pow is recommended but not mandated to be accurate to the same level (and hence whether it is or not is implementation-defined).

I like the idea of using BigFloat as an oracle to easily measure how accurate different versions are. At a minimum, though, I think we should be measuring in ULPs relative to the answer. Something like:

using Statistics

function ulp_test(f, domain; n_samples=100000)
    a, b = extrema(domain)
    x = a .+ rand(Float64, n_samples) .* (b - a)
    vals = f.(x)
    big_float_x = BigFloat.(x)
    exact_vals = f.(big_float_x) # "exact"
    ulp_scale = eps.(Float64.(exact_vals)) # measure ULP relative to answer
    ulp_errors = (vals .- exact_vals) ./ ulp_scale

    return (
        std_ulp = Float64(std(ulp_errors)),
        max_abs_ulp = Float64(maximum(abs, ulp_errors))
    )
end

Running a few tests:

julia> ulp_test(x->sqrt(x),(0.0,1.0))
(std_ulp = 0.28975401566751907, max_abs_ulp = 0.49999874016920975)

julia> ulp_test(x->x^pi,(0.0,1.0))
(std_ulp = 0.875775912554217, max_abs_ulp = 9.312060290475852)

julia> ulp_test(x->x^3.1415926,(0.0,1.0))
(std_ulp = 0.2892737541146482, max_abs_ulp = 0.5210410859100494)

So far, as expected: the sqrt function rounds correctly, and exponentiation doesn’t have to, according to this (probably not-quite-correct) function. On the other hand, switching to exponentiation by a quantity that is more clearly a float looks basically fine.

Finally, we can look at the question at hand:

julia> ulp_test(x->x*x*x*sqrt(x),(0.0,100.0))
(std_ulp = 0.6003549004331392, max_abs_ulp = 2.5012293636064946)

julia> ulp_test(x->x^3.5,(0.0,100.0))
(std_ulp = 0.2887374737640995, max_abs_ulp = 0.5205982181157858)

Quite clear: in the first (faster) version, the ULP errors accumulate over all of the internal operations used to build up the expression, as one would expect. In the second version, Julia is clearly using a slower exponentiation routine that maintains accuracy over the domain.

Whether this difference in accuracy is relevant is, of course, domain-specific.