This post claims that Julia is still 4-5 times slower than Cython

Couldn’t help noticing this fact. Since both NumPy and Cython are C-based, we have to conclude that C (Cython) is ~300 times faster than C (NumPy). :smile:

Note: sorry if my quote makes it seem that the quote is from @Henrique_Becker. It’s not; he just quoted the original.

3 Likes

It looks like he’s using PyJulia in a notebook, which lets him conveniently generate data in Python and import it into the Julia environment. But I wonder whether it would be faster if you ran Julia from, say, Visual Studio.

If it is just a Jupyter notebook, I think not. Or is he doing something different?

Just skimmed the post so I won’t have any big insights here, but I noticed that:

  1. OP uses @timed once, so they are probably picking up compilation time. I think they may be counting compilation time on purpose; they do it with Numba (also JIT, LLVM) too, though that benchmark runs it 10 times and divides the total runtime by 10. EDIT 2: Upon closer reading, OP runs the Numba and Julia code once to get the JIT compilation out of the way prior to benchmarking.

  2. the latest Julia code still does @inbounds loops instead of in-place broadcasting (I think the underlying loop is @inbounds @simd, if my reading of copyto! in broadcast.jl is right), which I think was the wrong way to go for this code (they rewrote vectorized code as loops because it helped in Numba and Pythran). EDIT 1: actually, a loop does elide the allocation of a temporary array before summing; is there a lazy broadcast-reduce?

  3. I can’t really tell what the input data is exactly, but OP does say they’re working with Complex{Float64} (or double complex in C-speak), yet they initialize with Xest = 0+0im in all the loops, which is a Complex{Int} and makes the accumulator type-unstable (see the sketch below).
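On point 3, a minimal sketch of the issue and the type-stable alternatives (not from the blog post):

Xest = 0 + 0im           # Complex{Int64}: changes type on the first += of a ComplexF64
Xest = zero(ComplexF64)  # 0.0 + 0.0im, type-stable
Xest = zero(eltype(E))   # generic: matches the element type of the input array E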

Well, common sense AND some programming knowledge. I barely know anything about computers, and I started using Julia with even less knowledge. Specializations, type stability, static vs dynamic dispatch, pointers, allocations, garbage collection spikes, SIMD, I had no idea what these were because all I had ever used was Python, a language that abstracted all these details away. So I completely relate to a lot of people who couldn’t make it past “walks like Python, runs like C”.

Funnily enough, playing with Julia taught me what to look out for when I had to do a bit of Numba. Frustrating, too, because there’s a lot of stuff I couldn’t do in Numba that would be trivial in Julia.

12 Likes

I think he mentions that he has taken the second run as the measurement.

The loops are fine but have made the code much more bloated than needed. You can do that with mapreduce, or simply by using an iterator as the argument of sum, e.g. something like sum( exp(x) for x in y ).
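For example, all of these compute the same reduction, but only the first materializes the temporary array exp.(y) (a sketch, assuming y is a numeric array):

y = rand(100)
sum(exp.(y))            # allocates exp.(y) before summing
sum(exp(x) for x in y)  # generator: no temporary
mapreduce(exp, +, y)    # same result via mapreduce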

That’s why nobody provides a better version… Too cumbersome to actually run an example.

1 Like

Using my knowledge of the problem domain, I could generate more or less the same input data. Running in 9 ms, I bet the following Julia version beats all the versions published in that blog post. In my experience, properly written Julia code matches the best industry compilers (ifort) and many times even beats them.

function apply_filter_jl(E, wx)
    Xest = 0.0 + 0.0im                 # type-stable ComplexF64 accumulator
    for j = 1:size(E,2)
        @simd for i = 1:size(E,1)
            Xest += E[i,j] * wx[i,j]'  # ' conjugates the scalar filter tap
        end
    end
    return Xest
end

@views @fastmath function cma_jl(E, wxy, mu, R, os)
    L, pols = size(E)
    ntaps = size(wxy,1)
    N = Int( floor( (L/os/ntaps-1)*ntaps ) )    # number of equalized symbols
    err = Matrix{ComplexF64}(undef, L, pols)
    for k = 1:pols
        for i = 1:N
            X = E[i*os-1:i*os+ntaps-2, :]       # a view, thanks to @views
            Xest = apply_filter_jl(X, wxy[:,:, k])
            err[i,k] = (R-abs2(Xest)) * Xest    # CMA error term
            wxy[:,:,k] .+= mu .* err[i,k]' .* X # in-place tap update
        end
    end
    return err, wxy
end

L = 10^5  # number of symbols
SNR = 10
sig = 1/sqrt(2).*rand((1+1im, 1-1im, -1-1im, -1+1im), L, 2)
sigPower = sum(abs2, sig) / L
noisePower = sigPower / SNR
noise = sqrt(noisePower) .* randn(ComplexF64, L, 2)
s4 = sig .+ noise

wxy0 = zeros(ComplexF64,21,2,2)
wxy0[22÷2, 1, 1] = 1
wxy0[22÷2, 2, 2] = 1
wxyin = copy(wxy0)

using BenchmarkTools: @btime  # needed for the benchmark below
@btime cma_jl($s4, $wxyin, 1e-3, 1, 2)
  # 9.942 ms (2 allocations: 3.05 MiB)
13 Likes

Isn’t apply_filter just dot, with the arguments reversed?

3 Likes

I think yes, but I’m not sure how this can help improve the performance.

It’s barely faster for me. But I meant it mostly for clarity & compactness. Writing and carefully optimising loops when there’s a fast standard function which does exactly this seems like a practice to be discouraged.
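A quick sanity check of the equivalence, reusing apply_filter_jl from above (dot conjugates its first argument, matching the wx[i,j]' in the loop):

using LinearAlgebra
E, wx = rand(ComplexF64, 21, 2), rand(ComplexF64, 21, 2)
apply_filter_jl(E, wx) ≈ dot(wx, E)   # true: both compute sum(conj.(wx) .* E)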

In fact, can’t you skip the k loop entirely, and write something like:

    mat_wxy = reshape(wxy, :, size(wxy, 3))   # flatten each ntaps×pols filter into a column
    for i = 1:N
        vec_X = vec(E[i*os-1:i*os+ntaps-2, :])
        Xest_k = mat_wxy' * vec_X              # both filter outputs at once (' conjugates)
        err[i, :] .= (R .- abs2.(Xest_k)) .* Xest_k
        mat_wxy .+= mu .* vec_X .* err[i, :]'  # rank-1 update of all filters together
    end
I guess that won’t be so fast, since vec_X is a reshaped view. But if E had its dimensions permuted, then it would be.

How fast were the blog’s variants, on the same computer, BTW?

2 Likes

I hope you stay sane given that more beginners are trying out Julia every day :upside_down_face:

Jokes aside, it appears that the poster welcomes more optimization ideas. Here’s an excerpt from the blog post’s conclusion:

Still a factor of 4-5 away from cython and pythran and it would be interesting to see where that difference is coming from (I’m sure I’m doing something non-optimal and if you know what please let me know). I was probably a bit harsh on Julia in my previous conclusion, this just goes to show that one needs to know what one is doing if you want to get the most out of a language.

I think the second part is an important observation that has been brought up by many people already, i.e. one has to learn a few things in order to make things run fast in Julia. It makes sense to me that Julia should continue to improve in these areas:

  1. More compiler optimizations that just magically make naive code run fast by default;
  2. Development tooling that can identify common performance issues and suggest code changes (see the sketch below).
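Some of that tooling already exists; a minimal sketch using @code_warntype (from InteractiveUtils) to expose the type-unstable accumulator discussed earlier, with a made-up function name:

using InteractiveUtils  # provides @code_warntype

function unstable_sum(E)
    Xest = 0 + 0im      # Complex{Int64} accumulator
    for x in E
        Xest += x       # widens to ComplexF64 on the first iteration
    end
    return Xest
end

@code_warntype unstable_sum(rand(ComplexF64, 10))
# Xest is reported as Union{Complex{Int64}, ComplexF64} and flagged as a problem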
10 Likes

Totally agree. I think when we achieve the promises made by the Julia developers here, especially Array Optimizations, most of these initial performance issues will go away.

2 Likes

With dot in place of the loop, the version by @Seif_Shebl is probably both faster and cleaner. Pointing the author of the post to this thread would probably be nice.

As @tk3369 mentions, these things will become more and more common, and showing how things should be done is always a productive exercise, both for us and for new users.

6 Likes

Agreed.

As much as I appreciate your response, it makes use of several macros. I’ve seen them before, in this forum, but I haven’t used any of them:

@views
@fastmath
@simd

I think many people would want Julia to generate fast code for “naive” code. There’s a limit, of course: row order/column order, pre-allocating, etc. However, I claim that these represent things that even moderately experienced coders familiar with numerical work would know.

The question is: am I going to get code that is a factor of 4 or 5 away from C-like performance if I don’t use things like @simd? Because if that’s the case, especially for what looks to me like a very straightforward looped computation, that’s not really a good thing.

I don’t think @simd and @fastmath actually make a difference in this case, although it should be good to add @inbounds. See my test result here (7.39 ms on my Apple M1 laptop).

The main thing is @views, which avoids allocating temporaries.
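A minimal sketch (not from the thread) of what @views changes, with a hypothetical slice:

using BenchmarkTools

A = rand(ComplexF64, 10^4, 2)

slice_sum(A) = sum(A[10:5000, :])          # the slice allocates a copy
view_sum(A)  = @views sum(A[10:5000, :])   # the view reads A in place, no copy

@btime slice_sum($A)   # reports allocations from the copy
@btime view_sum($A)    # 0 allocations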

4 Likes

Neither is @inbounds needed (nor does it make any difference here) if you use the dot version, which is actually slightly faster:

using LinearAlgebra: dot

apply_filter_jl3(E, wx) = dot(wx, E)   # dot conjugates its first argument

@views function cma_jl3(E, wxy, mu, R, os)
    L, pols = size(E)
    ntaps = size(wxy,1)
    N = Int( floor( (L/os/ntaps-1)*ntaps ) )
    err = zeros(ComplexF64,L,pols)
    for k = 1:pols
        for i = 1:N
            X = E[i*os-1:i*os+ntaps-2, :]
            Xest = apply_filter_jl3(X, wxy[:,:, k])
            err[i,k] = (R-abs2(Xest)) * Xest
            wxy[:,:,k] .+= mu .* err[i,k]' .* X
        end
    end
    return err, wxy
end

(I also changed the undef initialization of err to zeros, which is syntactically simpler.)

1 Like

Whether Julia approaches C via @simd specifically can depend on whether the C code itself is written with SIMD intrinsics (special methods and types) or manages to compile that way through automatic vectorization. Julia is a higher-level language than C, so a macro that transforms the code is more characteristic of it than manually rewriting with special methods and types.

Currently the compiler automatically applies SIMD to inner loops in some cases, and the @simd macro is there if you’ve written the loop in a way that gives the compiler more leeway to vectorize. It’s not the only macro for setting up SIMD, either: LoopVectorization.jl’s @turbo does that, among other things.
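For concreteness, a minimal sketch (not from the thread) of both annotations on a plain reduction; the @turbo version assumes LoopVectorization.jl is installed:

using LoopVectorization  # provides @turbo

function sum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)   # relaxes ordering so the compiler may vectorize
        @inbounds s += x[i]
    end
    return s
end

function sum_turbo(x)
    s = zero(eltype(x))
    @turbo for i in eachindex(x)  # LoopVectorization emits the SIMD kernel itself
        s += x[i]
    end
    return s
end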

End users often won’t have to care about SIMD because methods are often designed to leverage SIMD and abstract this detail away. The inner workings of broadcasting in broadcast.jl are annotated with @simd, for instance.

3 Likes

As others have said, you don’t really need all these annotations; in fact only @views is required, and you get the same timings compiling the code with julia -O3 --check-bounds=no. Also, @simd is not required in most simple cases, as the compiler will apply it automatically. That said, Julia prefers to do these things explicitly by adding macros instead of enabling a global compiler option that can be dangerous in cases where accuracy is required. So, Julia’s choices at the moment help prevent silent code breaks that can be costly. Also, the Julia developers are making big strides toward making most of these issues vanish with time. As an example, before Julia 1.5, array views were allocating, but from 1.5 onward they generally don’t. Have a look at the to-do list I linked above from the 2021 talk by the core developers of Julia. These optimizations are very promising, and many issues will disappear when they get done.
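As a small illustration (not from the thread) of why fast-math stays opt-in: @fastmath licenses reassociation that can change floating-point results:

f(x) = @fastmath x + 100.0 - 100.0   # may be folded to just x
g(x) = x + 100.0 - 100.0             # strict IEEE: rounds after each operation

f(1e-20), g(1e-20)                   # likely (1.0e-20, 0.0)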

7 Likes