Julia slower than Matlab & Python? No

So true :joy:
https://meta.wikimedia.org/wiki/Cunningham's_Law

12 Likes

Wouldn’t you get it after step 3?

For what it’s worth:

julia> using LoopVectorization, BenchmarkTools

julia> C = rand(100,100,100);

julia> D = similar(C);

julia> @btime @avx $D .= $C.^0.3;
  5.376 ms (0 allocations: 0 bytes)

julia> @btime $D .= $C.^0.3;
  13.440 ms (0 allocations: 0 bytes)

julia> vC = vec(C); vD = vec(D);

julia> @btime @avx $vD .= $vC.^0.3;
  4.555 ms (0 allocations: 0 bytes)

This requires LoopVectorization version 0.3.3, which should merge shortly (there was a bug in earlier versions that I’ve fixed).
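For reference, you can confirm which version actually ended up in your environment before comparing timings:

using Pkg
Pkg.status()   # look for the LoopVectorization entry and its version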

EDIT: If you’re willing to give up a little more accuracy:

julia> @btime @avx ($vD .= log.($vC); $vD .= exp.(0.3 .* $vD));
  2.175 ms (0 allocations: 0 bytes)

julia> vD ≈ vC .^ 0.3
true

Less accurate, but still isapprox correct.
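As a rough way to put a number on how much accuracy the exp/log rewrite gives up (this quick check is mine, not part of the benchmark above):

x = rand(10_000);
maximum(abs.(exp.(0.3 .* log.(x)) .- x .^ 0.3) ./ (x .^ 0.3))   # worst-case relative error over this sample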

4 Likes

I just ran ] add LoopVectorization#master and tried:

using BenchmarkTools, LoopVectorization

a = rand(100,100,100);
b = rand(100,100,1); # a and b are not the same size.

@btime @avx a .+ b; # Crashing...

It crashes when the two arrays are not the same size. Do I need to file a report?

Additionally I tried (on Windows 10, Intel i7):

a = rand(100,100,100);
b = rand(100,100,100);

@btime a .+ b; # 2.241 ms (4 allocations: 7.63 MiB)

@btime @avx a .+ b; # 2.904 ms (4 allocations: 7.63 MiB)

@btime @strided a .+ b; # 1.116 ms (72 allocations: 7.64 MiB)

No, that behaviour is expected. LoopVectorization is meant for ultra-high-performance, low-safety operations. You need to verify correctness before sticking something in @avx, and even if it runs, you need to manually make sure it’s giving you correct results.
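For instance, a minimal sanity check along those lines (same-sized arrays, so the broadcast itself is safe):

using LoopVectorization, Test

a = rand(100, 100, 100);
b = rand(100, 100, 100);

reference = a .+ b;      # plain broadcast as the ground truth
fast = @avx a .+ b;      # the low-safety fast path
@test reference ≈ fast   # only keep @avx once this passes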

As for your timing against Strided, that’s also expected. Vector addition is already pretty well vectorized, so you won’t see any great advantage from LoopVectorization, but Strided does multithreading, so that’s consistent with the ~2x performance increase you observed.

3 Likes

Oh, I see. Thanks!

Strided does multithreading, so that’s consistent with the ~2x performance increase you observed.

GitHub - Jutho/Strided.jl: A Julia package for strided array views and efficient manipulations thereof: the example in the README shows Strided is effective even in a single thread, which is contrary to my experiment. I wonder if you have any idea about this.

EDIT: Sorry, I tried again and found that @strided is effective even in a single thread. I wonder how that is even possible and why it is not the default algorithm in Julia.

EDIT2: Another observation: it actually depends on the sizes of the arrays being broadcast. When A and B are the same size, @strided is effective in a single thread too. But when the sizes are different, it is only effective with multiple threads. Oh, I have bothered you enough… I need to stop. :slight_smile:

using Strided, BenchmarkTools

A = randn(4000,4000);

B = rand(4000, 1)

@btime ($A .+ $B') ./ 2; 

@btime @strided ($A .+ $B') ./ 2;

Note that had you used:

a = rand(100,100,100);
b = rand(100,100);

It would have worked.
I added a couple tests showing workarounds.
For now, you have to indicate missing dimensions with type information. On LoopVectorization 0.3.4:

using LoopVectorization, Test
T = Float64
a = rand(T,100,100,100);
b = rand(T,100,100,1);
bl = LowDimArray{(true,true,false)}(b);
br = reshape(b, (100,100));
c1 = a .+ b;
c2 = @avx a .+ bl;
@test c1 ≈ c2
fill!(c2, 99999.9);
@avx c2 .= a .+ br;
@test c1 ≈ c2
br = reshape(b, (100,1,100));
bl = LowDimArray{(true,false,true)}(br);
@. c1 = a + br;
fill!(c2, 99999.9);
@avx @. c2 = a + bl;
@test c1 ≈ c2
br = reshape(b, (1,100,100));
bl = LowDimArray{(false,true,true)}(br)
@. c1 = a + br;
fill!(c2, 99999.9);
@avx @. c2 = a + bl;
@test c1 ≈ c2

If I wanted to fix it automatically, one approach would be to use a different stridedpointer function when broadcasting that sets strides corresponding to dims of size 1 to zero, something like:

@inline function broadcaststridedpointer(A::DenseArray{T,N}) where {T,N}
    stridesA = strides(A)
    sizeA = size(A)
    PackedStridedPointer(
        pointer(A),
        ntuple(n -> sizeA[n+1] == 1 ? 0 : stridesA[n+1], Val(N-1))
    )
end

However, if it is the first axis that needs a zero stride, this will not work, because the pointer type guarantees unit stride in the first dimension. Lifting that guarantee would require dispatching to gather and scatter! instead of vload and vstore!. The former are much slower, but I use them when the types show it is necessary (e.g., a SubArray with an Int as the first index instead of a range or colon).
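For reference, this is the plain-Julia situation being referred to; it just shows the stride pattern that rules out contiguous loads (an illustration only, not LoopVectorization internals):

A = rand(4, 5, 6);
v = view(A, 1, :, :);   # SubArray with an Int as the first index
strides(v)              # (4, 20): no unit stride in the first dimension,
                        # so contiguous vector loads/stores are not possible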

I also improved performance on this test case. It was unrolling by 5, which always seems to cause bad performance. For now I capped it at 4, but perhaps I should disallow 3 as well.

I’m still fiddling with tuning the computational optimizations.
It does not consider memory optimizations yet, so there currently isn’t much point in benchmarking / evaluating it on large, memory-bound problems (but it will still help for vectorizing things like ^, exp, or log).
Perhaps I should add a naive hack to make it at least favor putting loops in the correct order when the orders are computationally equivalent, to help until I implement a real model of memory-access costs.

The + example is easy enough for the compiler to optimize, and at those sizes it should be memory bound. Same story with (A .+ B') ./ 2.
Unsurprisingly, I see about the same performance:

julia> using LoopVectorization, BenchmarkTools, Strided
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> A = randn(4000,4000); B = rand(4000);

julia> @btime @avx ($A .+ $B') ./ 2;
  26.284 ms (2 allocations: 122.07 MiB)

julia> @btime ($A .+ $B') ./ 2;
  26.389 ms (2 allocations: 122.07 MiB)

julia> @btime @strided ($A .+ $B') ./ 2;
  26.338 ms (6 allocations: 122.07 MiB)

julia> Threads.nthreads()
1
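As a rough back-of-the-envelope check that this really is memory bound (my arithmetic, using the ~26 ms timings above):

bytes_read    = 4000 * 4000 * 8              # reading A (B' is negligible): ~122 MiB
bytes_written = 4000 * 4000 * 8              # writing the newly allocated result
(bytes_read + bytes_written) / 2^30 / 0.026  # ≈ 9 GiB/s, before counting write-allocate traffic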

Although taking the first example from Strided’s README:

julia> Threads.nthreads()
1

julia> B = similar(A);

julia> @btime @strided $B .= ($A .+ $A') ./ 2;
  61.788 ms (18 allocations: 848 bytes)

julia> @btime @avx $B .= ($A .+ $A') ./ 2;
  48.441 ms (0 allocations: 0 bytes)

julia> @btime $B .= ($A .+ $A') ./ 2;
  67.328 ms (0 allocations: 0 bytes)

@avx uses gather instructions to load from the transposed matrix, so this might require AVX512 to be fast.

If we have more threads though, Strided obviously wins by a substantial margin:

julia> Threads.nthreads()
18

julia> @btime @strided $B .= ($A .+ $A') ./ 2;
  7.965 ms (120 allocations: 16.47 KiB)
6 Likes

Simply inlining the iterate function and changing

        c = @. y .- Qnext .* Bnext .+ B
        c[c .<= 0] = 1e-14 .+ 0 * c[c .<= 0]

to

 c = @. y - Qnext * Bnext + B  |>  x -> x > 0.0 ? x : 1e-14

gives a 2X speed increase for me.
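Here is a standalone sketch of the same change, with clamp_pos as an illustrative helper name (not from the original code):

clamp_pos(x) = x > 0.0 ? x : 1e-14   # same branch as the edited line

x = randn(1000);

# two-pass, allocating style of the original line:
y1 = copy(x);
y1[y1 .<= 0] .= 1e-14;

# single fused broadcast, style of the edited line:
y2 = clamp_pos.(x);

y1 == y2   # true: same values, but no mask array and no temporaries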

Here is the code (I recreated the “data” to make it a minimal working example):

using Statistics

# Reproduce data:

mu = collect(range(-0.03, 1.03, length=256))
mu[1] = -0.0789; mu[end] = 1.0789
Py2 = zeros(256, 256)
x = range(0, 1, length=256)
h = vcat(0.679, ones(254)*0.0287, 0.679)
sigma = vcat(0.0908, ones(254)*0.0816, 0.0908)
for i=1:256
    Py2[:, i] = h[i]*exp.(-((x.-mu[i])/sigma[i]).^2)
end

const Py = copy(Py2)
const logy_grid = range(-0.2293084801321751, 0.2293084801321751, length=256)

function main_original(nB, repeats)
    β = .953
    γ = 2.
    r = 0.017
    θ = 0.282
    ny = size(logy_grid, 1)

    Bgrid = LinRange(-.45, .45, nB)
    ygrid = exp.(logy_grid)

    ymean = mean(ygrid) .+ 0 * ygrid
    def_y = min(0.969 * ymean, ygrid)

    Vd = zeros(ny, 1)
    Vc = zeros(ny, nB)
    V = zeros(ny, nB)
    Q = ones(ny, nB) * .95

    y = reshape(ygrid, (ny, 1, 1))
    B = reshape(Bgrid, (1, nB, 1))
    Bnext = reshape(Bgrid, (1, 1, nB))

    zero_ind = Int(ceil(nB / 2))

    function u(c, γ)
        return c.^(1 - γ) / (1 - γ)
    end


    t0 = time()
    function iterate(V, Vc, Vd, Q)
        EV = Py * V
        EVd = Py * Vd
        EVc = Py * Vc

        Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:])
        Vd_target = reshape(Vd_target, (ny, 1))

        Qnext = reshape(Q, (ny, 1, nB))

        c = @. y .- Qnext .* Bnext .+ B
        c[c .<= 0] = 1e-14 .+ 0 * c[c .<= 0]
        EV = reshape(EV, (ny, 1, nB))
        m =  @. u(c, γ) .+ β * EV
        Vc_target = reshape(maximum(m, dims=3), (ny, nB))

        Vd_compat = Vd * ones(1, nB)
        default_states = float(Vd_compat .> Vc)
        default_prob = Py * default_states
        Q_target = (1. .- default_prob) ./ (1 + r)

        V_target = max.(Vc, Vd_compat)

        return V_target, Vc_target, Vd_target, Q_target

    end

    iterate(V, Vc, Vd, Q)  # warmup
    t0 = time()
    for iteration in 1:repeats
        V, Vc, Vd, Q = iterate(V, Vc, Vd, Q)
    end
    t1 = time()
    out = (t1 - t0) / repeats

end

function main_with_inlining(nB, repeats)#, logy_grid, Py)
    β = .953
    γ = 2.
    r = 0.017
    θ = 0.282
    ny = size(logy_grid, 1)

    Bgrid = LinRange(-.45, .45, nB)
    ygrid = exp.(logy_grid)

    ymean = mean(ygrid) .+ 0 * ygrid
    def_y = min(0.969 * ymean, ygrid)

    Vd = zeros(ny, 1)
    Vc = zeros(ny, nB)
    V = zeros(ny, nB)
    Q = ones(ny, nB) * .95

    y = reshape(ygrid, (ny, 1, 1))
    B = reshape(Bgrid, (1, nB, 1))
    Bnext = reshape(Bgrid, (1, 1, nB))

    zero_ind = Int(ceil(nB / 2))

    function u(c, γ)
        return c.^(1 - γ) / (1 - γ)
    end


    t0 = time()
    @inline function iterate(V, Vc, Vd, Q)
        EV = Py * V
        EVd = Py * Vd
        EVc = Py * Vc

        Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:])
        Vd_target = reshape(Vd_target, (ny, 1))

        Qnext = reshape(Q, (ny, 1, nB))

        # c = @. max(1e-14, y - Qnext * Bnext + B)
        c = @. y - Qnext * Bnext + B  |>  x -> x > 0.0 ? x : 1e-14
        EV = reshape(EV, (ny, 1, nB))
        m =  @. u(c, γ) + β * EV
        Vc_target = reshape(maximum(m, dims=3), (ny, nB))

        Vd_compat = Vd * ones(1, nB)
        default_states = float(Vd_compat .> Vc)
        default_prob = Py * default_states
        Q_target = (1. .- default_prob) ./ (1 + r)

        V_target = max.(Vc, Vd_compat)

        return V_target, Vc_target, Vd_target, Q_target

    end

    iterate(V, Vc, Vd, Q)  # warmup
    t0 = time()
    for iteration in 1:repeats
        V, Vc, Vd, Q = iterate(V, Vc, Vd, Q)
    end
    t1 = time()
    out = (t1 - t0) / repeats

end

n = 2
m = 151

println(1000 * main_original(m, n))
println(1000 * main_with_inlining(m, n))

n = 2
mm = [151,351,551,751,951,1151,1351,1551]
res = zeros(length(mm), 2)
for (i, m) in enumerate(mm)
    res[i, 1] = main_original(m,n)
    res[i, 2] = main_with_inlining(m,n)
    display(Int.(round.(res*1000)))
end

Result (right column is with inlining):

   189   105
   814   390
  1920   873
  3611  1639
  5760  2455
  8327  3546
 11313  4836
 14961  6321
2 Likes

Disclaimer: this is getting very much off topic and should maybe be split to a separate thread.

I am actually surprised that Strided manages to speed up such a simple operation by involving more threads. I would have thought that this operation would be mostly memory bound.

Anyway, there are two settings in which Strided can speed up array operations:

  • many computations per iteration, by using multithreading
  • memory-unfriendly access patterns (permutedims and friends), even without threads

However, the microkernel in Strided is not very well optimized, and would certainly benefit from the work of @elrod on exploiting vectorization, to yield even further speedups. Unfortunately I don’t think I can just plug a simple @avx decoration into the current kernel, because at that point the operation has already been dissected into manually incrementing (linear) indices with appropriate strides. I am unfortunately not very familiar with the low-level vector instructions and how to call them from within Julia.

Maybe some collaboration could be fruitful, @elrod?

6 Likes

After several unsuccessful optimization attempts, I noticed something.
In the “warmup” run, compute_price(Spot, σ, K, r) is called with arguments of type (Int, Float64, Int, Float64).
In dP_dS = (compute_price(Spot + ξ, σ, K, r) - P) / ξ, the arguments of compute_price are (Float64, Float64, Int, Float64), and so on.
So the time reported by the script still includes compilation time.
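A toy illustration of that effect (not the option-pricing code itself):

f(x, y) = x * y

@time f(1, 2.0)     # (Int, Float64): first call, includes compilation
@time f(1.5, 2.0)   # (Float64, Float64): new signature, compiles again
@time f(2.5, 3.0)   # same signature as the previous call: no compilation this time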

1 Like

I tried changing Spot, σ, K, r so they are all floats, but still no speedup.
Any luck @Vasily_Pisarev?

1 Like

Some speedup from making all the globals const and replacing line 20 with

@. @views B[:,n] = 2 * x * B[:,n - 1] - B[:,n - 2]

But that still ends up slower than numpy.
From the CPU usage I’d conclude that “idiomatic” numpy is just better at passing the heavy lifting to BLAS than “idiomatic” Julia.
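One thing worth ruling out before blaming the language is a BLAS thread-count mismatch between the two setups; a quick check on the Julia side (a sketch, not from the benchmark script):

using LinearAlgebra

BLAS.set_num_threads(Sys.CPU_THREADS)   # give OpenBLAS the same core count numpy's BLAS gets
LinearAlgebra.peakflops(4000)           # rough measure of the matmul throughput you actually get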

1 Like

I think a lot of it is that multithreading is only a few months old in Julia, so not much of Base takes advantage of it yet. If we could get broadcasting to use threads, that would be huge.
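Until then, here is a hand-rolled sketch of what threaded broadcasting could look like (threaded_add! is just an illustrative name, and this assumes Julia was started with multiple threads, e.g. julia -t 4):

using Base.Threads

function threaded_add!(C, A, B)
    @threads for j in axes(A, 2)             # split the work across columns
        @views C[:, j] .= A[:, j] .+ B[:, j]
    end
    return C
end

A = rand(4000, 4000); B = rand(4000, 4000); C = similar(A);
threaded_add!(C, A, B);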

15 Likes

Abstracting from the specific code, I would like to highlight the endogeneity problem here - generalising statements such as the premise of this topic are exactly what causes such topics :slight_smile:

Yes, there are (probably many) instances where Julia is faster than Matlab (or other languages), but the premise that it should “usually” be faster is misleading (especially in the context of this topic, where it appears as if it should “always” be faster). I say that because I was also misled: when I started learning Julia I was expecting all those gains in speed, even for relatively simple tasks, exactly because of the Villaverde paper and the comparisons on the website.

If you did DSGE models with VFI as in the mentioned paper AND it came out slower than comparable languages it would indeed be interesting and puzzling. However, once you start doing other things it becomes much more complicated to predict which is faster.

I work in time series, especially Bayesian time series econometrics, and spend a lot of time waiting for my code. I tried switching to Julia only to find that it is slower than Matlab in my cases as well. I then looked around and asked experienced Julia programmers (in person, but I think I also had some discussions here) for help, and in my case it turned out it cannot get faster, since it all depends on the BLAS in the background.

Remember also, that code might be slower for two reasons:

  1. the language is slower
  2. the code is not written well for the language.

No. 2 is BIG for me: I have over 13 years of experience with Matlab and at best half a year in Julia. I am sure I write suboptimal Julia code because I am plagued by habits that are necessary in Matlab but unneeded here, and because I don’t know the best way to do things in Julia. To compare 1) objectively, you need to write the code in the best way possible for each language and account for the time spent writing the code in that manner.

All I am saying is that it appears you think Julia should be faster per se, exactly as I did, and I think we should watch out, as this sets expectations for new users.

12 Likes

Porting from Matlab to Julia is the most difficult one, since Matlab has put crazy effort into making programming patterns that are toxic to all other languages super fast… :wink:

I usually say that in almost all cases, Julia can get to machine performance, like C, and if it doesn’t, it is a bug.
That of course says nothing about the effort needed to achieve this… Of course, if you have a language that is super slow, but which happens to offer a fast C library exactly tailored to your problem, your effort to get state-of-the-art performance in that slow language will be 0, and non-zero in any language that doesn’t have the same library handy :wink:
The only valid point about Julia in that context is that, given enough time and effort, it should (almost) always be possible to reach the best performance.

20 Likes

I actually think that the premise that Julia should “usually” be faster is entirely accurate (after accounting for compilation time). It just so happens that the case of everything being reduced to BLAS calls is one of a handful of cases where “usually” doesn’t apply. The other cases are usually when you’re doing something sufficiently easy that Numpy, Numba, Matlab JIT, etc. can optimize their respective languages’ code to close to C performance as well. These cases do happen, but I wouldn’t say they’re “usual.” Then again, there’s a certain selection bias. Many people using Julia are using Julia precisely because their problem is too hard for Python or Matlab to optimize effectively, and many people not using Julia stayed with Python or Matlab because that wasn’t the case for them.

(Also, recent developments have convinced me that it’s quite possible that someday there will be a Julia-based BLAS competitive with MKL, so even the BLAS case may get better!)

5 Likes

In that case, can you help figure out why the option pricing example is slower in Julia than Matlab/Numpy/etc.? @inline doesn’t seem to help.

1 Like

Re: https://github.com/vduarte/benchmarkingML/blob/master/LSMC/Julia.jl

const Spot = 36
const σ = 0.2
const n = 100000
const m = 10
const K = 40
const r = 0.06
const T = 1
const order = 25
const Δt = T / m

@StefanKarpinski If I make the above global variables const in the original code it runs more slowly.

@Albert_Zevelev If I make two versions of your code with and without const variables (as above) the version with the const globals is slower most of the time. However, your code is significantly faster than the original. (I agree with your commented out lines).

1 Like

This is the reason I don’t really like the way people market julia. The way people talk about julia’s performance gives many people the impression that they can walk over, write their (sometimes horrific) matlab / python code in julia (while expecting minimal syntax / language differences), and see order-of-magnitude performance improvements.

Sometimes naïvely written julia is crazy fast. Sometimes it’s not. The difference is that in julia you actually can buckle down, look at a performance hotspot and tune it to C speeds without ever leaving julia.

High performance julia code can often be completely unrecognizable as ‘julia’ code and might be more reminiscent of Fortran or C++ than Python (though, I’d argue that it’s easier to learn to write high performance julia code if you already know julia than it is to learn to write Fortran or C++).

However, the situation is still better than stuff like Python + Cython or Matlab + calling out to Fortran routines. Even though some ‘high performance’ julia code is unrecognizable to humans, the compiler still sees it as just julia code. Having all code be seen by the same compiler enables certain optimizations (interprocedural optimizations) that aren’t possible in other paradigms and, more importantly, it enables much more generic code and allows people to make amazing tools like Cassette.jl, or language-wide automatic differentiation, and have them just work on practically any package.

13 Likes