Julia slower than Matlab & Python? No

Disclaimer: this is getting very much off topic and should maybe be split to a separate thread.

I am actually surprised that Strided manages to speed up such a simple operation by involving more threads. I would have thought that this operation would be mostly memory bound.

Anyway, there are two settings in which Strided can speed up array operations:

  • many computations per iteration, by using multithreading
  • memory-unfriendly access patterns (permutedims and friends), even without threads (see the sketch below)
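
To illustrate the second point, here is a minimal sketch in the spirit of the Strided.jl README (assuming Strided and BenchmarkTools are installed and Julia was started with more than one thread):

using Strided, BenchmarkTools

A = randn(4000, 4000); B = similar(A);

# broadcasting over a transpose is a memory-unfriendly access pattern
@btime $B .= ($A .+ $A') ./ 2;             # plain fused broadcast
@btime @strided $B .= ($A .+ $A') ./ 2;    # Strided blocks the loops in a more cache-friendly order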

However, the microkernel in Strided is not very well optimized and would certainly benefit from the work of @elrod on exploiting vectorization, to yield even further speedups. Unfortunately, I don’t think I can just plug a simple @avx decoration into the current kernel, because at that point the operation has already been dissected into manually incrementing (linear) indices with the appropriate strides. I am unfortunately not very familiar with the low-level vector instructions and how to call them from within Julia.

Maybe some collaboration could be fruitful, @elrod?

6 Likes

After several unsuccessful optimization attempts, I noticed something.
In the “warmup” run compute_price(Spot, σ, K, r) is called on (Int, Float64, Int, Float64).
In dP_dS = (compute_price(Spot + ε, σ, K, r) - P) / ε, the arguments of compute_price are (Float64, Float64, Int, Float64) etc.
So, the time reported by the script still includes compilation time.
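
A minimal illustration of that effect with a made-up function (not the benchmark code): each new combination of argument types triggers a fresh compilation, so a warmup call only helps if it uses the same argument types as the timed call.

f(x, y) = x * y + y

@time f(1, 2.0)     # (Int, Float64): compiles this specialization, then runs
@time f(1.0, 2.0)   # (Float64, Float64): new argument types, compiles again
@time f(1.0, 2.0)   # same types as the previous call: no compilation, only runtime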

1 Like

I tried changing Spot, σ, K, r so they are all float, but still no speed up.
Any luck @Vasily_Pisarev?

1 Like

Some speedup from making all the globals const and replacing line 20 with

@. @views B[:,n] = 2 * x * B[:,n - 1] - B[:,n - 2]

But that still ends up slower than numpy.
From the CPU usage I’d conclude that “idiomatic” numpy is just better at passing the heavy lifting to BLAS than “idiomatic” Julia.
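
For what it’s worth, one way to check that the heavy lifting really does reach BLAS on the Julia side is to call it explicitly; a generic sketch, not tied to this particular script:

using LinearAlgebra

X = rand(2000, 2000); Y = rand(2000, 2000); Z = similar(X);
mul!(Z, X, Y)   # in-place matrix multiply, dispatches straight to BLAS gemm with no temporaries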

1 Like

I think a lot of it is that multithreading is only a few months old in Julia, so not much of Base takes advantage of it yet. If we could get broadcasting to use threads, that would be huge.
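
Until then, a hand-rolled stand-in is possible. A minimal sketch with a hypothetical helper (assumes Julia was started with several threads, e.g. JULIA_NUM_THREADS=4):

function threaded_axpy!(y, a, x)
    # split the broadcast-style update y .= a .* x .+ y across the available threads
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return y
end

y = rand(10^7); x = rand(10^7);
threaded_axpy!(y, 2.0, x)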

15 Likes

Abstracting from the specific code, I would like to highlight the endogeneity problem here: generalising statements such as the premise of this topic is exactly what causes such topics :slight_smile:

Yes, there are (probably many) instances where Julia is faster than Matlab (or other languages), but the premise that it should “usually” be faster is misleading (especially in the context of this topic, where it appears as if it should “always” be faster). I say that because I was also misled: when I started learning Julia I was expecting all those gains in speed, even for relatively simple tasks, exactly because of the paper of Villaverde and the comparisons on the website.

If you did DSGE models with VFI as in the mentioned paper AND it came out slower than in comparable languages, that would indeed be interesting and puzzling. However, once you start doing other things, it becomes much more complicated to predict which is faster.

I work in time series, especially Bayesian time series econometrics, and spend a lot of time waiting for my code. I tried switching to Julia, only to find that it is slower than Matlab in my cases too. I then looked around and asked experienced Julia programmers (in person, but I think I also had some discussions here) for help, and in my case it turned out it cannot get faster: it all depends on the BLAS in the background.

Remember also, that code might be slower for two reasons:

  1. the language is slower
  2. the code is not written well for the language.

No. 2 is BIG for me: I have over 13 years of experience with Matlab and at best half a year in Julia. I am sure I write suboptimal Julia code, because I am plagued by habits that are necessary in Matlab but unneeded here, and because I don’t know the best way to do things in Julia. To compare no. 1 objectively, you need to write the code in the best way possible for each language and account for the time spent writing the code in that manner.

All I am saying is that it appears that you think Julia should be faster per se, exactly as I did, and I think we should watch out, as this sets expectations for new users.

12 Likes

Porting from Matlab to Julia is the most difficult case, since Matlab has put crazy effort into making programming patterns that are toxic to all other languages super fast… :wink:

I usually say that in almost all cases, Julia can get to machine performance like C, and if it doesn’t, it is a bug.
That of course says nothing about the effort needed to achieve this… Of course, if you have a language that is super slow but which happens to offer a fast C library exactly tailored to your problem, your effort to get state-of-the-art performance in that slow language will be zero, and nonzero in any language that doesn’t have the same library handy :wink:
The only valid point about Julia in that context is that, given enough time and effort, it should (almost) always be possible to reach the best performance.

20 Likes

I actually think that the premise that Julia should “usually” be faster is entirely accurate (after accounting for compilation time). It just so happens that the case of everything being reduced to BLAS calls is one of a handful of cases where “usually” doesn’t apply. The other cases are usually when you’re doing something sufficiently easy that Numpy, Numba, Matlab JIT, etc. can optimize their respective languages’ code to close to C performance as well. These cases do happen, but I wouldn’t say they’re “usual.” Then again, there’s a certain selection bias. Many people using Julia are using Julia precisely because their problem is too hard for Python or Matlab to optimize effectively, and many people not using Julia stayed with Python or Matlab because that wasn’t the case for them.

(Also, recent developments have convinced me that it’s quite possible that someday there will be a Julia-based BLAS competitive with MKL, so even the BLAS case may get better!)

5 Likes

In that case, can you help figure out why the option pricing example is slower in Julia than Matlab/Numpy/etc.? @inline doesn’t seem to help.

1 Like

Re: https://github.com/vduarte/benchmarkingML/blob/master/LSMC/Julia.jl

const Spot = 36
const σ = 0.2
const n = 100000
const m = 10
const K = 40
const r = 0.06
const T = 1
const order = 25
const Δt = T / m

@StefanKarpinski If I make the above global variables const in the original code it runs more slowly.

@Albert_Zevelev If I make two versions of your code, with and without const variables (as above), the version with the const globals is slower most of the time. However, your code is significantly faster than the original. (I agree with your commented-out lines.)

1 Like

This is the reason I don’t really like the way people market julia. The way people talk about julia’s performance gives many people the impression that they can walk over, write their (sometimes horrific) matlab / python code in julia (while expecting minimal syntax / language differences) and see order-of-magnitude performance improvements.

Sometimes naïvely written julia is crazy fast. Sometimes it’s not. The difference is that in julia you actually can buckle down, look at a performance hotspot and tune it to C speeds without ever leaving julia.

High performance julia code can often be completely unrecognizable as ‘julia’ code and might be more reminiscent of Fortran or C++ than Python (though, I’d argue that it’s easier to learn to write high performance julia code if you already know julia than it is to learn to write Fortran or C++).

However, the situation is still better than with Python + Cython or Matlab + calling out to Fortran routines. Even though some ‘high performance’ julia code is unrecognizable to humans, the compiler still sees it as just julia code. Having all code seen by the same compiler enables certain optimizations (interprocedural optimizations) that aren’t possible in other paradigms, and more importantly, it enables much more generic code and allows people to make amazing tools like Cassette.jl, or language-wide automatic differentiation, and have them just work on practically any package.

13 Likes

Does the function first_one just set every entry to zero except the first positive one in every row?
If so, this seems like a rather complicated way to compute it. What about doing it in place (operating directly on the transpose)?

function first_one!(x)
    for col in 1:size(x, 2)
        foundfirst = false
        @inbounds for row in 1:size(x, 1)
            if foundfirst || x[row, col] <= 0.
                x[row, col] = 0.
            else
                foundfirst = true
            end
        end
    end
    x
end

Yes, but your code doesn’t do the same thing; possibly a typo?

It operates on the columns, but I think it does the same thing otherwise:

julia> x = randn(4, 5);

julia> first_one(x')'
4×5 Adjoint{Float64,Array{Float64,2}}:
 0.533999  -0.0       0.0870617   1.82947  -0.0     
 0.0       -0.0      -0.0         0.0       0.926004
 0.0        1.26301  -0.0        -0.0       0.0     
 0.0       -0.0       0.0         0.0      -0.0     

julia> first_one!(x)
4×5 Array{Float64,2}:
 0.533999  0.0      0.0870617  1.82947  0.0     
 0.0       0.0      0.0        0.0      0.926004
 0.0       1.26301  0.0        0.0      0.0     
 0.0       0.0      0.0        0.0      0.0    

If you really want it to do the same thing it is pretty easy to change my code.

Btw. I think the original first_one function is a nice example of how complicated it can be to write vectorized code for something that is very easy to achieve with for loops.

2 Likes

You’re right. When I replace rows with columns etc. it works. It even runs an order of magnitude faster than first_one(). However, the entire program is still slower than Matlab :expressionless:

1 Like

I think that, wherever possible, it would be better to use ifelse than the where() style defined in the benchmark code.

using BenchmarkTools

function where(cond, value_if_true, value_if_false)
    out = value_if_true .* cond + .!cond .* value_if_false
end

c = rand(Bool, 10^5);
a = rand(10^5)
b = rand(10^5)

@btime where(c, a, b) # 223.700 μs (6 allocations: 2.29 MiB)

@btime ifelse.(c, a, b) # 65.100 μs (4 allocations: 781.38 KiB)
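
If the output can go into a preallocated buffer, the broadcast can also be made allocation-free; a further variant, not part of the original benchmark:

out = similar(a);
@btime $out .= ifelse.($c, $a, $b);   # fuses into a single loop, 0 allocations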

So, if I replace where with ifelse and also put @strided in B[:,n] = @strided 2 .* x .* B[:,n - 1] .- B[:,n - 2], I see some reduction in time on my Windows machine:

1702.0001411437988 (without these changes it is 2285.0000858306885)

Juno.@profiler shows that most of the time is spent on B[:,n] = @strided 2 .* x .* B[:,n - 1] .- B[:,n - 2] in chebyshev_basis() and where

while the numpy version still wins by a slight margin on average
(benchmarkingML/python_numpy.py at master · vduarte/benchmarkingML · GitHub):

1610.6185913085938

3 Likes

Well, first_one was just one example where the Matlab/numpy style of vectorizing everything leads to slow code that I find more difficult to read than the version with for loops. There are probably still quite a few other unnecessary allocations. I guess already sprinkling a few @views into compute_price helps a bit (see below), but I am confident that you can get it even faster. It’s a good exercise to follow the performance tips of the Julia manual carefully :smiley:.

function compute_price(Spot, σ, K, r, n, m, Δt, order)
    Random.seed!(0)
    S = zeros(n, m + 1)
    S[:, 1] .= Spot
    for t = 1:m
        @views S[:, t + 1] .= advance(S[:, t], r, σ, Δt, n)
    end
    discount = exp(-r * Δt)
    CFL = max.(0, K .- S)
    value = zeros(n, m)
    @. @views value[:, end] = CFL[:, end] * discount
    CV = zeros(n, m)
    for k = 1:m-1
        t = m - k
        t_next = t + 1
        @views XX = chebyshev_basis(scale(S[:, t_next]), order)
        @views YY = value[:, t_next]
        CV[:, t] .= ridge_regression(XX, YY, 100)
        @views value[:, t] = discount * w.(CFL[:, t_next] .> CV[:, t], CFL[:, t_next], value[:, t_next])
        #value[:, t] = discount * where(CFL[:, t_next] .> CV[:, t], CFL[:, t_next], value[:, t_next])
    end
    @views POF = ifelse.(CV .< CFL[:, 2:end], CFL[:, 2:end], 0 * CFL[:, 2:end])';
    FPOF = first_one!(POF)
    m_range = collect(range(0; stop=m-1, step = 1))
    dFPOF = @. (FPOF*exp(-r*m_range*Δt))   #(FPOF.*exp.(-r*m_range*Δt))
    PRICE = sum(dFPOF) / n  ##mean(dFPOF)
    return PRICE
end
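
One more small allocation that could probably be shaved off there (an untested suggestion on my part): the 0 * CFL[:, 2:end] term materializes an entire zero matrix just to feed ifelse., while a scalar zero broadcasts just as well:

@views POF = ifelse.(CV .< CFL[:, 2:end], CFL[:, 2:end], 0.0)'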

2 Likes

I have not run the code, but if you manually force inlining of chebyshev_basis, you could potentially alleviate the allocations of the views inside that function. Since those occur inside multiple loops, they might add up.

1 Like

Why not go one step further and make it mutating?

function chebyshev_basis!(x, B)
    B[:,2] .= x
    for n in range(3, stop = size(B,2))
        @. @views B[:,n] = 2 * x * B[:,n - 1] - B[:,n - 2]
    end
end

function scale!(x)
    xmin,xmax = extrema(x)
    a = 2. / (xmax - xmin)
    b = -0.5 * a * (xmin + xmax)
    @. x = a * x + b
end

with the following in compute_price:

[...]
 chebyshev_scratchpad = ones(n, order)
    for k = 1:m-1
        t = m - k
        t_next = t + 1

        stmp = view(S, :, t_next)
        scale!(stmp)
        chebyshev_basis!(stmp, chebyshev_scratchpad)
[...]

Mutating works because the first column always remains 1, so we can reuse it, and the rest is overwritten anyway.

That change alone gives me

5:  1076.220989227295
10:  1522.414207458496
25:  2239.0458583831787

vs. the original:

5:  1411.3590717315674
10:  1967.1061038970947
25:  3036.3519191741943

5 Likes

I just came to know that putting @views in front of the function definition does the job all at once. I’m curious why array indexing in Julia returns a copy by default rather than a view, unlike numpy.
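
For reference, a minimal sketch of the copy-versus-view distinction and of @views applied to a whole function body (the function names here are made up):

using BenchmarkTools

A = rand(1000, 1000);

col_copy(A) = sum(A[:, 1])      # A[:, 1] allocates a fresh vector before summing

@views function col_view(A)     # @views rewrites every slice in the body into a view
    return sum(A[:, 1])
end

@btime col_copy($A);
@btime col_view($A);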

1 Like