Julia slower than Matlab & Python? No

@Albert_Zevelev I looked a little at TensorFlow and the ridge regression. The TensorFlow code is likely using a different algorithm than the others. If the loss function is set up manually, then the standard ML loss minimization would be some sort of iterative method.

For regularized least squares solutions with big setups, there are far better ways to do it in Julia, where you could even match the TensorFlow algorithm exactly if you wished (or find more appropriate algorithms). It might even be less code and allow GPU access, as @Mason has asked for.
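For example, the closed-form ridge solve used later in this thread is a one-liner in plain Julia; for large problems the same call site could be backed by a GPU array or an iterative solver (an illustrative sketch, not the benchmark code):

using LinearAlgebra

# Closed-form ridge regression via the regularized normal equations.
# For very large problems, X could be a GPU array or the solve replaced
# by an iterative method without changing the call site.
ridge(X, y, λ) = (X'X + λ * I) \ (X'y)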

2 Likes

@jlperla thanks for the suggestions.

It looks like the authors took vectorized code from Matlab & Numpy, then translated it to Julia & found Julia under-performs.

Using the above suggestions, I’ve found that if you simply replace the awkward/complicated vectorized parts of the code with simpler/more intuitive loops, then Julia outperforms all the other languages except TF & PyTorch (which I don’t have right now).
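To make the kind of rewrite concrete, here is a toy sketch based on the chebyshev_basis recurrence from the benchmark (the function names are just illustrative):

# Vectorized translation: B[:, n-1] and B[:, n-2] on the right-hand side each
# copy a whole column before the arithmetic runs.
function cheb_vectorized!(x, B)
    for n in 3:size(B, 2)
        B[:, n] = 2 .* x .* B[:, n - 1] .- B[:, n - 2]
    end
    return B
end

# Same recurrence as an explicit loop: indexes directly into B, no copies.
function cheb_loop!(x, B)
    for n in 3:size(B, 2), i in eachindex(x)
        B[i, n] = 2 * x[i] * B[i, n - 1] - B[i, n - 2]
    end
    return B
end

The loop version also makes it obvious where @inbounds or @views could be applied later.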

If I add some of the other suggestions given here, the code becomes harder to read & only ~20% faster.

It’s a miracle that in Julia the simpler, more intuitive, & less bug prone code is also faster than numpy/matlab.

8 Likes

That is great. Can you post your full code (including the ridge_regression and chebyshev_basis functions) for people to look at, with and without the 20% improvement? The general feeling that I (and I know Jesus) have is that we are better off showcasing clean and intuitive code, and that the last 20% is rarely worth the trouble. But an order of magnitude is!

I’m pretty sure they explicitly mention in the paper that this is what they’re measuring: performance for code written like that.

That doesn’t mean you have to avoid basic performance tips on globals etc of course…

2 Likes

I think the biggest takeaway for me from this saga is that at least for situations like this, PyTorch and TensorFlow’s approach of taking naive code and building up a computational graph with it (and then doing analysis on that graph to fuse operations, and efficiently pre-allocate memory, etc.) is quite powerful.

It’d be interesting to see if we could support something like that without the sort of gigantic manpower required by PyTorch / TensorFlow. Some interface like

@graph_analyze begin
    # big chunk of intertwined array operations
end

that goes through and detects cases where views should be used or intermediate arrays should be pre-allocated, checks whether it can prove that certain operations don’t need bounds checking, etc., could be quite interesting and useful.

9 Likes

That’s essentially what ModelingToolkit does, and how we use it in DiffEq.

2 Likes

Can ModelingToolkit do the sorts of things I described above? I was under the impression that it was more about making a DSL for DiffEq stuff. It doesn’t really seem to be concerned with parsing, analyzing and optimizing regular Julia code, unless I’m misunderstanding.

It’s a static computational graph for mathematical problems. The README even shows a non-DiffEq case. We don’t have everything yet, of course (it’s quite new), but it does things like automatically building functions for sparse Jacobians and all of that. We are currently working on making it autoparallelize code too. We plan to add other rewrite rules, which would then be optimizations.

9 Likes

Are there plans for it to be applied to regular Julia code instead of a DSL?

1 Like

@jlperla Here is the code, combined from various users here on Discourse:
Julia Original
Julia Simple
Julia Fastest

“Simple” is much easier to understand than the original code, and for every order size it is faster than Matlab/Numpy/R.
“Fastest” is even faster, but I don’t find it easy to follow.
PS: I’m sure “Fastest” can be optimized a lot more.
PPS: I got PyTorch to work on my Mac (not TF yet).
“Fastest” is faster than PyTorch, which is faster than “Simple”.

Speed on my machine.
[image: benchmark timings]

14 Likes

The difference between the simple and fastest versions you posted is almost completely covered by the Performance Tips section of the Julia manual. I’ve compiled a list of the tips relevant to the speed difference between simple and fastest.

I’ve also annotated the fastest code you linked to in hopes of better explaining the difference between the original code and the fastest version. If you have any questions or if anything about my comments is unclear, please feel free to ask!

Annotated Code
using Statistics, LinearAlgebra, Random, BenchmarkTools

# This function modifies B, i.e. reuses the memory given to us. Allocating new memory
# is slow, so reusing memory is a good idea. Additionally, since slices in julia
# copy by default, using @views allows us to directly index into B for the computation
# instead of having to copy the memory and index into the copy.
# Using .= for the initial values of B[:,2] also allows us to directly write contents
# from x into B and depending on how broadcasting is implemented for those arguments,
# writing may be done in parallel.
function chebyshev_basis!(x, B)
    B[:,2] .= x
    for n in range(3, stop = size(B,2))
        @. @views B[:,n] = 2. * x * B[:,n - 1] - B[:,n - 2]
    end
end

# ridge_regression also used to allocate its return matrix, but if we take a look at
# the caller of this function, we notice that the memory given to this function can be
# reused. Thus, providing an already allocated output matrix allows us to save a
# potentially big chunk of memory.
function ridge_regression!(out, X, Y, λ)
    # This should be possible to also do inplace, somehow
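    # One possible in-place variant (a sketch, not benchmarked here): preallocate
    # XtX and XtY buffers at the call site, accumulate with mul!, add λ on the
    # diagonal, then solve with an in-place factorization, e.g.
    #   mul!(XtX, X', X);  for i in axes(XtX, 1); XtX[i, i] += λ; end
    #   mul!(XtY, X', Y);  ldiv!(cholesky!(Symmetric(XtX)), XtY)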
    β = (X'X + λ * I) \ (X'Y)
    mul!(out, X, β)
end

# This function had an unnecessary conversion from Int to Float64. Because all input
# we use is always Float64, using the floating point literal 2. instead of the integer
# literal 2 we can get rid of that conversion.
# Additionally, creating the output in the function is unnecessary, since we're never
# reusing the input at the callers side. Using the memory given to us instead of
# allocating new memory is more efficient because allocations take time.
function scale!(x)
    xmin,xmax = extrema(x)
    a = 2. / (xmax - xmin)
    b = -0.5 * a * (xmin + xmax)
    @. x = a * x + b
end

# advance was called a lot and allocated a lot of memory during the creation
# of the random vector, as well as during calculation of the output matrix.
# Preallocating both matrices and reusing the memory between calls removes those
# allocations and speeds the function up substantially.
function advance!(out, randScratch, S, r, σ, Δt, n)
    randn!(randScratch)
    # fold sqrt(Δt) into the fused broadcast so no temporary dB vector is allocated
    @. out = S + r * S * Δt + σ * S * sqrt(Δt) * randScratch
end

function compute_price(Spot, σ, K, r, n, m, Δt, order)
    Random.seed!(0)

    # We first have to create the working memory 
    S = zeros(n, m + 1)

    # view(S,:,1) is equivalent to @view S[:,1] - we get a lightweight
    # view of the underlying matrix without copying the data. Writing
    # to this view writes to S, but the view into S can be reused!
    start_view = view(S,:,1)

    # This creates the scratchpad/scratch memory for the random values used in advance
    randScratch = zeros(n)

    # As described above, writing to start_view writes to S. Using `fill!` is
    # faster than creating a new vector, multiplying the vector by `Spot`
    # and assigning the result to S[:,1]. We're saving an allocation again.
    fill!(start_view,Spot)

    # This loop makes use of views into existing memory to allow `advance!` to
    # directly write to S.
    for t = 1:m
        next_view = view(S, :, t+1)
        advance!(next_view, randScratch, start_view, r, σ, Δt, n)
        # we can reuse the views from the last iteration
        start_view = next_view
    end

    # The next few lines have stayed roughly the same, except for the
    # initialisation of value[:,end]. Again, we don't want to create a copy of
    # CFL[:,end], so we instead create a lightweight view into the existing memory
    # and use that to calculate the values for value[:,end].
    discount = exp(-r * Δt)
    CFL = max.(0, K .- S)
    value = zeros(n, m)
    value[:, end] .= view(CFL, :, size(CFL, 2)) .* discount

    # Similar to the scratch memory for the random values from above, we can
    # create scratch memory for `chebyshev_basis!` and reuse the memory between loops.
    CV = zeros(n, m)
    chebyshev_scratchpad = ones(n, order)
    for k = 1:m-1
        t = m - k
        t_next = t + 1

        # We create the current view (again, no copying of data from the underlying S!)
        # and use `scale!` to scale the values _in place, not allocating any new memory_.
        # Afterwards, we pass that view into `chebyshev_basis!` together with the scratch
        # memory to compute the needed values _in place_.  
        stmp = view(S, :, t_next)
        scale!(stmp)
        chebyshev_basis!(stmp, chebyshev_scratchpad)

        # By the same principle, we create a view into `value` and `CV`, using the result
        # from the `chebyshev_basis!` computation to compute the ridge_regression
        # _in place_, allocating as little as possible and reusing as much memory as possible.
        # The original code allocated a lot of memory on almost every line during this part,
        # which made it very slow.
        YY = view(value, :, t_next)
        cvview = view(CV, :, t)
        ridge_regression!(cvview, chebyshev_scratchpad, YY, 100)

        # Another view to once more save allocations and to not have to copy all the data
        # into a new vector. 
        # Looping over the view and using the function ifelse is faster than having to create
        # the copies and slices to facilitate broadcast because we once again don't have to
        # allocate anything. Everything we need is already here.
        cflview = view(CFL, :, t_next)
        for i in axes(value, 1)
            value[i, t] = discount * ifelse(cflview[i] > cvview[i], cflview[i], YY[i])
        end
    end

    # For the last part the original code created a lot of intermediary matrices and
    # transposes of those matrices, which is very expensive because the memory was not
    # being reused at all. Looping over the relevant indices directly and aggregating
    # the desired values into `PRICE` directly is much faster, since we only have to read
    # from the existing memory and add into a known location. Dividing by `n` afterwards
    # also happens at that known location, so that's fast.
    # The speedup here is mostly due to algorithmic improvements, since collapsing the
    # creation and looping of `POF`, `FPOF` and `dFPOF` only to loop over `dFPOF` with
    # `sum(dFPOF)` again is a lot of unnecessary looping.  Just looping once fixes that.
    PRICE = 0.
    @inbounds for i=1:n
        for j=1:m
            if CV[i, j] < CFL[i, j+1] && CFL[i, j+1] > 0.
                PRICE += CFL[i, j+1]*exp(-r*(j-1)*Δt)
                break
            end
        end
    end
    PRICE /= n
#    return PRICE
end
######

# since the benchmarking code is wrapped in the run() function below, 
# we don't need to do "warmup" because the code will be compiled before it is run.
# Executing run() by itself will be enough to not include compile time of compute_price().
# Furthermore, a function is compiled for different argument _types_, not _values_. Executing
# a function with different argument _values_ contributes nothing to "warmup" of the compiler.
compute_price(36., .2, 40., .06, 100000, 10, 1/10, 5)
@benchmark compute_price(36., .2, 40., .06, 100000, 10, 1/10, 5)

# The CPU will have cached some of the values calculated after one execution of `run()`,
# so executing `run()` multiple times may still provide some speedup. This is not exclusive
# to julia. 
# Putting all the benchmarking code into a single function allows the compiler to propagate those
# constant numbers deep down into each and every function.
function run()
    Spot = 36.0
    σ = 0.2
    n = 100000
    m = 10
    K = 40.0
    r = 0.06
    T = 1.
    Δt = T / m
    ε = 1e-2
    for order in [5, 10, 25, 50]
        t0 = time();
        P = compute_price(Spot, σ, K, r, n, m, Δt, order);
        dP_dS = (compute_price(Spot + ε, σ, K, r, n, m, Δt, order) - P) / ε;
        dP_dσ = (compute_price(Spot, σ + ε, K, r, n, m, Δt, order) - P) / ε;
        dP_dK = (compute_price(Spot, σ, K + ε, r, n, m, Δt, order) - P) / ε;
        dP_dr = (compute_price(Spot, σ, K, r + ε, n, m, Δt, order) - P) / ε;
        t1 = time()
        out = (t1 - t0)
        println(order, ":  ", 1000 * out)
    end
end

isinteractive() || run()
# This call is not really necessary - running `julia file.jl` where file.jl is the
# file this code is stored in will execute run() via the line above.
# The `isinteractive() ||` part just prevents `run()` from being executed when
# including this file on the REPL.
run()
9 Likes

What do you mean? It works on normal Julia code. modelingtoolkitize?

https://docs.juliadiffeq.org/latest/tutorials/advanced_ode_example/#Automatic-Derivation-of-Jacobian-Functions-1

2 Likes

Great.

One thing to try, depending on your setup, is whether MKL.jl (https://github.com/JuliaLinearAlgebra/MKL.jl) speeds things up, just so you are comparing apples to apples against numpy/matlab. It may not help, but it is easy to try. @ChrisRackauckas found a tweak to OpenBLAS which may make it comparable to MKL, but it isn’t applied automatically in Julia yet.

It will be interesting to see if TF is faster. If so, we would need to figure out why it is using a different algorithm. There is no fundamental reason (maybe aliasing behavior if it interprets the code?) that TF should be faster on a CPU.

It’s just not automatic. It’s easy to do yourself: LinearAlgebra.BLAS.set_num_threads(n) where n is the correct number (the number of cores). The issue is that the automatic number is incorrect.
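In a session, that might look like the following (the 4 is only an example value for a typical quad-core machine):

using LinearAlgebra

# Tell OpenBLAS to use one thread per physical core instead of the default heuristic.
LinearAlgebra.BLAS.set_num_threads(4)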

4 Likes

Thanks but I can’t get MKL to work on my Macbook:
https://github.com/JuliaComputing/MKL.jl/issues/26

@Albert_Zevelev Try Chris’s suggestion. Basically, the num_threads of OpenBLAS is set to a number which slows things down. @ChrisRackauckas What is the heuristic that sometimes makes it as fast as MKL? The physical number of cores?

1 Like

I mean something that would be relevant or useful to this post.

I had never heard of ModelingToolkit.modelingtoolkitize before (it’s not in the ModelingToolkit docs / tutorial), but it also doesn’t seem to be what I’m talking about. It explicitly only operates on DiffEqBase.ODEProblem, not arbitrary functions.

The point I was making above would be that it’d be neat if there were tools in julia for doing analysis on arbitrary julia code and look for common performance traps and try to fix them like what TensorFlow and PyTorch are doing here. Maybe ModelingToolkit is close to doing that, but it doesn’t seem to do it yet.

It’s set to the number of logical threads, up to a maximum of 8, but should instead be set to the number of physical cores.

Most people have quad cores with hyperthreading, and end up getting the maximum since there are 8 logical threads. Instead set it to 4.

Traceur.jl does that, but does not fix them.
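For reference, its usage is roughly (a minimal sketch; the function here is just a stand-in):

using Traceur

naive(x) = 2x + 1.0   # hypothetical function to check
@trace naive(1)       # prints performance warnings such as type instabilities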

What are you even talking about? I showed you a function that works on an ODE by transforming generic Julia code into a computational graph and analyzing it. Why not just do that on any Julia code?

For example, ForwardDiff2 uses MTK internally for analytical derivatives:

julia> using ModelingToolkit

julia> @variables x[1:3] v[1:3]
(Operation[x₁, x₂, x₃], Operation[v₁, v₂, v₃])

julia> D(sin)(x[1]) * 11
cos(x₁) * 11

julia> D(prod)(x) * I # gradient
1×3 Adjoint{Operation,StaticArrays.SArray{Tuple{3},Operation,1,3}} with indices SOneTo(1)×SOneTo(3):
 conj((identity(1) * x₂ + x₁ * identity(0)) * x₃ + (x₁ * x₂) * identity(0))  …  conj((identity(0) * x₂ + x₁ * identity(0)) * x₃ + (x₁ * x₂) * 1)

julia> D(cumsum)(x) * I # Jacobian
3×3 Adjoint{Operation,Array{Expression,2}}:
                               conj(1)                      conj(identity(0))                      conj(identity(0))
                 conj(identity(0) + 1)                  conj(1 + identity(0))        conj(identity(0) + identity(0))
 conj((identity(0) + identity(0)) + 1)  conj((identity(0) + 1) + identity(0))  conj((1 + identity(0)) + identity(0))

julia> D(cumsum)(x) * [1, 2, 3] # Jacobian-vector product
3-element Array{Int64,1}:
 1
 3
 6

julia> D(cumsum)(x) * v # Jacobian-vector product
3-element Array{Operation,1}:
             v₁
        v₂ + v₁
 (v₃ + v₂) + v₁

julia> DI(DI(prod))(x) # Hessian
3×3 StaticArrays.SArray{Tuple{3,3},Operation,2,9} with indices SOneTo(3)×SOneTo(3):
           conj(((identity(1) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(1) * x₂ + x₁ * identity(0)) * identity(0) + x₃ * ((identity(1) * identity(identity(0)) + x₁ * 0) + (identity(1) * identity(0) + x₂ * 0))))  …            conj(((identity(0) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(1) * x₂ + x₁ * identity(0)) * identity(1) + x₃ * ((identity(0) * identity(identity(0)) + x₁ * 0) + (identity(1) * identity(0) + x₂ * 0))))
           conj(((identity(1) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * 1) * identity(0) + x₃ * ((identity(1) * identity(1) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0))))               conj(((identity(0) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * 1) * identity(1) + x₃ * ((identity(0) * identity(1) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0))))
 conj(((identity(1) * x₂ + x₁ * identity(0)) * identity(1) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * identity(0)) * identity(0) + x₃ * ((identity(1) * identity(identity(0)) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0))))     conj(((identity(0) * x₂ + x₁ * identity(0)) * identity(1) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * identity(0)) * identity(1) + x₃ * ((identity(0) * identity(identity(0)) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0))))

julia> DI(DI(prod))(x) * v # Hessian-vector product
3-element StaticArrays.SArray{Tuple{3},Operation,1,3} with indices SOneTo(3):
                               (conj(((identity(1) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(1) * x₂ + x₁ * identity(0)) * identity(0) + x₃ * ((identity(1) * identity(identity(0)) + x₁ * 0) + (identity(1) * identity(0) + x₂ * 0)))) * v₁ + conj(((identity(0) * x₂ + x₁ * 1) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(1) * x₂ + x₁ * identity(0)) * identity(0) + x₃ * ((identity(0) * identity(identity(0)) + x₁ * 0) + (identity(1) * identity(1) + x₂ * 0)))) * v₂) + conj(((identity(0) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(1) * x₂ + x₁ * identity(0)) * identity(1) + x₃ * ((identity(0) * identity(identity(0)) + x₁ * 0) + (identity(1) * identity(0) + x₂ * 0)))) * v₃
                               (conj(((identity(1) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * 1) * identity(0) + x₃ * ((identity(1) * identity(1) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0)))) * v₁ + conj(((identity(0) * x₂ + x₁ * 1) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * 1) * identity(0) + x₃ * ((identity(0) * identity(1) + x₁ * 0) + (identity(identity(0)) * identity(1) + x₂ * 0)))) * v₂) + conj(((identity(0) * x₂ + x₁ * identity(0)) * identity(identity(0)) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * 1) * identity(1) + x₃ * ((identity(0) * identity(1) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0)))) * v₃
 (conj(((identity(1) * x₂ + x₁ * identity(0)) * identity(1) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * identity(0)) * identity(0) + x₃ * ((identity(1) * identity(identity(0)) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0)))) * v₁ + conj(((identity(0) * x₂ + x₁ * 1) * identity(1) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * identity(0)) * identity(0) + x₃ * ((identity(0) * identity(identity(0)) + x₁ * 0) + (identity(identity(0)) * identity(1) + x₂ * 0)))) * v₂) + conj(((identity(0) * x₂ + x₁ * identity(0)) * identity(1) + (x₁ * x₂) * 0) + ((identity(identity(0)) * x₂ + x₁ * identity(0)) * identity(1) + x₃ * ((identity(0) * identity(identity(0)) + x₁ * 0) + (identity(identity(0)) * identity(0) + x₂ * 0)))) * v₃

Use build_function on that output to build a fast Julia function for the Hessian.
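A rough sketch of how that might look (the exact return shape of build_function for array expressions is an assumption here, so check the ModelingToolkit docs):

H = DI(DI(prod))(x)                      # the symbolic Hessian from above
hess_oop, hess_ip = build_function(H, x) # assumed: (out-of-place, in-place) expressions
fast_hessian = eval(hess_oop)            # compile the out-of-place version
fast_hessian([1.0, 2.0, 3.0])            # numeric Hessian of prod at (1, 2, 3)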

(BTW, this is the function we are optimizing for automatic parallelism, so this example will get automatic multithreading in the near future).

It’s literally just a static computational graph language. It seems like you really really don’t understand it, so you might want to look into it more before commenting on what it can do…

Literally as simple as:

  1. Make variables
  2. Put them into normal Julia code
  3. Static graph comes out
  4. Do things with the static graph

5 Likes