Julia slower than Matlab & Python? No

Does the function first_one just set every entry to zero except the first positive one in every row?
If so, this seems like a rather complicated way to compute it. What about an in-place version (operating directly on the transpose)?

function first_one!(x)
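    # keep only the first positive entry in each column and zero out everything else, in place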
    for col in 1:size(x, 2)
        foundfirst = false
        @inbounds for row in 1:size(x, 1)
            if foundfirst || x[row, col] <= 0.
                x[row, col] = 0.
            else
                foundfirst = true
            end
        end
    end
    x
end

Yes, but your code doesn’t do the same thing; possibly a typo?

It operates on the columns, but I think it does the same thing otherwise:

julia> x = randn(4, 5);

julia> first_one(x')'
4×5 Adjoint{Float64,Array{Float64,2}}:
 0.533999  -0.0       0.0870617   1.82947  -0.0     
 0.0       -0.0      -0.0         0.0       0.926004
 0.0        1.26301  -0.0        -0.0       0.0     
 0.0       -0.0       0.0         0.0      -0.0     

julia> first_one!(x)
4×5 Array{Float64,2}:
 0.533999  0.0      0.0870617  1.82947  0.0     
 0.0       0.0      0.0        0.0      0.926004
 0.0       1.26301  0.0        0.0      0.0     
 0.0       0.0      0.0        0.0      0.0    

If you really want it to do the same thing, it is pretty easy to change my code.

By the way, I think the original first_one function is a nice example of how complicated it can be to write vectorized code for something that is very easy to achieve with for loops.

2 Likes

You’re right. When I replace rows with columns etc. it works. It even runs an order of magnitude faster than first_one(). However, the entire program is still slower than Matlab :expressionless:

1 Like

I think that, wherever possible, it would be better to use ifelse rather than the where()-style function defined in the benchmark code.

using BenchmarkTools

function where(cond, value_if_true, value_if_false)
    out = value_if_true .* cond + .!cond .* value_if_false
end

c = rand(Bool, 10^5);
a = rand(10^5)
b = rand(10^5)

@btime where(c, a, b) # 223.700 μs (6 allocations: 2.29 MiB)

@btime ifelse.(c, a, b) # 65.100 μs (4 allocations: 781.38 KiB)

So, if I replace where with ifelse and put @strided in B[:,n] = @strided 2 .* x .* B[:,n - 1] .- B[:,n - 2], I see some reduction in time on my Windows machine:

1702.0001411437988 (without such changes it is 2285.0000858306885)

Juno.@profiler shows that most of the time is spent on B[:,n] = @strided 2 .* x .* B[:,n - 1] .- B[:,n - 2] in chebyshev_basis() and in where,

while the numpy version still wins by a small margin on average
(benchmarkingML/python_numpy.py at master · vduarte/benchmarkingML · GitHub):

1610.6185913085938

3 Likes

Well, first_one was just one example where the Matlab/numpy style of vectorizing everything leads to slow code that I find more difficult to read than the version with for loops. There are probably still quite a few other unnecessary allocations. I guess sprinkling a few @views into compute_price already helps a bit (see below), but I am confident that you can get it even faster. It’s a good exercise to follow the performance tips in the Julia manual carefully :smiley:.

function compute_price(Spot, σ, K, r, n, m, Δt, order)
    Random.seed!(0)
    S = zeros(n, m + 1)
    S[:, 1] .= Spot
    for t = 1:m
        @views S[:, t + 1] .= advance(S[:, t], r, σ, Δt, n)
    end
    discount = exp(-r * Δt)
    CFL = max.(0, K .- S)
    value = zeros(n, m)
    @. @views value[:, end] = CFL[:, end] * discount
    CV = zeros(n, m)
    for k = 1:m-1
        t = m - k
        t_next = t + 1
        @views XX = chebyshev_basis(scale(S[:, t_next]), order)
        @views YY = value[:, t_next]
        CV[:, t] .= ridge_regression(XX, YY, 100)
        @views value[:, t] = discount * w.(CFL[:, t_next] .> CV[:, t], CFL[:, t_next], value[:, t_next])
        #value[:, t] = discount * where(CFL[:, t_next] .> CV[:, t], CFL[:, t_next], value[:, t_next])
    end
    @views POF = ifelse.(CV .< CFL[:, 2:end], CFL[:, 2:end], 0 * CFL[:, 2:end])';
    FPOF = first_one!(POF)
    m_range = collect(range(0; stop=m-1, step = 1))
    dFPOF = @. (FPOF*exp(-r*m_range*Δt))   #(FPOF.*exp.(-r*m_range*Δt))
    PRICE = sum(dFPOF) / n  ##mean(dFPOF)
    return PRICE
end
2 Likes

I have not run the code, but if you manually force inlining of chebyshev_basis, you could potentially alleviate the allocations of the views inside that function. Since those occur inside multiple loops, they might add up.
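
For illustration, forcing inlining is just a matter of putting @inline in front of the definition. A minimal sketch, assuming the original chebyshev_basis looks roughly like the loop quoted above (@inline is only a hint, so the compiler may still decide otherwise):

@inline function chebyshev_basis(x, order)
    B = ones(length(x), order)        # first column is T_0(x) = 1
    B[:, 2] .= x                      # second column is T_1(x) = x
    for n in 3:order
        # Chebyshev recurrence: T_n(x) = 2x*T_{n-1}(x) - T_{n-2}(x)
        @views B[:, n] .= 2 .* x .* B[:, n - 1] .- B[:, n - 2]
    end
    return B
end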

1 Like

Why not go one step further and make it mutating?

function chebyshev_basis!(x, B)
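    # fill columns 2:end of B using the Chebyshev recurrence; column 1 is assumed to already hold ones (T_0)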
    B[:,2] .= x
    for n in range(3, stop = size(B,2))
        @. @views B[:,n] = 2 * x * B[:,n - 1] - B[:,n - 2]
    end
end

function scale!(x)
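    # rescale x in place so that its extrema map to -1 and 1 (the Chebyshev domain)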
    xmin,xmax = extrema(x)
    a = 2. / (xmax - xmin)
    b = -0.5 * a * (xmin + xmax)
    @. x = a * x + b
end

with the following in compute_price:

[...]
 chebyshev_scratchpad = ones(n, order)
    for k = 1:m-1
        t = m - k
        t_next = t + 1

        stmp = view(S, :, t_next)
        scale!(stmp)
        chebyshev_basis!(stmp, chebyshev_scratchpad)
[...]

Mutating works because the first column always remains 1, so we can reuse it, and the rest is overwritten anyway.

That change alone gives me

5:  1076.220989227295
10:  1522.414207458496
25:  2239.0458583831787

vs. the original:

5:  1411.3590717315674
10:  1967.1061038970947
25:  3036.3519191741943
5 Likes

I just learned that putting @views in front of the function header does the job all at once. I’m curious why array indexing in Julia returns a copy by default rather than a view, unlike numpy.
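
A toy sketch of what is meant (not from the benchmark code):

@views function colsums(A)
    # because of the @views on the definition, A[:, j] below is a view, not a copy
    [sum(A[:, j]) for j in axes(A, 2)]
end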

1 Like

See this discussion:

TL;DR: it is not always better in terms of performance.
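
For a given access pattern it is easy to check both, e.g. (numbers will vary, and which variant wins depends on the slice and on what you do with it):

using BenchmarkTools
A = rand(1000, 1000);
@btime sum($A[:, 1:2:1000]);        # copy: allocates a new array, then sums contiguous data
@btime sum(view($A, :, 1:2:1000));  # view: no allocation, but strided access into A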

1 Like

It might still be worth forcing inlining of the function in case the inlining heuristics didn’t choose to do it. If the function is not inlined, you’ll still pay the allocations of the views.

3 Likes

I was more concerned with reducing allocations in general, since the original function would allocate each time it was called - the mutating version claims there are no allocations happening at all, suggesting the views are eliminated. Removing the @views as well results in a ~50 ms performance hit though, presumably because of the copies that plain slicing then makes?

I didn’t notice that my laptop wasn’t hooked up to the power socket, so the timings I posted earlier are misleading - here are some more recent timings of both the original and the current modification I have on my machine:

# original:
5:  804.4960498809814
10:  1123.3201026916504
25:  2340.765953063965

# current:
5:  404.3850898742676
10:  537.7159118652344
25:  1070.1539516448975

I haven’t changed many things besides making more functions mutating, reusing views where possible, and eliminating unnecessary broadcasting/looping by combining broadcasts into a single equivalent for loop. The only function still missing is ridge_regression, but I’m not quite sure whether it’s possible to make that one mutating as well, or whether that would help at all, since so far I couldn’t get it to return the same results as the original did.


EDIT: After fixing one mistake I noticed later:

5:  392.29607582092285
10:  528.6240577697754
25:  1067.4798488616943

:slight_smile:

This is with ridge_regression! mutating only the last mul!, but it doesn’t help much. Seems very stable though.

3 Likes

This line allocates temporary results for X'X, for X'X + λ*I, and for X'Y. All of those can be done in place on preallocated storage arrays.

1 Like

Thanks, how?

Probably using mul! and ldiv!. If the width of X is small, this is unlikely to result in much performance gain though.
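
A rough sketch of what that could look like, with G (order × order), Xty, and β as preallocated scratch arrays (these names are just for illustration, not from the benchmark code):

using LinearAlgebra

function ridge_regression!(out, β, G, Xty, X, Y, λ)
    mul!(G, X', X)                      # G = X'X
    for i in axes(G, 1)
        G[i, i] += λ                    # G = X'X + λ*I
    end
    mul!(Xty, X', Y)                    # Xty = X'Y
    copyto!(β, Xty)
    ldiv!(cholesky!(Symmetric(G)), β)   # β = (X'X + λ*I) \ X'Y, in place
    mul!(out, X, β)                     # out = X*β
end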

1 Like

For what it’s worth, here’s what I got so far. @Albert_Zevelev please try it on your machine so you can compare the timings.

Check the comments for what has moved/changed or where I still think there are improvements to be made.

using Statistics, LinearAlgebra, Random, BenchmarkTools

function chebyshev_basis!(x, B)
    B[:,2] .= x
    for n in range(3, stop = size(B,2))
        @. @views B[:,n] = 2. * x * B[:,n - 1] - B[:,n - 2]
    end
end

function ridge_regression!(out, X, Y, λ)
    # should be possible to also do inplace, somehow
    β = (X'X + λ * I) \ (X'Y)
    mul!(out, X, β)
end

function scale!(x)
    xmin,xmax = extrema(x)
    a = 2. / (xmax - xmin)
    b = -0.5 * a * (xmin + xmax)
    @. x = a * x + b
end

function advance!(out, randScratch, S, r, σ, Δt, n)
    dB = sqrt(Δt) * randn!(randScratch)
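    # note: the line above still allocates a temporary for dB; sqrt(Δt) could instead be folded into the broadcast below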
    @. out = S + r * S * Δt + σ * S * dB
end

function compute_price(Spot, σ, K, r, n, m, Δt, order)
    Random.seed!(0)
    
    S = zeros(n, m + 1)
    start_view = view(S,:,1)
    randScratch = zeros(n)
    fill!(start_view,Spot)
    for t = 1:m
        # we can reuse the views from the last iteration
        next_view = view(S, :, t+1)
        advance!(next_view, randScratch, start_view, r, σ, Δt, n)
        start_view = next_view 
    end
    
    discount = exp(-r * Δt)
    CFL = max.(0, K .- S)
    value = zeros(n, m)
    value[:, end] .= view(CFL ,: , size(CFL,2)) .* discount
    
    CV = zeros(n, m)
    chebyshev_scratchpad = ones(n, order)
    for k = 1:m-1
        t = m - k
        t_next = t + 1

        stmp = view(S, :, t_next)
        scale!(stmp)
        chebyshev_basis!(stmp, chebyshev_scratchpad)

        YY = view(value, :, t_next)
        cvview = view(CV, :, t)
        ridge_regression!(cvview, chebyshev_scratchpad, YY, 100)
        
        cflview = view(CFL, :, t_next)
        for i in axes(value, 1)
            value[i, t] = discount * ifelse(cflview[i] > cvview[i], cflview[i], YY[i])
        end
    end

    # inlined first_one! by just getting the elements we care about in the first place
    @views cflend = CFL[:, 2:end]
    POF = zeros(size(CV,2),size(CV,1))
    for r in axes(POF,2)
        for c in axes(POF,1)
            if CV[r,c] < cflend[r,c]
                if cflend[r,c] > 0
                    POF[c,r] = cflend[r,c]
                    break
                end
            end
        end
    end

    # I feel like there's still some speed left here since POF is big compared to what we care about
    m_range = range(0; stop=m-1, step = 1)
    dFPOF = @. (POF*exp(-r*m_range*Δt))

    PRICE = sum(dFPOF) / n
    return PRICE
end
######

function run()
    Spot = 36.0
    σ = 0.2
    n = 100000
    m = 10
    K = 40.0
    r = 0.06
    T = 1.
    Δt = T / m
    ε = 1e-2
    for order in [5, 10, 25]
        t0 = time();
        P = compute_price(Spot, σ, K, r, n, m, Δt, order);
        dP_dS = (compute_price(Spot + ε, σ, K, r, n, m, Δt, order) - P) / ε;
        dP_dσ = (compute_price(Spot, σ + ε, K, r, n, m, Δt, order) - P) / ε;
        dP_dK = (compute_price(Spot, σ, K + ε, r, n, m, Δt, order) - P) / ε;
        dP_dr = (compute_price(Spot, σ, K, r + ε, n, m, Δt, order) - P) / ε;
        t1 = time()
        out = (t1 - t0)
        println(order, ":  ", 1000 * out)
    end
end

isinteractive() || run()

I feel like this is getting pretty off topic though, so maybe this should be moved to its own thread.

The part with the final price computation looks for the first place in each row of CFL where CFL[i, j+1] > CV[i, j] (and CFL[i, j+1] > 0) and then sums those terms. There is a much simpler way of writing it:

PRICE = 0.
@inbounds for i=1:n
    for j=1:m
        if CV[i, j] < CFL[i, j+1] && CFL[i, j+1] > 0. 
            PRICE += CFL[i, j+1]*exp(-r*(j-1)*Δt)
            break
        end
    end
end
PRICE /= n
1 Like

This subtly changes the result though :confused: (only at the level of floating-point round-off, presumably because the summation order changes): 4.452350569585344 vs 4.452350569585376 for

Spot = 36.0
σ = 0.2
n = 100000
m = 10
K = 40.0
r = 0.06
T = 1.
Δt = T / m
compute_price(Spot, σ, K, r, n, m, Δt, 5)

When I combine @Sukera's and @alvrg's suggestions, I get a big speed-up:

Now faster than Matlab, and only slower than TF/PyTorch.

7 Likes

How big do you want to go with the ridge regression? And what kind of sparsity does it have? Also, are the matrices structured in any way?

Choosing better algorithms is usually the key to winning performance battles. Sparse or Krylov algorithms can dominate for big problems, especially with good preconditioners. I don’t know the algorithms myself, but I would scan Krylov.jl and IterativeSolvers.jl to see whether any of the regularized solvers map onto ridge regression.
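
For example, something along these lines might be a starting point (a sketch; I'm assuming IterativeSolvers' lsqr exposes a damping keyword, in which case the ridge penalty λ corresponds to damp = sqrt(λ)):

using IterativeSolvers

X = rand(1000, 25); Y = rand(1000); λ = 100.0
# damped LSQR solves min ‖Xβ - Y‖² + damp²·‖β‖², i.e. the ridge problem with damp = sqrt(λ)
β = lsqr(X, Y; damp = sqrt(λ))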

4 Likes