Julia slower than Matlab & Python? No

Albert_Zevelev · January 8, 2020, 11:57pm

A recent paper ML Software & Hardware for Econ finds Julia is often slower than Matlab & Numpy.
Code for Option Pricing is here.
Code for Dynamic Programming is here.

Usually, writing Julia code similarly leads to enormous speed-up.
How can we tweak their code for a more accurate comparison???

Update:
Dynamic Programming: Stefan, Tim, Mason tweaked the code & Julia is faster than Matlab/Python/R. Julia is faster than C++ in all but the two smallest grids.

Option Pricing: if you replace the awkward vectorized parts of the code, w/ simpler more intuitive loops, the Julia code easily outperforms the others (except TF/PyTorch which I don’t have right now).

In the paper they also run PyTorch & TensorFlow on GPU.
@Mason asks if anyone can speed this up w/ Julia on a GPU as well??? (I think this would be awesome bc Julia is known to be great on GPU…)

Mason · January 9, 2020, 12:03am

Can this be correct???

It’s possible to write arbitrarily slow code in any language if you write the code in a sufficiently preserve way. In fact, it’s much easier to write slow code than fast code.

Albert_Zevelev · January 9, 2020, 12:07am

Yeah but it looks like they tried to write the code similarly across languages.

Usually, writing Julia code similarly leads to enormous speed-up.
There must be something wrong. It’d be great to let them know before they publish it soon.

longemen3000 · January 9, 2020, 12:21am

The code, with some comments of my part:

using DelimitedFiles, Statistics

# Load grid for log(y) and transition matrix
logy_grid = readdlm("logy_grid.txt")[:] 
Py = readdlm("P.txt")
#those 2 variables should be const

function main(nB=351, repeats=500)
    β = .953
    γ = 2.
    r = 0.017
    θ = 0.282
    ny = size(logy_grid, 1)

    Bgrid = LinRange(-.45, .45, nB)
    ygrid = exp.(logy_grid)

    ymean = mean(ygrid) .+ 0 * ygrid
    def_y = min(0.969 * ymean, ygrid)

    Vd = zeros(ny, 1)
    Vc = zeros(ny, nB)
    V = zeros(ny, nB)
    Q = ones(ny, nB) * .95

    y = reshape(ygrid, (ny, 1, 1))
    B = reshape(Bgrid, (1, nB, 1))
    Bnext = reshape(Bgrid, (1, 1, nB))

    zero_ind = Int(ceil(nB / 2))

    function u(c, γ) #add the broadcast in the call? 
        return c.^(1 - γ) / (1 - γ)
    end


    t0 = time()
    function iterate(V, Vc, Vd, Q)
        EV = Py * V #array created, it should be preallocated
        EVd = Py * Vd#array created
        EVc = Py * Vc#array created

        Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:])
#calling a[:] allocates a new array
        Vd_target = reshape(Vd_target, (ny, 1))

        Qnext = reshape(Q, (ny, 1, nB))

        c = @. y .- Qnext .* Bnext .+ B #array created
        c[c .<= 0] = 1e-14 .+ 0 * c[c .<= 0]
        EV = reshape(EV, (ny, 1, nB))
        m =  @. u(c, γ) .+ β * EV #array created
        Vc_target = reshape(maximum(m, dims=3), (ny, nB))

        Vd_compat = Vd * ones(1, nB) #array created
        default_states = float(Vd_compat .> Vc)
        default_prob = Py * default_states #array created
        Q_target = (1. .- default_prob) ./ (1 + r)

        V_target = max.(Vc, Vd_compat) #array created

        return V_target, Vc_target, Vd_target, Q_target

    end

    iterate(V, Vc, Vd, Q)  # warmup
    t0 = time()
    for iteration in 1:repeats
        V, Vc, Vd, Q = iterate(V, Vc, Vd, Q)
    end
    t1 = time()
    out = (t1 - t0) / repeats

end

print(1000 * main(1351, 10))

In short, a great speedup can be achieved with just changing some calls with implace versions and preallocating the arrays before the iterate definition

aaowens · January 9, 2020, 12:28am

It looks like this code is mostly matrix and vector operations, which explains how they have a PyTorch version. It’s not clear that Julia should be any better at this, especially since Matlab uses multithreading by default and Julia doesn’t except for BLAS operations.

Albert_Zevelev · January 9, 2020, 12:31am

I see. But does that help explain why Julia is outperformed by regular Python/numpy?

jling · January 9, 2020, 12:32am

numpy is basically C, if the overhead of the python interperater is minimal, numpy should be as fast as C since it’s only ~function call slower

Oscar_Smith · January 9, 2020, 12:42am

A cursory look through the code also suggests that they are using 1 in a bunch of places where 1.0 would be better. I’m not sure how much performance difference it makes here though.

Zach_Christensen · January 9, 2020, 2:55am

I don’t care to figure out exactly what they’re trying to do so I apologize for the half-hearted answer, but a lot of stuff there smells bad. This one in particular just seems poorly planned.

default_states = float(Vd_compat .> Vc)

Also, similar syntax is a pretty bad standard for equating similar levels of effort for each set of code. They should seriously think about using the number of lines of code as a better standard. Similar syntax only ensures easy reading and implementation for those that use that type of syntax.

P.S.
Their R code is also less than spectacular but why bother optimizing R code?

StefanKarpinski · January 9, 2020, 3:45am

Y’all are missing the big one and first thing to check: the code uses non-constant globals.

https://github.com/vduarte/benchmarkingML/pull/1

More speedup is certainly possible, but this is the big performance killer.

jling · January 9, 2020, 4:35am

I actually tried to make them global and saw no significant improvement.

StefanKarpinski · January 9, 2020, 5:11am

Do you mean const?

jling · January 9, 2020, 5:14am

yes, exactly as your pr,

10975.95670223236% # with const                                                                                                                                    

11052.845978736877%

StefanKarpinski · January 9, 2020, 5:16am

Strange.

Albert_Zevelev · January 9, 2020, 5:17am

They find similar results for option pricing:
https://github.com/vduarte/benchmarkingML/tree/master/LSMC
I’m not sure if this also applies

Sijun · January 9, 2020, 6:23am

Profiler shows that most of time is spent on broadcasting

c = @. y .- Qnext .* Bnext .+ B
and
m = @. u(c, γ) .+ β * EV

Maybe Julia’s broadcasting has more room for improvement?

DNF · January 9, 2020, 6:25am

Why the double broadcasting? Both . and @., I mean.

Sijun · January 9, 2020, 6:27am

I tried with and without it. Without @. took about 10% more time.

DNF · January 9, 2020, 7:20am

You should just use @., then, not both.

Sijun · January 9, 2020, 8:19am

There seems to be some performance difference in broadcasting operation between Julia and numpy.

C = np.random.rand(256, 1351, 1351)

%%timeit
C**0.3

15.3 s ± 805 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compare to:

const C=rand(256, 1351, 1351)

@time C.^0.3  # second time calling

34.271383 seconds (8 allocations: 3.481 GiB, 0.01% gc time)

Topic		Replies	Views
Julia is significantly slower (~18 x) than Matlab in vector and matrix algebra New to Julia	32	1873	June 25, 2023
Julia can be dramatically slower than Matlab when solving 2D PDE New to Julia performance , matlab , pde	42	1193	November 9, 2024
Why Are Languages Like Python and MATLAB So Much Slower than Julia? Internals & Design question , first-steps , performance	8	2577	December 7, 2020
Matlab versus Julia General Usage	33	4935	July 15, 2021
My julia code is somehow much slower than the matlab code New to Julia question , performance , matlab	55	4338	December 30, 2022

Julia slower than Matlab & Python? No

Related topics