Kron vs scalar product speed difference: Python code faster?

Hello,

I was reimplementing some Python code I had and realized that it was much slower in Julia.
The two functions I am comparing are written below and can be found here (https://github.com/davidbp/learn_julia/tree/master/speed_tests).

The Python code takes 0.4 seconds and the Julia code takes 1.8 or 3.5 seconds, depending on whether I use the standard product or kron (I also do not understand the huge speed difference between these two functions).

The code is not fully vectorized, and I was expecting it to be faster in Julia than in Python. The fully vectorized version in Python is even faster.

The Python and Julia code are essentially line-for-line translations of each other, but the number of allocations and the GC time in the Julia code suggest I am doing something wrong.

python rbm.py
— 0.41869401931762695 seconds —

julia rbm.jl

This RBM used scalar product
1.811265 seconds (580.21 k allocations: 1.631 GB, 16.26% gc time)

This RBM used kron product
3.495652 seconds (2.62 M allocations: 1.673 GB, 4.42% gc time)

##### Julia function

function contrastive_divergence_col_K(Xbatch, rbm, K::Int64, lr::Float64)
        
    batch_size = size(Xbatch)[2]

    Delta_W = zeros(size(rbm.W))
    Delta_b = zeros(size(rbm.vis_bias))
    Delta_c = zeros(size(rbm.hid_bias))

    for i in 1:batch_size
        x =  Xbatch[:,i]
        xneg = Xbatch[:,i]

        for k in 1:K
            hneg = sigmoid( rbm.W * xneg .+ rbm.hid_bias) .> rand(rbm.n_hid)
            xneg = sigmoid( rbm.W' * hneg .+ rbm.vis_bias) .> rand(rbm.n_vis)
        end

        ehp = sigmoid(rbm.W * x + rbm.hid_bias)
        ehn = sigmoid(rbm.W * xneg + rbm.hid_bias)
 
        ### kron vs no kron???
        #Delta_W += lr * ( x * ehp' - xneg * ehn')'
        Delta_W += lr * (kron(x, ehp') - kron(xneg, ehn'))'
        Delta_b += lr * (x - xneg)
        Delta_c += lr * (ehp - ehn)

    end

    rbm.W += Delta_W / batch_size;
    rbm.vis_bias += Delta_b / batch_size;
    rbm.hid_bias += Delta_c / batch_size;
    
    return
end

##### Python function

def contrastive_divergence_rows_K(Xbatch, rbm, K, lr):
        
    batch_size = Xbatch.shape[0]

    Delta_W = np.zeros(rbm.W.shape)
    Delta_b = np.zeros([rbm.n_vis,1])
    Delta_c = np.zeros([rbm.n_hid,1])

    for i in range(batch_size):
        x =  Xbatch[i:i+1,:]
        xneg = Xbatch[i:i+1,:]

        for k in range(K):
            hneg = sigmoid( np.dot(xneg,rbm.W) + rbm.hid_bias.T) > np.random.rand(1,rbm.n_hid)
            xneg = sigmoid( np.dot(hneg,rbm.W.T) + rbm.vis_bias.T) > np.random.rand(1,rbm.n_vis)

        ehp = sigmoid( np.dot(x,rbm.W) + rbm.hid_bias.T)
        ehn = sigmoid( np.dot(xneg,rbm.W) + rbm.hid_bias.T)
        
        Delta_W += lr * ( x * ehp.T - xneg * ehn.T).T
        Delta_b += lr * (x - xneg).T
        Delta_c += lr * (ehp - ehn).T
    
    rbm.W += Delta_W / batch_size;
    rbm.vis_bias += Delta_b / batch_size;
    rbm.hid_bias += Delta_c / batch_size;
    
    return

Have you tried the performance tips? In particular, are you running @time on the second run? Have you profiled your code? Where are the hotspots?
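
To illustrate the @time point with a toy example (the first call includes JIT compilation, so only the second measurement reflects the run time):

f(x) = sum(abs2, x)
v = rand(10^6)

@time f(v)   # first call also compiles f for Vector{Float64}
@time f(v)   # second call measures just the execution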

You are allocating tons of temporary arrays in your Julia code, even more than in Python. Slices create views in NumPy but copies in Julia (you can use @view to get a view instead of a copy), and x += y is an in-place updating operator in NumPy but a synonym for x = x + y in Julia, which allocates a new array. ' creates copies too (though it probably won't in Julia 0.6).
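
A minimal sketch of those two points, with made-up sizes rather than the RBM's:

Xbatch = rand(784, 100)

x_copy = Xbatch[:, 1]          # plain indexing: allocates a fresh Vector
x_view = @view Xbatch[:, 1]    # a view into Xbatch: no copy is made

Delta_b = zeros(784)
Delta_b += x_copy              # rebinds Delta_b to a brand-new array every time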

You haven’t described your rbm data structure or the type of Xbatch, so that is hard to evaluate.

Julia 0.6 will have fusing .+ and other “dot” operators that can avoid all of the temporaries and let you use pre-allocated arrays and do everything in-place (though it will still require you to modify your code somewhat).
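
For example, a sketch of what a fully fused, in-place update of Delta_b would look like on 0.6 (the names mirror the function above, but the arrays here are stand-ins):

Delta_b = zeros(784); x = rand(784); xneg = rand(784); lr = 0.1
Delta_b .+= lr .* (x .- xneg)   # one fused loop that writes into Delta_b, no temporaries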

The huge difference in allocation count between kron and x * y' suggests to me that you have a type instability. Usually a very minor tweak will fix this. (Declaring ::Int64 and ::Float64 explicitly in the function arguments does nothing for performance, by the way; Julia’s JIT compiler automatically specializes the function for the argument types that are passed.) What is the output of @code_warntype?
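
For reference, one way to check (assuming Xbatch and rbm are defined as in the linked script; the exact output depends on the RBM type definition, which is not shown here):

# Variables reported as ::Any or as a Union are candidate instabilities; xneg being
# rebound to a BitArray by the .> comparison after the first Gibbs step is one likely suspect.
@code_warntype contrastive_divergence_col_K(Xbatch, rbm, 1, 0.1)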

The += updates are not in place in Julia, though they are in Python. If you are on master, you can make them fully in place with .+=. Writing the loops explicitly should also help reduce the allocations further.
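
A sketch of the explicit-loop version for one of the updates (stand-in arrays; this is in place even on 0.5):

Delta_b = zeros(784); x = rand(784); xneg = rand(784); lr = 0.1
for j in eachindex(Delta_b)
    Delta_b[j] += lr * (x[j] - xneg[j])   # updates the existing array element by element
end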

The Xbatch[:,i] slicing is also only copy-free in Python, and you are even doing it twice. Either use @view or write out the loop explicitly.

Hi there, I did some of the basic optimizations that caught my eye, such as preallocation (without testing for correctness). Admittedly I didn't spend much time on it, so there may be some oversights, but it seems to do the same thing:

https://github.com/Evizero/learn_julia/blob/master/speed_tests/rbm.jl#L2-L75

Results on my machine:


python

--- 0.441383123398 seconds ---


julia 0.5:

 This RBM used scalar product
        BenchmarkTools.Trial: 
  memory estimate:  1.07 gb
  allocs estimate:  17850
  --------------
  minimum time:     394.541 ms (4.34% GC)
  median time:      397.300 ms (4.33% GC)
  mean time:        403.281 ms (5.74% GC)
  maximum time:     471.767 ms (20.14% GC)
  --------------
  samples:          13
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia 0.6 (1 day old master)

 This RBM used scalar product
        BenchmarkTools.Trial: 
  memory estimate:  278.66 mb
  allocs estimate:  1219
  --------------
  minimum time:     271.952 ms (2.07% GC)
  median time:      273.818 ms (2.09% GC)
  mean time:        286.079 ms (6.11% GC)
  maximum time:     379.995 ms (29.19% GC)
  --------------
  samples:          18
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

(By the way, a common experience for people coming from Matlab or Python is that the first Julia code you write is slow, because line-for-line translations to Julia are often a poor fit. Typically, small tweaks make the code 10x faster or more, and I expect that’s the case for your code, which looks like it is leaving a lot of performance on the table.)

Nice job!

You are still creating copies with your slices. Try x = @view Xbatch[:,i] and xneg = @view Xbatch[:,i] to create views instead.

Note that the rand(n) function still allocates a new array on each inner-loop iteration. In 0.5, you can use rand! with a pre-allocated array. In 0.6, you can just do ... .> rand.() and it will do the right thing (call rand() for each element, with no allocations).
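
A sketch of the 0.5 pattern with a pre-allocated buffer (the buffer name is made up; the comparison line follows the inner loop above):

n_hid = 225
rand_hid = zeros(n_hid)         # allocated once, outside the loop
for k in 1:10
    rand!(rand_hid)             # refills the buffer in place, no allocation
    # ... then compare: sigmoid(rbm.W * xneg .+ rbm.hid_bias) .> rand_hid
end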

The code:

        A_mul_Bt!(k1, x, ehp)
        A_mul_Bt!(k2, xneg, ehn)
        k3 .= k1 .- k2
        Delta_W .+= lr .* k3'

should really be:

Delta_W .+= lr .* (ehp .* x' .- ehn .* xneg')

so that the loops are all fused and no temporaries are created (in 0.6)…you especially don’t want to create big temporary matrices k1 and k2 and k3. (Note that the final matrix transpose can be combined with the earlier transpositions to eliminate the final transpose.) The only problem is that the transposes x' and xneg' create copies, but this will be fixed in 0.6 once https://github.com/JuliaLang/julia/pull/19670 is merged.

(sigmoid is a pretty common special function. I bet that it can be made significantly faster than 1/(1+exp(-x)), e.g. by optimizing minimax rational functions or polynomials specifically for sigmoid… this would be a nice project for a SpecialFunctions package [at some point, many of the special functions like erf and gamma are likely to be moved out of Base into a package]. Specialized scalar code like this vectorizes terribly, so it is the sort of thing that you can only implement efficiently in Julia or a static language like C. For example, 0.6 replaces the libm exp function with a pure-Julia exp that is around 15–30% faster (https://github.com/JuliaLang/julia/pull/19831), and the potential speedups for sigmoid are much greater than this.)
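
(For reference, the thread never shows the sigmoid helper itself; a minimal definition consistent with the 1/(1+exp(-x)) formula above would be something like the following, though the linked repository may define it differently.)

sigmoid(x::Real) = 1 / (1 + exp(-x))          # naive scalar definition
sigmoid(v::AbstractArray) = map(sigmoid, v)   # vectorized method, as called in the code above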

I don’t really understand what is happening in this step, to be honest (aside from combining the transpose). In the original code, x * ehp' is a (784,1) x (1,225) matrix multiplication, which results in a (784,225) matrix. Can I really compute the result using just an element-wise operation?

Yes, because .* is not element-wise multiplication; it is a broadcast operation. For the special case of two arrays of the same shape it is element-wise, but for combining a row vector and a column vector it “expands” out into a matrix.
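
A quick way to convince yourself, with sizes matching the (784,1) x (1,225) case above:

x = rand(784)        # column vector
ehp = rand(225)      # column vector, so ehp' is a 1x225 row
B = x .* ehp'        # broadcasting expands this to a 784x225 matrix
B == x * ehp'        # each entry is x[i]*ehp[j], the same as the outer product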

Without this last bit, but with the other feedback, this is the current result (pushed to GitHub):

 This RBM used scalar product
        BenchmarkTools.Trial: 
  memory estimate:  6.65 mb
  allocs estimate:  418
  --------------
  minimum time:     147.375 ms (0.00% GC)
  median time:      148.307 ms (0.00% GC)
  mean time:        148.604 ms (0.11% GC)
  maximum time:     154.239 ms (0.00% GC)
  --------------
  samples:          34
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

You forgot xneg = @view Xbatch[:,i]

xneg is mutated in the inner loop. It would seem like cheating a bit to change the input data.

Oh, right. In that case you can pre-allocate it and do

xneg .= @view Xbatch[:,i]
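
A minimal sketch of that pattern (with stand-in sizes), so the buffer is reused across the whole batch:

Xbatch = rand(784, 100)
xneg = zeros(size(Xbatch, 1))       # allocated once, before the batch loop
for i in 1:size(Xbatch, 2)
    xneg .= @view Xbatch[:, i]      # overwrites the buffer; no new array per column
end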

Interestingly, with the broadcast trick and the view (although it is mainly the broadcast trick), it got a little worse:

 This RBM used scalar product
        BenchmarkTools.Trial: 
  memory estimate:  3.87 mb
  allocs estimate:  1614
  --------------
  minimum time:     172.006 ms (0.00% GC)
  median time:      172.904 ms (0.00% GC)
  mean time:        173.232 ms (0.08% GC)
  maximum time:     177.797 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

I can’t reproduce this.

# foo: fully fused broadcast, no temporary matrices
foo(A,a,x1,y1,x2,y2) = A .+= a .* (x1 .* y1' .- x2 .* y2')

# bar: uses in-place A_mul_Bt! but still needs the temporaries T1 and T2
function bar(A,a,x1,y1,x2,y2, T1,T2)
    A_mul_Bt!(T1, x1, y1)
    A_mul_Bt!(T2, x2, y2)
    A .+= a .* (T1 .- T2)
end

m, n = 784, 225
A = zeros(m,n); a = 1.0; x1 = zeros(m); y1 = zeros(n); x2 = copy(x1); y2 = copy(y1); T1 = copy(A); T2 = copy(A);

using BenchmarkTools
@benchmark foo($A, $a, $x1, $y1, $x2, $y2)
@benchmark bar($A, $a, $x1, $y1, $x2, $y2, $T1, $T2)

On my machine, foo is about 360µs and bar is about 514µs, so fusing the loops and avoiding the T1 and T2 arrays gave me a 40% speedup. (This is only on 0.6, of course.)

(On the RowVector branch of 0.6, where transpose avoids the copy, it is only 140µs on my machine, which is a pretty stunning speedup.)

I can reproduce your observation (350µs foo vs 460µs bar). Interestingly, if I don’t interpolate the variables into @benchmark, bar is faster and foo slower (420µs foo vs 370µs bar). Could it be that the interpolation of the additional two variables puts bar at a disadvantage? To be honest, based on what I can read in the BenchmarkTools manual, I don’t understand why the one speeds up and the other slows down. I checked multiple times. Beside the point, I guess.

I’ll try the RowVector branch

This thread is really interesting reading. It would be super cool if you could post the final optimized code (to be compared to the code in the OP) when this has been optimized completely - I think that would be a cool learning resource.

As explained in the BenchmarkTools documentation, you should generally interpolate all function parameters into a @benchmark expression. The reason is that the parameters are often (as here) global variables, and you don’t want to benchmark the overhead of accessing non-constant globals, whose types Julia cannot infer. Interpolating basically means that the parameters are evaluated before benchmarking, so that once the benchmark starts Julia already knows their types.
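
A small illustration of the difference (not from the thread):

using BenchmarkTools

v = rand(1000)        # a non-constant global; its type could change at any time

@benchmark sum(v)     # also measures the overhead of working with the untyped global
@benchmark sum($v)    # $ evaluates v up front, so only sum on a Vector{Float64} is timed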