This RBM used the scalar product:
BenchmarkTools.Trial:
memory estimate: 1.42 mb
allocs estimate: 814
--------------
minimum time: 72.793 ms (0.00% GC)
median time: 73.184 ms (0.00% GC)
mean time: 74.461 ms (0.05% GC)
maximum time: 83.571 ms (0.00% GC)
--------------
samples: 68
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Edit: same results with or without interpolation into @benchmark.
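(For anyone unfamiliar with the interpolation mentioned in the edit: BenchmarkTools lets you splice a global's value into the benchmarked expression with $, which avoids timing the lookup of a non-const global. A minimal illustration with a made-up array:)
using BenchmarkTools
v = rand(10000)
@benchmark sum(v)    # v is resolved as a non-const global on every evaluation
@benchmark sum($v)   # $ interpolates the value, so the benchmark sees a concrete array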
As a partial aside, I've been working on some code lately in which this naive implementation of sigmoid (which would be called invlogit in the statistics world) is exactly the main source of problems: this exact definition accounts for a large proportion of total time in the inner loop of my code, and it's also the primary source of numeric issues because of the limited range of x for which it produces a result that isn't exactly 0 or 1.
In statistics, you could probably do even better than optimizing sigmoid by noting that this function is almost always composed with other functions: you typically end up needing a mixture of log(sigmoid(x)) and log(1 - sigmoid(x)) when fitting logistic regression models, so optimizing those compositions could provide even greater improvements.
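For illustration, one common way of writing these in a numerically safer form looks roughly like this (just a sketch; the function names are made up, not from the code in this thread):
# sigmoid written so that exp() only ever sees a non-positive argument (no overflow to Inf)
stable_sigmoid(x) = x >= 0 ? 1 / (1 + exp(-x)) : exp(x) / (1 + exp(x))
# log(sigmoid(x)) computed directly via log1p, so it stays finite and accurate even where
# sigmoid(x) itself would round to exactly 0 or 1; log(1 - sigmoid(x)) is just logsigmoid(-x)
logsigmoid(x) = x >= 0 ? -log1p(exp(-x)) : x - log1p(exp(x))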
Thank you for taking the time to look at the code. I didn't know about the @view option or about the fact that .+= avoids reallocating memory. I have tested @Evizero's code, but I still get a lot of GC. This is what I get:
memory estimate: 1.33 gb
allocs estimate: 22845
minimum time: 601.147 ms (20.87% GC)
median time: 633.693 ms (20.14% GC)
mean time: 624.172 ms (20.33% GC)
maximum time: 648.480 ms (19.94% GC)
Should I test the code in Julia 0.6? I am testing it in Julia 0.5.
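(For reference, the two tricks mentioned above look roughly like this; a minimal sketch with made-up arrays, not the actual RBM code:)
A = rand(100, 100)
b = rand(100)
col = @view A[:, 1]   # a view shares memory with A instead of copying the column out
b .+= col             # fused in-place broadcast: updates b without allocating a new array (as of Julia 0.6)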
I'm surprised nobody has mentioned BLAS yet - I implemented exactly the same model a while ago, and BLAS gave a pretty good performance boost and zero memory allocation (at least in those earlier days of Julia). For example, in real-life settings the expression above would work with mini-batches of, say, 1000 vectors (e.g. size(ehp) == (255, 1000); size(x) == (784, 1000)) and can be rewritten as:
@benchmark Delta_W = lr .* (ehp * x' .- ehn * xneg')
BenchmarkTools.Trial:
memory estimate: 5.38 mb
allocs estimate: 22
--------------
minimum time: 12.316 ms (0.00% GC)
median time: 29.460 ms (0.00% GC)
mean time: 28.002 ms (0.76% GC)
maximum time: 50.396 ms (0.00% GC)
--------------
samples: 179
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
or using BLAS:
import Base.LinAlg.BLAS: gemm!, axpy!, scal!   # in-place BLAS wrappers (Base.LinAlg in Julia 0.5/0.6)
@benchmark begin
    gemm!('N', 'T', 1.0, ehp, x, 0.0, Delta_W)   # Delta_W = ehp * x'
    gemm!('N', 'T', 1.0, ehn, xneg, 0.0, buf)    # buf = ehn * xneg'
    axpy!(-1.0, buf, Delta_W)                    # Delta_W .-= buf
    scal!(length(Delta_W), lr, Delta_W, 1)       # Delta_W .*= lr
end
BenchmarkTools.Trial:
memory estimate: 0.00 bytes
allocs estimate: 0
--------------
minimum time: 10.181 ms (0.00% GC)
median time: 10.623 ms (0.00% GC)
mean time: 11.759 ms (0.00% GC)
maximum time: 45.104 ms (0.00% GC)
--------------
samples: 426
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Although I haven't checked it on 0.6, so the latest Julia may still beat these results.
EDIT: I made a mistake which resulted in an unfair comparison. The gist I initially posted in this exact post was wrong, as its BLAS implementation was intended to do the whole loop at once while the others do just one iteration. This mistake does not affect anything I posted above.
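(For context, a rough setup for reproducing the two benchmarks above could look like the following; only the array sizes come from the post itself, the learning rate and buffer initialization are made up:)
lr      = 0.1                                        # assumed learning rate
x       = rand(784, 1000); xneg = rand(784, 1000)    # visible mini-batches
ehp     = rand(255, 1000); ehn  = rand(255, 1000)    # hidden-unit expectations
Delta_W = zeros(255, 784)                            # preallocated weight update (ehp * x' is 255x784)
buf     = zeros(255, 784)                            # preallocated scratch buffer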
@Evizero, it might be nice to have some version of this as a benchmark in the BaseBenchmarks package, to help us track Julia performance, if you want to do a pull request.
As soon as you are working with multiple vectors at once, so that you are doing BLAS3 operations (matrix multiplications etc.), then I agree that you definitely want to exploit a fast BLAS (like the OpenBLAS that Julia links). However, you don't generally need to call low-level BLAS functions like gemm! directly. A_mul_Bt! and similar high-level functions are just as efficient.
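For example, the first gemm! call in the snippet above could just as well be written with the high-level wrapper (Julia 0.5/0.6 spelling; on 0.7+ this became mul!(Delta_W, ehp, transpose(x))):
A_mul_Bt!(Delta_W, ehp, x)   # Delta_W = ehp * x', dispatching to BLAS gemm! for dense Float64 arrays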
For BLAS1 operations like axpy! and scal!, you are probably better off with the fusing broadcast operations. The increase in locality and reduction of other overheads that you get from fusing the loops will beat the minor optimizations that OpenBLAS can do for BLAS1 operations.
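Concretely, the axpy!/scal! pair above could be collapsed into a single fused in-place broadcast (assuming, as in that snippet, that the two gemm! products are already sitting in Delta_W and buf):
Delta_W .= lr .* (Delta_W .- buf)   # one pass over the data, no BLAS1 calls, no temporaries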
(And if most of your time is spent in BLAS3 operations, then you should expect essentially the same performance in Julia and NumPy, assuming that they are using the same BLAS library.)
Thanks a lot for your time. It turns out that most of the speedup came from a transpose that your code avoided, not from the fact that you used A_mul_B!.
In your version you wrote
Delta_W .+= lr .* (ehp .* x' .- ehn .* xneg')
In my version I had
Delta_W .+= lr * (x * ehp' - xneg * ehn')'
I think you'd be happy to hear that Julia in v0.6 will "take transposes seriously" via the type system, making it essentially a free operation. It was a huge thread, culminating in:
Most nightlies haven't had an update since then (only COPR has), so if you didn't build it from source or use COPR, you won't have that update. It will say how many days it is from the last master at the top of the REPL when you open it up.
@davidbp is talking about the outer matrix transpose. I don't think the RowVector PR affects that.
EDIT: also, it looks like lr * (...)' gets translated into A_mul_Bc(lr, ...), so I don't think (?) that the outer transpose influences much in your particular code.
Aside from that, also observe how your version uses * where the other uses .*. The * (without the dot) causes broadcast fusion to stop, which causes the creation of temporary arrays. So there are really a few non-obvious sources of performance penalties here.
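A minimal illustration of where fusion stops (made-up vectors, not the RBM code):
a = rand(1000); b = rand(1000); c = similar(a)
c .= 2 .* a .+ b   # fully fused: a single loop writing into c, no temporaries
c .= 2 * a .+ b    # 2 * a is materialized as a temporary array before the fused .+ and .=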
The change @ChrisRackauckas is talking about affects the inner vector transposes and so gives an additional performance boost.