Kron vs scalar product speed difference. Python code faster?

Final result on ajf/rowvector/af9a28f

This RBM used the scalar product:
BenchmarkTools.Trial:
  memory estimate:  1.42 mb
  allocs estimate:  814
  --------------
  minimum time:     72.793 ms (0.00% GC)
  median time:      73.184 ms (0.00% GC)
  mean time:        74.461 ms (0.05% GC)
  maximum time:     83.571 ms (0.00% GC)
  --------------
  samples:          68
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Edit: Same results with or without interpolation into @benchmark.
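(For context, interpolation here means splicing a global into the benchmarked expression with $ so it is treated like a local; a minimal sketch with BenchmarkTools.jl and a throwaway array:)

    using BenchmarkTools

    x = rand(10^6)

    @benchmark sum(x)    # x is looked up as a non-constant global on every evaluation
    @benchmark sum($x)   # $x interpolates the value, so the expression sees a local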

Thank you for taking the time and making an effort to explain things to me. I learned a few new things today, which makes this a good day in my book.

I think I understand this aspect well enough… or maybe not. The point I don't get is how globals benefit one and hurt the other.

https://gist.github.com/Evizero/49dbc204b772ed63f6fbbf573e8a21da

Thanks!
ā† 20 char word limit ā†’

As a partial aside, I've been working on some code lately in which this naive implementation of sigmoid (which would be called invlogit in the statistics world) is exactly the main source of problems: this exact definition accounts for a large proportion of total time in the inner loop of my code, and it's also the primary source of numeric issues because of the limited range of x for which it produces a result that isn't exactly 0 or 1.
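For concreteness, the definition in question is essentially this (a sketch, not my actual code), and the comments mark where it saturates in Float64:

    # Naive logistic sigmoid / invlogit.
    # For x above roughly 37, 1 + exp(-x) rounds to 1.0, so the result is exactly 1.0.
    # For x below roughly -710, exp(-x) overflows to Inf, so the result is exactly 0.0.
    naive_sigmoid(x) = 1 / (1 + exp(-x))

    naive_sigmoid(40.0)    # 1.0 exactly
    naive_sigmoid(-800.0)  # 0.0 exactly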

Also worth noting that StatsFuns.jl implements this function as logistic using exactly this naive formulation with some added generic typing: https://github.com/JuliaStats/StatsFuns.jl/blob/ed4867460e4df2cb3ffd52dcde3875b4b635572f/src/basicfuns.jl#L12

In statistics, you could probably do even better than optimizing sigmoid by noting that this function is essentially always composed with other functions: you typically end up needing both log(sigmoid(x)) and log(1 - sigmoid(x)) when fitting logistic regression models, so optimizing those compositions could provide even greater improvements.
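As a rough illustration (my own sketch, not code from StatsFuns.jl), the compositions can be rewritten with log1p so they stay finite for any Float64 input:

    # log(sigmoid(x)) = -log(1 + exp(-x)); branch so the argument of exp is never positive.
    logsigmoid(x) = x >= 0 ? -log1p(exp(-x)) : x - log1p(exp(x))

    # log(1 - sigmoid(x)) is just the mirror image.
    log1msigmoid(x) = logsigmoid(-x)

    logsigmoid(-1000.0)           # ≈ -1000.0
    log(1 / (1 + exp(1000.0)))    # -Inf: the naive composition underflows to log(0.0)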

Thank you for taking the time to look at the code. I didn't know about the @view option or about the fact that .+= avoids reallocating memory. I have tested @Evizero's code but I still get a lot of GC. This is what I get:

memory estimate: 1.33 gb
allocs estimate: 22845

minimum time: 601.147 ms (20.87% GC)
median time: 633.693 ms (20.14% GC)
mean time: 624.172 ms (20.33% GC)
maximum time: 648.480 ms (19.94% GC)

Should I test the code in Julia 0.6? I am testing it in Julia 0.5.
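Here is a minimal sketch of those two tricks, with made-up sizes; on 0.6, where dotted operators fuse, the .+= line updates W in place without temporaries, and @view avoids copying the slice:

    W  = zeros(225, 784)    # weight-gradient accumulator (sizes just for illustration)
    G  = rand(225, 784)
    X  = rand(784, 1000)
    lr = 0.1

    col = @view X[:, 1]     # no copy; a plain X[:, 1] allocates a new vector
    W .+= lr .* G           # fused in-place update; W += lr * G would allocate a temporary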

I get the same result with the latest version on 0.5 (the latest version is really 0.6-focused).

For 0.6 master: memory estimate: 3.86 mb

I'm surprised nobody has mentioned BLAS yet - I implemented exactly the same model a while ago, and BLAS gave a pretty good performance boost and zero memory allocation (at least in those earlier days of Julia). For example, in real-life settings the expression above would work with mini-batches of, say, 1000 vectors (e.g. size(ehp) == (255, 1000); size(x) == (784, 1000)) and can be rewritten as:

@benchmark Delta_W = lr .* (ehp * x' .- ehn * xneg')     
 BenchmarkTools.Trial:       
   memory estimate:  5.38 mb 
   allocs estimate:  22   
   --------------                                                
   minimum time:     12.316 ms (0.00% GC)    
   median time:      29.460 ms (0.00% GC)      
   mean time:        28.002 ms (0.76% GC)       
   maximum time:     50.396 ms (0.00% GC)   
   --------------                  
   samples:          179    
   evals/sample:     1     
   time tolerance:   5.00%    
   memory tolerance: 1.00%

or using BLAS:

@benchmark begin
    gemm!('N', 'T', 1.0, ehp, x, 0.0, Delta_W)   # Delta_W = ehp * x'
    gemm!('N', 'T', 1.0, ehn, xneg, 0.0, buf)    # buf     = ehn * xneg'
    axpy!(-1.0, buf, Delta_W)                    # Delta_W -= buf
    scal!(176400, lr, Delta_W, 1)                # Delta_W *= lr
end
 BenchmarkTools.Trial:              
   memory estimate:  0.00 bytes 
   allocs estimate:  0 
   --------------                                                
   minimum time:     10.181 ms (0.00% GC)
   median time:      10.623 ms (0.00% GC)  
   mean time:        11.759 ms (0.00% GC)    
   maximum time:     45.104 ms (0.00% GC)     
   --------------                         
   samples:          426           
   evals/sample:     1             
   time tolerance:   5.00%     
   memory tolerance: 1.00%
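(A minimal setup that the snippet above assumes; the names come from the thread, the sizes are my guess, with 225 × 784 chosen so that length(Delta_W) matches the 176400 passed to scal!; on Julia 0.5/0.6 these wrappers live in Base.LinAlg.BLAS:)

    import Base.LinAlg.BLAS: gemm!, axpy!, scal!

    lr   = 0.1
    ehp  = rand(225, 1000);  ehn  = rand(225, 1000)   # hidden-unit activations per mini-batch
    x    = rand(784, 1000);  xneg = rand(784, 1000)   # visible samples per mini-batch
    Delta_W = zeros(225, 784)   # receives lr * (ehp * x' - ehn * xneg')
    buf     = zeros(225, 784)   # scratch buffer for ehn * xneg'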

I haven't checked it on 0.6, though, so the latest Julia may still beat these results.

EDIT: I made a mistake which resulted in an unfair comparison. The gist I initially posted here in this exact post was wrong, as the BLAS implementation is intended to do the whole loop at once while the others do just one iteration. This mistake does not affect anything I posted above.

Ah. I think I have a conceptual error here and am comparing the wrong things.

EDIT: this is only in reference to my last post concerning BLAS.

@Evizero, it might be nice to have some version of this as a benchmark in the BaseBenchmarks package, to help us track Julia performance, if you want to do a pull request.

As soon as you are working with multiple vectors at once, so that you are doing BLAS3 operations (matrix multiplications etc.), I agree that you definitely want to exploit a fast BLAS (like the OpenBLAS that Julia links). However, you don't generally need to call low-level BLAS functions like gemm! directly; A_mul_Bt! and similar high-level functions are just as efficient.
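For example, with the same arrays as in the BLAS snippet above, the two gemm! calls could be written as follows (a sketch; A_mul_Bt!(C, A, B) computes C = A*B' in place and dispatches to BLAS for dense Float64 matrices):

    A_mul_Bt!(Delta_W, ehp, x)    # Delta_W = ehp * x', no allocation
    A_mul_Bt!(buf, ehn, xneg)     # buf     = ehn * xneg', no allocation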

For BLAS1 operations like axpy! and scal!, you are probably better off with fused broadcast operations. The increase in locality and the reduction of other overheads that you get from fusing the loops will beat the minor optimizations that OpenBLAS can do for BLAS1 operations.
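Concretely (again just a sketch, assuming 0.6 where dotted calls fuse), the axpy!/scal! pair above collapses into one in-place broadcast:

    # One fused loop over all elements, writing straight into Delta_W.
    Delta_W .= lr .* (Delta_W .- buf)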

(And if most of your time is spent in BLAS3 operations, then you should expect essentially the same performance in Julia and NumPy, assuming that they are using the same BLAS library.)

Thank you very much for your time. It turns out that most of the speedup came from a transpose that your code avoided, not from the fact that you used A_mul_B!.

In your version you wrote:
Delta_W .+= lr .* (ehp .* x' .- ehn .* xneg')

In my version I had:
Delta_W .+= lr * (x * ehp' - xneg * ehn')'

Here are some tests:
https://github.com/davidbp/learn_julia/blob/master/speed_tests/comparing_functions.ipynb

I think you'd be happy to hear that Julia v0.6 will “take transposes seriously” via the type system, making it essentially a free operation. It was a huge thread, culminating in:

I already tried the code in 0.6 and the penalty of the transpose seemed huge.

Less than 24 hours ago?

Downloaded 0.6 dev 3 hours ago… (macOS)

Julia Version 0.6.0-dev.2069
Commit ff9a949 (2017-01-13 02:17 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core™ i7-4650U CPU @ 1.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

But how old is the nightly you used?

https://status.julialang.org/

Most nightlies haven't had an update since then (only COPR has), so if you didn't build it from source or use COPR, you won't have that update. It will say how many days behind master it is at the top of the REPL when you open it up.

That's a day-old master and won't have it. I'd retry when it's in the nightly.

@davidbp is talking about the outer matrix transpose. I don't think the rowvector PR affects that.

EDIT: also, it looks like lr * (...)' gets translated into A_mul_Bc(lr, ...), so I don't think (?) that the outer transpose influences much in your particular code.

Aside from that, also observe how your version uses * while the other uses .*. The * (without the dot) causes broadcast fusion to stop, which will cause the creation of temporary arrays. So really there are a few non-obvious sources of performance penalties.
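A tiny illustration of that fusion boundary (throwaway variables, 0.6 semantics):

    y = zeros(10000); a = 2.0; x = rand(10000); b = rand(10000)

    y .= a .* x .+ b    # one fused broadcast! into y, no temporaries
    y .= a * x + b      # a * x and (a * x) + b each allocate a full temporary array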

The change @ChrisRackauckas is talking about affects the inner vector transposes and so gives an additional performance boost.