Dot function


#1

I’m looking at some Numpy code and try to convert it to Julia for performance comparison. (If you’re interested, it’s the logistic regression example from Numba project.)

I realized that Numpy’s dot function possess a few different behaviors (e.g. passing 2-D matrices gives matrix multiplication result, passing N-D matrix and 1-D matrix gives sum product results, etc.) Julia’s dot function, on the other hand, is extremely simple and straightforward.

Does anyone know how Numpy get to this state? Is it more convenient/useful for specific use cases? Just curious…

Reference:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html#numpy.dot


#2

Not related to your question, but since both Julia and numpy use the same BLAS libraries, you’re unlikely to see much difference in your performance comparison unless your code spends a lot of time OUTSIDE of the linear algebra code. (Unless they’re using different BLAS libraries and/or the libraries were compiled with different optimizations…)

Cheers,
Kevin


#3

Going off topic:

I’m trying to do a reality check since the last epic discussion about “is numba good enough”…

My test result for this example shows Julia being 2-3x slower. Maybe the code wasn’t translated in a good way? (data set: https://archive.ics.uci.edu/ml/datasets/Wine+Quality)

Python/Numba:

In [64]: @numba.jit
    ...: def logistic_regression3(Y, X, w, iterations):
    ...:     for i in range(iterations):
    ...:         w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
    ...:     return w
    ...: 

In [65]: %timeit logistic_regression3(Y2, X2, w, 1000)
22.3 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [66]: Y.shape
Out[66]: (1599,)

In [67]: X.shape
Out[67]: (1599, 2)

In [68]: w.shape
Out[68]: (2,)

Julia:

using CSV, DataFrames
df = CSV.read("/Users/tomkwong/Downloads/winequality-red.csv", 
        delim=";", types=Dict(6=>Float64, 7=>Float64))
X = convert(Array, df[[Symbol("fixed acidity"), Symbol("volatile acidity")]])
Y = df[:quality]
w = ones(size(X)[2])

function logistic_regression(Y, X, w, iterations)
    for i in 1:iterations
        w -= (((1.0 ./ (1.0 .+ e .^ (-Y .* (X * w))) - 1.0) .* Y)' * X)'
    end
    w
end

using BenchmarkTools
@benchmark logistic_regression($Y, $X, $w, 1000)

Julia results:

julia> @benchmark logistic_regression($Y, $X, $w, 1000)
BenchmarkTools.Trial: 
  memory estimate:  86.49 MiB
  allocs estimate:  9000
  --------------
  minimum time:     60.454 ms (9.32% GC)
  median time:      68.327 ms (14.48% GC)
  mean time:        68.327 ms (14.23% GC)
  maximum time:     80.246 ms (13.99% GC)
  --------------
  samples:          74
  evals/sample:     1

Keeping track of simple vs high-performance versions
#4

Don’t allocate. Just cache the multiplications and do them in place. And don’t transpose so much, make the data structure correct.


#5

On my system, profiling shows most of the time in calls to the exponentiation function. I think Numpy passes this off to MKL, but Julia/LLVM (at least in v0.6) uses a serial version.

If any developers are listening, could they say whether the version of LLVM to be used in v0.7/v1.0 is likely to use a vector math library? I think @vchuravy may have been tracking this?


#6

You can swap this out yourself with VML.jl.

It might be worth trying @fastmath here as well, or an alternative math library like Sleef.jl


#7

Numpy’s dot is more like plain old “*”. Ie,

u = randn(10);
v = rand(10);
u' * v #scalar
A = randn(10,10);
B = rand(10,10);
A * B #matrix
A * u #column vector
v' * B #row vector

I don’t have that csv. If you could create an example that we could copy and paste, it would be easier to try suggestions instead of throwing ideas out there.
I’d remove the allocations, like Chris suggested.

I didn’t feel like generating data to test this, so no idea if it actually works. [EDIT: Fixed a couple things.]

function logistic_regression(Y, X, w, iterations, step = -1.0)
    Xw = similar(w, size(X,1))
    buffer = similar(w, length(Y));
    for i in 1:iterations
        A_mul_B!(Xw, X, w)
        @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
        Base.LinAlg.BLAS.gemv!('T', 1.0, X, buffer, step, w)
    end
    w
end

I’m sure someone can come up with something a lot slicker. Ie, without an overt BLAS call.

EDIT:
What is the feasibility of having C .= A * B lower to mul!(C, A, B) and C .+= A * B to the appropriate gemm!-type call?
I remember trying earlier and seeing allocations.


#8

Of course, but it’s painful to manage the necessary temporary arrays oneself.

This looks intriguing, but doesn’t it make the compiler do an awful lot of redundant work (assuming broadcast gets turned into inline SIMD)?


#9

That would be a little too magical. .= always lowers to a broadcast! (well, broadcast_c!) call. Having it not do that in a few cases would break the ability to accurately guess what Julia is doing (making it harder to overload behaviors). Also, you get edge cases like, what do you want C .+= A * B + D to mean? That starts getting into the realm of macros, and SugarBlas.jl or InPlaceOps.jl are great for this.


#10

The CSV file is here:

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv


#11

Well you have to benchmark anyways so the numbers are consistent :stuck_out_tongue:


#12

Hmm, not much of an improvement.

julia> using CSV, DataFrames, BenchmarkTools

julia> df = CSV.read("/home/celrod/Documents/miscwork/winequality-red.csv", 
               delim=";", types=Dict(6=>Float64, 7=>Float64));

julia> X = convert(Array, df[[Symbol("fixed acidity"), Symbol("volatile acidity")]]);

julia> Y = df[:quality];

julia> w = ones(size(X)[2]);

julia> function logistic_regression(Y, X, w, iterations)
           for i in 1:iterations
               w -= (((1.0 ./ (1.0 .+ e .^ (-Y .* (X * w))) - 1.0) .* Y)' * X)'
           end
           w
       end
logistic_regression (generic function with 1 method)

julia> @benchmark logistic_regression($Y, $X, $w, 1000)
BenchmarkTools.Trial: 
  memory estimate:  61.83 MiB
  allocs estimate:  7000
  --------------
  minimum time:     55.850 ms (6.08% GC)
  median time:      57.943 ms (8.85% GC)
  mean time:        57.850 ms (8.42% GC)
  maximum time:     59.761 ms (9.55% GC)
  --------------
  samples:          87
  evals/sample:     1

julia> function logistic_regression!(w, Y, X, iterations, step = -1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
               Base.LinAlg.BLAS.gemv!('T', 1.0, X, buffer, step, w)
           end
           w
       end
logistic_regression! (generic function with 2 methods)

julia> @benchmark logistic_regression!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     29.327 ms (0.00% GC)
  median time:      30.318 ms (0.00% GC)
  mean time:        31.095 ms (0.00% GC)
  maximum time:     47.457 ms (0.00% GC)
  --------------
  samples:          161
  evals/sample:     1

julia> function fmlogistic_regression!(w, Y, X, iterations, step = -1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @fastmath @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
               Base.LinAlg.BLAS.gemv!('T', 1.0, X, buffer, step, w)
           end
           w
       end
fmlogistic_regression! (generic function with 2 methods)

julia> @benchmark fmlogistic_regression!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     29.347 ms (0.00% GC)
  median time:      31.733 ms (0.00% GC)
  mean time:        32.509 ms (0.00% GC)
  maximum time:     49.867 ms (0.00% GC)
  --------------
  samples:          154
  evals/sample:     1

julia> function fmlogistic_regression2!(w, Y, X, iterations, step = -1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = ((1.0 / (1.0 + Base.FastMath.exp_fast(-Y * Xw)) - 1.0) * Y)
               Base.LinAlg.BLAS.gemv!('T', 1.0, X, buffer, step, w)
           end
           w
       end
fmlogistic_regression2! (generic function with 2 methods)

julia> @benchmark fmlogistic_regression2!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     29.401 ms (0.00% GC)
  median time:      30.258 ms (0.00% GC)
  mean time:        30.968 ms (0.00% GC)
  maximum time:     47.596 ms (0.00% GC)
  --------------
  samples:          162
  evals/sample:     1

julia> using SLEEF

julia> function slogistic_regression!(w, Y, X, iterations, step = -1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = ((1.0 / (1.0 + SLEEF.exp(-Y * Xw)) - 1.0) * Y)
               Base.LinAlg.BLAS.gemv!('T', 1.0, X, buffer, step, w)
           end
           w
       end
slogistic_regression! (generic function with 2 methods)

julia> @benchmark slogistic_regression!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     55.602 ms (0.00% GC)
  median time:      59.753 ms (0.00% GC)
  mean time:        60.275 ms (0.00% GC)
  maximum time:     79.889 ms (0.00% GC)
  --------------
  samples:          83
  evals/sample:     1

julia> using Yeppp

julia> function logistic_regressiony!(w, Y, X, iterations, step = -1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = - Y * Xw
               Yeppp.exp!(buffer)
               @. buffer = ((1.0 / (1.0 + buffer) - 1.0) * Y)
               Base.LinAlg.BLAS.gemv!('T', 1.0, X, buffer, step, w)
           end
           w
       end
logistic_regressiony! (generic function with 2 methods)

julia> @benchmark logistic_regressiony!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     26.766 ms (0.00% GC)
  median time:      27.792 ms (0.00% GC)
  mean time:        27.980 ms (0.00% GC)
  maximum time:     31.109 ms (0.00% GC)
  --------------
  samples:          179
  evals/sample:     1

Hmm…not much of an improvement. And plain old @fastmath continues to disappoint.

Hmm, especially now that I have to start adding Base.LinAlg there’s little reason not to take advantage of great stuff like that.


#13

Actually, the thought is that “hey, Julia is so great that you can write things naturally and it just performs well by default”. If I’m messing with the code to optimize it then it becomes uglier and defeats the purpose. You see, the numpy version is quite clean and it performs well.

So what’s the discrepancy? I saw the note below from the Julia manual - Differences from Python. If I replace w with w[:] in the code above, it doesn’t seem to do much difference.

Julia’s updating operators (e.g. +=, -=, …) are not in-place whereas NumPy’s are. This means A = ones(4); B = A; B += 3 doesn’t change values in A, it rather rebinds the name B to the result of the right- hand side B = B + 3, which is a new array. Use B[:] += 3, explicit loops, or InplaceOps.jl.


#14

I got a factor of 3 speedup with a vector math call for exp. (I used Yeppp.jl because I was too lazy to perform the incantations needed for MKL/VML.) This approach seems to require yet another buffer array.


#15

Actually, I would say that this is a bit of a myth that Julia people are somethings guilty of spreading. No language is going to be as pretty as Python when you’re doing what Python and the package devs intended you to do (like your above example).

However, when you start going off the main charted course and you’re in a land where the language and ecosystem are not supportive then good luck optimizing Python.

Put it this way: Basic, naïve Julia code is not going to be a magic bullet of elegant syntax and blazing speed. However, it has huge potential for optimization and you can really go crazy making something in Julia insanely performant once you know what you’re doing.

This is a feature that’s likely not too useful to beginner programmers but is essential to package developers and is part of why Julia has such a great package ecosystem for such a young language. If there is a package out there that is designed for what you want to do then you’re much more likely to be able to write beautiful code that performs well by default.


#16

Uhh… that example above writes np.dot instead of * for multiplication. That’s objectively less pretty, unless you prefer writing np.dot in your math homeworks. This isn’t the only case where Python needs a bunch of extra characters and “programming” to do math. Examples are abound for this:

https://cheatsheets.quantecon.org/

Python is pretty much always more verbose and further from math when writing mathematical algorithms. Python is fine, but there no need for some myth that NumPy syntax is not verbose.


#17

But, apparently the time has changed, and 1) Python has a large ecosystem and 2) Numba seems to work great (at least for the examples that I’ve seen). A simple 4-character decorator @jit seems to optimize away most of the performance problem. It’s however my naive observation – I am eager to learn if anyone actually uses Numba successfully for non-trivial use cases in production.


#18

That’s a good point, but for performance reasons, if I have to turn a one-line code into a for-loop or pre-allocate an array then Julia isn’t as pretty as it looks before…


#19

I got a nearly 2x speed up from removing allocations on the data set provided. 57 ms -> 30 ms.
Using Yeppp reduced it further to 12 ms. I updated my earlier comment.

Unfortunately, I’m running into the closure bug when trying to using Threads.@threads.

I think Julia does a good job enabling DSLs via macros and multiple dispatch, so that we can get to the point where you can just write pretty math, add the appropriate macro, and get high performance code.

Julia smooths the transition from using pretty packages with nice APIs, to actually being able to extend or develop them.
And while it enables them, those slick interfaces haven’t all been created yet.


#20

I searched, and then when I didn’t find any I asked around.

The answer is no, there really isn’t any. I also asked in chat rooms. I couldn’t find any and when I asked people who used Numba they didn’t know of any either. I found some Cython usage though. Numba is a pretty rough dependency, and things don’t compile/optimize together like they do in Julia (especially when considering interprocedural optimization of v0.7), so it doesn’t seem like it makes sense for package development, but it does really well for scripts.

Numba is more like ParallelAccelerator.jl than it is like Julia. It’s a DSL that speeds up some numerical codes by making a lot of assumptions about a constrained form and accelerating it. That can be very helpful, but just like ParallelAccelerator.jl its harder to build off of when you have less control and composability.

So if you’re looking at package development, it’s really Julia vs Cython vs wrapped C++ code (in R/MATLAB/Python/Julia/etc.)