I’m looking at some Numpy code and trying to convert it to Julia for a performance comparison. (If you’re interested, it’s the logistic regression example from the Numba project.)

I realized that Numpy’s dot function has a few different behaviors (e.g. passing two 2-D arrays gives a matrix multiplication result, passing an N-D array and a 1-D array gives a sum product over the last axis, etc.). Julia’s dot function, on the other hand, is extremely simple and straightforward.
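To illustrate what I mean, here is a quick sketch of how np.dot changes meaning depending on the argument shapes (the shapes here are made up just for illustration):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)       # 2-D
B = np.arange(12).reshape(3, 4)      # 2-D
v = np.array([1.0, 2.0, 3.0])        # 1-D
T = np.arange(24).reshape(2, 4, 3)   # 3-D

np.dot(A, B).shape   # (2, 4): matrix multiplication
np.dot(A, v).shape   # (2,): matrix-vector product
np.dot(v, v)         # 14.0: inner product of two 1-D arrays
np.dot(T, v).shape   # (2, 4): sum product over the last axis of T
```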

Does anyone know how Numpy got to this state? Is it more convenient/useful for specific use cases? Just curious…

Not related to your question, but since both Julia and numpy use the same BLAS libraries, you’re unlikely to see much difference in your performance comparison unless your code spends a lot of time OUTSIDE of the linear algebra code. (Unless they’re using different BLAS libraries and/or the libraries were compiled with different optimizations…)

In [64]: @numba.jit
    ...: def logistic_regression3(Y, X, w, iterations):
    ...:     for i in range(iterations):
    ...:         w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
    ...:     return w
    ...:
In [65]: %timeit logistic_regression3(Y2, X2, w, 1000)
22.3 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [66]: Y.shape
Out[66]: (1599,)
In [67]: X.shape
Out[67]: (1599, 2)
In [68]: w.shape
Out[68]: (2,)

Julia:

using CSV, DataFrames
df = CSV.read("/Users/tomkwong/Downloads/winequality-red.csv",
              delim=";", types=Dict(6=>Float64, 7=>Float64))
X = convert(Array, df[[Symbol("fixed acidity"), Symbol("volatile acidity")]])
Y = df[:quality]
w = ones(size(X, 2))

function logistic_regression(Y, X, w, iterations)
    for i in 1:iterations
        w -= (((1.0 ./ (1.0 .+ e .^ (-Y .* (X * w))) - 1.0) .* Y)' * X)'
    end
    w
end

using BenchmarkTools
@benchmark logistic_regression($Y, $X, $w, 1000)

Julia results:

julia> @benchmark logistic_regression($Y, $X, $w, 1000)
BenchmarkTools.Trial:
  memory estimate:  86.49 MiB
  allocs estimate:  9000
  --------------
  minimum time:     60.454 ms (9.32% GC)
  median time:      68.327 ms (14.48% GC)
  mean time:        68.327 ms (14.23% GC)
  maximum time:     80.246 ms (13.99% GC)
  --------------
  samples:          74
  evals/sample:     1

On my system, profiling shows most of the time in calls to the exponentiation function. I think Numpy passes this off to MKL, but Julia/LLVM (at least in v0.6) uses a serial version.

If any developers are listening, could they say whether the version of LLVM to be used in v0.7/v1.0 is likely to use a vector math library? I think @vchuravy may have been tracking this?

u = randn(10);
v = rand(10);
u' * v   # scalar
A = randn(10,10);
B = rand(10,10);
A * B    # matrix
A * u    # column vector
v' * B   # row vector

I don’t have that csv. If you could create an example that we could copy and paste, it would be easier to try suggestions instead of throwing ideas out there.
I’d remove the allocations, like Chris suggested.

I didn’t feel like generating data to test this, so no idea if it actually works. [EDIT: Fixed a couple things.]

function logistic_regression(Y, X, w, iterations, step = -1.0)
    Xw = similar(w, size(X, 1))
    buffer = similar(w, length(Y))
    for i in 1:iterations
        A_mul_B!(Xw, X, w)  # Xw = X * w, in place
        @. buffer = (1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y
        # gemv! computes w = alpha * X' * buffer + beta * w,
        # so alpha = step and beta = 1.0 gives w -= X' * buffer
        Base.LinAlg.BLAS.gemv!('T', step, X, buffer, 1.0, w)
    end
    w
end
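For what it’s worth, the same buffer-reuse trick can be written on the NumPy side with the out= arguments of np.dot and the ufuncs. This is just a sketch I’m adding for comparison, not code from the original benchmark:

```python
import numpy as np

def logistic_regression_buffered(Y, X, w, iterations):
    # Preallocate the two work buffers once, reuse them every iteration.
    Xw = np.empty(X.shape[0])
    buf = np.empty(Y.shape[0])
    for _ in range(iterations):
        np.dot(X, w, out=Xw)          # Xw = X @ w, no fresh allocation
        np.multiply(Y, Xw, out=buf)   # buf = Y * Xw
        np.negative(buf, out=buf)
        np.exp(buf, out=buf)          # buf = exp(-Y * Xw)
        buf += 1.0
        np.reciprocal(buf, out=buf)   # buf = 1 / (1 + exp(-Y * Xw))
        buf -= 1.0
        buf *= Y
        w -= buf @ X                  # the gradient vector still allocates
    return w
```

It gives the same result as the naive one-liner; the ufunc chain is uglier, which is rather the point being argued in this thread.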

I’m sure someone can come up with something a lot slicker, i.e. without an overt BLAS call.

EDIT:
What is the feasibility of having C .= A * B lower to mul!(C, A, B) and C .+= A * B to the appropriate gemm!-type call?
I remember trying earlier and seeing allocations.

That would be a little too magical. .= always lowers to a broadcast! (well, broadcast_c!) call. Having it not do that in a few cases would break the ability to accurately guess what Julia is doing (making it harder to overload behaviors). Also, you get edge cases like, what do you want C .+= A * B + D to mean? That starts getting into the realm of macros, and SugarBlas.jl or InPlaceOps.jl are great for this.
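For comparison, NumPy draws the same line: C = A @ B allocates a new result array every time, and writing into an existing array is a separate, explicit call rather than something the assignment syntax infers. A small sketch (my addition, not from the thread):

```python
import numpy as np

A = np.random.rand(3, 3)
B = np.random.rand(3, 3)
C = np.empty((3, 3))

C2 = A @ B              # allocates a fresh result array
np.matmul(A, B, out=C)  # writes the product into the preallocated C
```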

Actually, the thought is that “hey, Julia is so great that you can write things naturally and it just performs well by default”. If I’m messing with the code to optimize it then it becomes uglier and defeats the purpose. You see, the numpy version is quite clean and it performs well.

So what’s the discrepancy? I saw the note below from the Julia manual - Differences from Python. If I replace w with w[:] in the code above, it doesn’t seem to make much difference.

Julia’s updating operators (e.g. +=, -=, …) are not in-place whereas NumPy’s are. This means A = ones(4); B = A; B += 3 doesn’t change values in A, it rather rebinds the name B to the result of the right-hand side B = B + 3, which is a new array. Use B[:] += 3, explicit loops, or InplaceOps.jl.
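The NumPy half of that quote is easy to check: += really does mutate the shared buffer, so an aliased name sees the change, while plain = rebinds to a new array:

```python
import numpy as np

A = np.ones(4)
B = A       # B and A are two names for the same array
B += 3      # in-place: mutates the shared buffer
# A and B are now both [4., 4., 4., 4.]

B = B + 3   # allocates a new array and rebinds B; A is unchanged
# A is still [4., 4., 4., 4.], B is [7., 7., 7., 7.]
```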

I got a factor of 3 speedup with a vector math call for exp. (I used Yeppp.jl because I was too lazy to perform the incantations needed for MKL/VML.) This approach seems to require yet another buffer array.

Actually, I would say that this is a bit of a myth that Julia people are sometimes guilty of spreading. No language is going to be as pretty as Python when you’re doing what Python and the package devs intended you to do (like your above example).

However, when you start going off the main charted course and you’re in a land where the language and ecosystem are not supportive then good luck optimizing Python.

Put it this way: Basic, naïve Julia code is not going to be a magic bullet of elegant syntax and blazing speed. However, it has huge potential for optimization and you can really go crazy making something in Julia insanely performant once you know what you’re doing.

This is a feature that’s likely not too useful to beginner programmers but is essential to package developers and is part of why Julia has such a great package ecosystem for such a young language. If there is a package out there that is designed for what you want to do then you’re much more likely to be able to write beautiful code that performs well by default.

Uhh… that example above writes np.dot instead of * for multiplication. That’s objectively less pretty, unless you prefer writing np.dot in your math homework. This isn’t the only case where Python needs a bunch of extra characters and “programming” to do math. Examples abound:

Python is pretty much always more verbose and further from the math when writing mathematical algorithms. Python is fine, but there’s no need for the myth that NumPy syntax isn’t verbose.

But apparently times have changed: 1) Python has a large ecosystem and 2) Numba seems to work great (at least for the examples that I’ve seen). A simple 4-character decorator @jit seems to optimize away most of the performance problem. That’s just my naive observation, however – I am eager to learn if anyone actually uses Numba successfully for non-trivial use cases in production.

That’s a good point, but if for performance reasons I have to turn a one-liner into a for loop or pre-allocate an array, then Julia isn’t as pretty as it looked before…

I got a nearly 2x speed up from removing allocations on the data set provided. 57 ms -> 30 ms.
Using Yeppp reduced it further to 12 ms. I updated my earlier comment.

Unfortunately, I’m running into the closure bug when trying to use Threads.@threads.

I think Julia does a good job enabling DSLs via macros and multiple dispatch, so that we can get to the point where you can just write pretty math, add the appropriate macro, and get high performance code.

Julia smooths the transition from using pretty packages with nice APIs, to actually being able to extend or develop them.
And while it enables them, those slick interfaces haven’t all been created yet.

I searched, and then when I didn’t find any I asked around.

The answer is no, there really isn’t any. I also asked in chat rooms. I couldn’t find any and when I asked people who used Numba they didn’t know of any either. I found some Cython usage though. Numba is a pretty rough dependency, and things don’t compile/optimize together like they do in Julia (especially when considering interprocedural optimization of v0.7), so it doesn’t seem like it makes sense for package development, but it does really well for scripts.

Numba is more like ParallelAccelerator.jl than it is like Julia. It’s a DSL that speeds up some numerical codes by making a lot of assumptions about a constrained form and accelerating it. That can be very helpful, but just like ParallelAccelerator.jl it’s harder to build off of when you have less control and composability.

So if you’re looking at package development, it’s really Julia vs Cython vs wrapped C++ code (in R/MATLAB/Python/Julia/etc.)