Julia slower than Matlab & Python? No

It works now. Both “Simple” & “Fastest” are much faster, especially for bigger orders.

This would be great, even if it started with baby steps. It would mean that we could respond to performance issues with a single macro to get started, which is especially useful for dabblers in the language. For example, adding @inbounds to every loop generally never hurts (it can make things crash, but so would manually adding it to every loop), so why not start with that and build things up?
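As a rough sketch of what the manual version looks like today (mysum here is just a made-up example; the idea is that a macro would insert the @inbounds for you):

function mysum(x::AbstractVector{Float64})
    s = 0.0
    @inbounds for i in eachindex(x)  # skip bounds checks; safe here because eachindex only yields valid indices
        s += x[i]
    end
    return s
end

mysum(rand(10^6))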

The OpenBLAS fix or the MKL one?

Not sure which one.
First I ran the BLAS fix & it was faster.
Then I checked & MKL now works & it's faster.

1 Like

Great! Is the speedup significant (e.g. > 20%)?

Eventually it would be nice to see whether the patched OpenBLAS is comparable to MKL here.

For option pricing, for big orders it's 30-40% faster than without LinearAlgebra.BLAS.set_num_threads(n).

1 Like

FWIW, on my machine, OpenBLAS:

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas64

julia> LinearAlgebra.peakflops(16_000)
4.478966395538375e11

julia> BLAS.set_num_threads(18)

julia> LinearAlgebra.peakflops(16_000)
8.023464806960204e11

vs MKL:

julia> using LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> LinearAlgebra.peakflops(16_000)
2.1343568002173381e12

julia> BLAS.set_num_threads(18)

julia> LinearAlgebra.peakflops(16_000)
2.0939895828215435e12

julia> versioninfo()
Julia Version 1.5.0-DEV.89
Commit 5c75bec530* (2020-01-18 11:56 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.0 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 18

MKL seems around 2.6 times faster.
FWIW, I set the AVX-512 all-core clock speed to 4.1 GHz in the BIOS.
This means the CPU’s theoretical peak flops should be

julia> num_cores = 18;

julia> clock = 4.1e9;

julia> fma_per_clock = 2;

julia> flop_per_fma = 16; # avx512, double precision: 8 mul, 8 add

julia> num_cores * clock * fma_per_clock * flop_per_fma
2.3616e12

MKL gets fairly close. OpenBLAS does not.

EDIT:
I’m going to have to take a good look at ModelingToolkit.
It looks interesting. I wonder if I can take advantage of it in my own work.

4 Likes

This is getting spicier than I like, but I’m going to continue because even after reading yet more on the subject, I don’t understand how ModelingToolkit is relevant to the sort of problems that the original post in this gigantic thread is about. Consider this file: https://github.com/vduarte/benchmarkingML/blob/master/Sovereign_Default/julia.jl and pretend we don’t know the size of the input data https://github.com/vduarte/benchmarkingML/blob/master/Sovereign_Default/julia.jl#L4 and https://github.com/vduarte/benchmarkingML/blob/master/Sovereign_Default/julia.jl#L5 ahead of time.

How would one use ModelingToolkit to build up a graph and optimize the array operations in that code? Insofar as ModelingToolkit can understand array operations, it is concerned with the elements of the arrays, not the arrays themselves as entities. Variables in ModelingToolkit are <: Number. We can do things like @variables x[1:N, 1:M] to make an N x M matrix of variables, but if you tried to push, say, a 256 x 256 matrix of Variables through the above function, it would take an inordinate amount of time, let alone 10_000 x 10_000. With ModelingToolkit, the time it takes to trace matrix code scales with the size of the matrices. In TensorFlow or PyTorch, it wouldn’t (as far as I understand).

As a very very cut down example, consider:

f(A, X) = A .* X                 # the array operation we would like to trace
using ModelingToolkit
@variables X[1:5000, 1:5000]     # 25 million scalar symbolic variables
A = rand(size(X)...)
f(A, X)                          # traces element by element; impractically slow

It’d be great if ModelingToolkit could do things like @variables A::Matrix X::Matrix where the size of the matrix is undetermined and then f(A, X) would just give something loosely like Expression(.*, A, X). Then I see how we could usefully trace through the above code and get optimizations, but it doesn’t seem like ModelingToolkit is in the business of doing those sorts of operations yet.

5 Likes

Yup, the current behavior is a bug that should get fixed.

5 Likes

The paper is now published: Benchmarking machine-learning software and hardware for quantitative economics - ScienceDirect
I think you did a great job in this thread and should consider writing a comment on that paper with all the points of criticism that emerged while rewriting the code.

9 Likes

This is arguably the best thread I have come across on Discourse! One year late, but never too late. Time to start optimizing Julia code. :smile: :small_airplane:

2 Likes

I wonder whether this code can be optimized further now. LoopVectorization.jl has moved forward, and there is Tullio.jl, which can probably help with the multithreading issues.
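As a rough sketch of the kind of thing I mean (just an illustrative matrix multiply with Tullio.jl, not the benchmark code itself):

using Tullio, LoopVectorization

A = rand(500, 500); B = rand(500, 500);

# Tullio generates the loops from the index expression, uses LoopVectorization
# on the inner loop when it is loaded, and splits the work across Julia threads
# for large enough arrays.
@tullio C[i, j] := A[i, k] * B[k, j]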

4 Likes

11 posts were split to a new topic: BLAS thread count vs Julia thread count

Has this changed much?
Does more of Base take advantage of multithreading?
Does broadcasting use threads?

Not yet. Broadcasting in particular will take quite a bit of work.

2 Likes

I’d imagine the overhead of “check whether this broadcast is worth multithreading” would slow down the many small broadcasts everywhere by a (relatively) significant amount.

1 Like

I think the immediate goal is for it to be explicit, e.g. something like @threads y .= foo.(bar.(x)). That way the caller can decide (by benchmarking) whether it is worth it or not, in the same way that you might decide whether to use @threads for.
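Until such a macro exists, something along these lines can already be written by hand (foo and bar are placeholders for the real kernels):

using Base.Threads

foo(v) = v + 1.0   # placeholder
bar(v) = 2.0 * v   # placeholder

x = rand(10^7)
y = similar(x)

# hand-rolled stand-in for a threaded y .= foo.(bar.(x)):
# each Julia thread fills its own chunk of the index range
@threads for i in eachindex(x, y)
    @inbounds y[i] = foo(bar(x[i]))
end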

11 Likes

That’s really not the part that I’m worried about — that’s as simple as a few type/length checks, which are really quick, especially if the kernels are outlined. The challenge is the “check whether this broadcast is safe to multithread” part, accounting for side effects, aliasing, crazy data structures, and more. Making it explicit is exactly the right first step.

6 Likes

Doesn’t https://github.com/Jutho/Strided.jl do this already?
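For reference, this is roughly the usage I have in mind (sin.(2 .* x) is just a stand-in for the real broadcast, and Julia needs to be started with multiple threads):

using Strided

x = rand(10^6)
y = similar(x)

# @strided wraps the arrays so that the broadcast is evaluated in parallel
# across the available Julia threads
@strided y .= sin.(2 .* x)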

This looks pretty doable; do we have a PR/WIP somewhere already?

2 Likes