JuliaPro 1.0.1.1 is available, but no MKL?

How do we ask them to build the MKL version?


There was a table showing the free and commercial versions, with the latter including MKL. Or so I seem to remember; I can no longer find it.

There seems to be only one version, without MKL (as of JuliaPro 1.0). I don't know for sure why [EDIT: it may not be needed, at least with the latest OpenBLAS; see my third post in this thread for why, and for how to use it]. Maybe it's no longer considered needed now that Julia/OpenBLAS is faster? Or possibly it's just easy to get MKL separately? Before, you needed a non-GPL version of Julia, and now it's close to that or already there. Possibly even if Julia is not yet GPL-free, JuliaPro is (as it's strictly speaking proprietary software built on open source)? In either case, you're always allowed to add proprietary software, e.g. MKL, to your own installation, even alongside GPL code; you're just not allowed to distribute the combined whole afterwards.

You can also download older versions of JuliaPro; possibly one of the MKL builds is still available online, if that works for you temporarily.

The discussion ends like this (and I believe SuiteSparse is no longer a dependency of Julia):

If MKL and OpenBLAS are ABI-compatible, then LD_PRELOAD should do the trick. If they are not ABI-compatible, then one can […] As MKL ships an FFTW-compatible interface, it sounds like SuiteSparse is your only mandatory GPL dependency interfering with MKL binary redistribution.
[…]

JuliaPro-0.6.2.2 – MKL (for Windows) - (762.17M)
JuliaPro-0.6.2.2 – MKL (for Linux) - (1.02G)
JuliaPro-0.6.2.2 – MKL (for Mac) - (2.47G)
JuliaPro-0.6.2.2 – MKL (for Linux) – ASC - (490.00B)

[…]
Sorry to bring up an old thread. I am curious what changed in order for Julia to be able to distribute both an MKL and a non-MKL version?

[…]

Please discuss this on discourse.

I am locking this in order to not discuss this further here, and redirecting folks to discourse.
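
On the LD_PRELOAD suggestion in that quote: a minimal sketch of how you might try it, assuming MKL's single-library interface at its usual Linux path (the path, and whether the ABIs really match, are my assumptions, not claims from the quote):

# Relaunch Julia with MKL preloaded over OpenBLAS; this only works if the
# two libraries are ABI-compatible, as the quote cautions.
withenv("LD_PRELOAD" => "/opt/intel/mkl/lib/intel64/libmkl_rt.so") do
    run(`julia -e 'using LinearAlgebra; @show peakflops()'`)
end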

Possibly some of the googleable solutions are helpful (what applies to 0.7 should also apply to Julia 1.0, and to JuliaPro I think).

OpenBLAS is still incredibly slow compared to MKL on any processor with AVX-512, so it's not that it has improved (although OpenBLAS is actively adding kernels).

I have the impression, however, that MKL is no longer supported. Arpack.jl depends on OpenBLAS even if you use MKL, and many packages depend on Arpack, including Distributions.jl and LightGraphs.jl. Both of these are supported by JuliaPro.


Besides the lack of an MKL build, I encountered problems calling Pkg.add() in JuliaPro 1.0.1.1 (something like "Authentication required") …

It makes me go back to the standard Julia 1.0.1.

I believe there's a workaround, but since you get essentially the same thing that way (and no MKL either way, it seems; you'd need to add it yourself), going back to standard Julia and adding Juno seems like a plan:

https://juliacomputing.com/blog/2018/10/16/juliapro.html

The new JuliaPro releases (based on Julia 1.0) therefore do not bundle packages any more. The downloadable distributions contain only the compiler, the standard library, and the Juno IDE.

Even though the packages are not bundled, JuliaPro users still benefit from a curated set of packages. This is provided through the JuliaPro package registry hosted by Julia Computing. Incidentally, this registry is also used to provide the same supported packages on JuliaBox.

The JuliaPro registry contains a subset of packages from Julia's General registry, but with an additional layer of testing and curation. The list of packages supported by the JuliaPro registry is displayed on the JuliaPro product page. Users can change to the General registry through a manual process.
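
That "manual process" isn't spelled out in the post. A rough sketch of what I believe it amounts to on Julia 1.0 (the JuliaPro registry directory name is my guess; the General registry URL is the real one):

# Swap the curated JuliaPro registry for the General registry by hand.
# DEPOT_PATH[1] is normally ~/.julia; registries live underneath it.
registries = joinpath(DEPOT_PATH[1], "registries")
rm(joinpath(registries, "JuliaPro"); recursive=true, force=true)
run(`git clone https://github.com/JuliaRegistries/General $(joinpath(registries, "General"))`)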

OpenBLAS got AVX-512 support in its latest version, 0.3.3, from August 2018.

It's not bundled with the latest stable Julia 1.0.1, but support was merged 8 days ago, so I expect it in Julia 1.0.2; at the least it should be included in:

https://julialang.org/downloads/nightlies.html

I also think you can use any OpenBLAS you have (and dynamically link it), but I may be wrong about that, or about how easy it is (is OpenBLAS statically linked by default?).
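
You can at least check what your build actually links, e.g. (the filter is just an ad-hoc way to spot the BLAS among the loaded libraries):

using LinearAlgebra, Libdl

LinearAlgebra.BLAS.vendor()  # :openblas64 on a stock binary build

# Loaded shared libraries that look like a BLAS; libopenblas64_ showing up
# here means OpenBLAS is dynamically linked rather than statically.
filter(l -> occursin("blas", lowercase(basename(l))), Libdl.dllist())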

Besides, this was already being discussed a long time ago:

http://www.tomshardware.co.uk/answers/id-3685153/threadripper-support-avx-512-perform-7900x.html

Looking at the Julia source code for AVX-512, I found "HasAVX512" (though strictly speaking in a patch for LLVM, i.e. not directly related to OpenBLAS, so I'm curious what the support amounts to).
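
Unrelated to that LLVM flag, a quick ad-hoc way to check whether your own CPU advertises AVX-512 at all (Linux only, since it just greps the kernel's CPU feature flags):

# True if any core lists an avx512* feature flag in /proc/cpuinfo.
has_avx512() = any(occursin("avx512", line) for line in eachline("/proc/cpuinfo"))
has_avx512()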


Yes, you can build OpenBLAS as a system BLAS and link it. I described how to do it in the opening post, although I've since just let Julia's build system handle all that.

The i9-7900X + MKL is about 4x faster for matrix multiplication than the Threadripper 1950X.
If you want number-crunching power on the CPU and you use optimized libraries or compile your own numerical code for the CPU (probably most folks using Julia), AVX-512 is the way to go.

julia> versioninfo()
Julia Version 1.1.0-DEV.631
Commit 0fde275eff (2018-11-06 16:09 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> using BenchmarkTools, LinearAlgebra, StaticArrays

julia> W = @SMatrix randn(8,8);

julia> X = @SMatrix randn(8,8);

julia> @benchmark $W * $X
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.950 ns (0.00% GC)
  median time:      13.664 ns (0.00% GC)
  mean time:        13.603 ns (0.00% GC)
  maximum time:     46.453 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> C, A, B = randn(5000,5000), randn(5000,5000), randn(5000,5000);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     281.320 ms (0.00% GC)
  median time:      281.859 ms (0.00% GC)
  mean time:        282.021 ms (0.00% GC)
  maximum time:     284.623 ms (0.00% GC)
  --------------
  samples:          18
  evals/sample:     1

vs.
This comparison is unfortunately unfair: I have 10 processes running at 100% on the Threadripper that I will not kill, and I'm likely to start more once they finish.

julia> versioninfo()
Julia Version 1.1.0-DEV.631
Commit 0fde275eff (2018-11-06 16:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)

julia> using BenchmarkTools, LinearAlgebra, StaticArrays

julia> W = @SMatrix randn(8,8);

julia> X = @SMatrix randn(8,8);

julia> @benchmark $W * $X
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.758 ns (0.00% GC)
  median time:      48.259 ns (0.00% GC)
  mean time:        48.567 ns (0.00% GC)
  maximum time:     89.399 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> C, A, B = randn(5000,5000), randn(5000,5000), randn(5000,5000);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.009 s (0.00% GC)
  median time:      2.039 s (0.00% GC)
  mean time:        2.062 s (0.00% GC)
  maximum time:     2.137 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

Unburdened, I think it is closer to 1.2 seconds. If I remember, I'll update it the next time I'm not running other processes.

I say "probably most folks" because:

  1. There's increasing interest in Julia as a generic language.
  2. While AVX-512 greatly increases your CPU's throughput/$, it's not on the same level as a GPU. A Vega 64 can multiply 5000x5000 (single precision) matrices in around 20 ms, vs about 150 and 600 ms (single precision) for the 7900X and 1950X. If you can offload your vectorizable number crunching to your GPU… (see the sketch after this list)
  3. Some code doesn't actually optimize well. Even what should be highly vectorizable Stan models seem similarly fast per core on both CPUs, rather than 2-4x faster on the 7900X. In a less-than-perfect world, most of your time is probably spent running poorly optimized / vectorized code. In that case, more cores are better.
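
As mentioned in point 2, a hypothetical sketch of such an offload, using the 2018-era CuArrays.jl (an NVIDIA-only stack, so not what produced the Vega 64 number; the package choice and sizes are my assumptions):

using CuArrays, LinearAlgebra

# Multiply two 5000x5000 single-precision matrices on the GPU.
A = CuArray(randn(Float32, 5000, 5000))
B = CuArray(randn(Float32, 5000, 5000))
C = similar(A)

mul!(C, A, B)                      # warm-up (compilation + CUBLAS init)
@time (mul!(C, A, B); collect(C))  # copying back forces the GPU to finish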

One day I'll learn how to write optimized code for the GPU! But not yet. Maybe after Julia 1.x starts supporting AMD graphics cards.
