Cosine seems slow

I’m finding the cos function slow in Julia compared with Mathematica. For example:

a = RandomReal[NormalDistribution[], {2000, 2000}];
RepeatedTiming[b = Cos[a];]

{0.018, Null}
function test()
       a=randn(2000,2000)
       @time cos.(a)
end
test();

0.172242 seconds (2 allocations: 30.518 MiB, 37.86% gc time)

It looks like Julia is nearly 10x slower than Mathematica. Am I doing something stupid?

The @time macro is not accurate and, in general, includes compile time. One should always use BenchmarkTools.jl for benchmarking

julia> a = randn(2000,2000);

julia> @btime cos.(a);
  53.765 ms (4 allocations: 30.52 MiB)

Not sure why it’s still so slow. It’s likely the case that Mathematica uses MKL by default, so that may be part of it. I also don’t think that currently cos.(a) would do SIMD in Julia, so maybe you can speed it up that way as well.

1 Like

Thank you, with @btime the difference is a factor of 4. So perhaps Mathematica is using 4 cores and Julia just one.

I can’t get @simd to do anything useful, it wants a loop but if I replace the broadcast with an explicit loop it is much slower (1000ms).

Try VML.jl:


Here I get 3X speed up with non-mutating function.

using BenchmarkTools, VML

const a = randn(2000,2000);

# warm up
cos.(a)
VML.cos(a)
julia> @btime cos.(a);
59.888 ms (2 allocations: 30.52 MiB)

julia> @btime VML.cos(a);
21.070 ms (2 allocations: 30.52 MiB)

and more than 7X speed up with mutating function!

julia> @btime VML.cos!(a);
  8.476 ms (0 allocations: 0 bytes)

Just out of curiosity: is the original repository of VML in the JuliaMath organisation abandoned? What’s the reason?

No. We are waiting for the owners to merge our pull request (https://github.com/JuliaMath/VML.jl/pull/17).

BTW, I updated the benchmark data. I get 3X speed up!

Edit: merged now! : https://github.com/JuliaMath/VML.jl

2 Likes
using LoopVectorization, xsimdwrap, BenchmarkTools

function cos_base!(a, b)
    @inbounds for i in eachindex(a, b)
        a[i] = cos(b[i])
    end
end
function cos_sleef!(a, b)
    @vvectorize for i in eachindex(a, b)
        a[i] = cos(b[i])
    end
end
function cos_xsimd!(a, b)
    @xvectorize for i in eachindex(a, b)
        a[i] = cos(b[i])
    end
end

These libraries yield a nice boost:

julia> b = randn(100, 20); # no need to make it so large

julia> a = similar(b);

julia> @benchmark cos_base!($a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.802 μs (0.00% GC)
  median time:      13.126 μs (0.00% GC)
  mean time:        13.512 μs (0.00% GC)
  maximum time:     31.193 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark cos_sleef!($a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.942 μs (0.00% GC)
  median time:      1.953 μs (0.00% GC)
  mean time:        2.001 μs (0.00% GC)
  maximum time:     5.392 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark cos_xsimd!($a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.502 μs (0.00% GC)
  median time:      2.584 μs (0.00% GC)
  mean time:        2.618 μs (0.00% GC)
  maximum time:     6.281 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

EDIT: The 6x speedup doesn’t mean this is faster than VML. The gain will be CPU-specific. This computer will probably see around a 6x boost from VML too.

julia> versioninfo()
Julia Version 1.4.0-DEV.526
Commit 31bf76f8a1* (2019-11-25 22:55 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
5 Likes

If you write the function as non-mutating it would be around 3X

function cos_sleef(inp)
    out = similar(inp)
    @vvectorize for i in eachindex(inp)
        out[i] = cos(inp[i])
    end
    return out
end

Self-mutation:

function cos_sleef_self!(inp)
    @vvectorize for i in eachindex(inp)
        inp[i] = cos(inp[i])
    end
    return inp
end

I tried on my computer using LoopVectorization with non-mutating I get 3X and with mutating I get 6X speedup (which removes memory accessing).

julia> @btime cos.(a);
  59.898 ms (2 allocations: 30.52 MiB)

julia> @btime VML.cos(a);
  21.155 ms (2 allocations: 30.52 MiB)

julia> @btime cos_sleef(a) # non-mutating
  24.875 ms (2 allocations: 30.52 MiB)

julia> @btime cos_sleef!(out,a) # mutating
  12.257 ms (0 allocations: 0 bytes)

julia> @btime cos_sleef_self!(a) # self-mutating
  12.249 ms (0 allocations: 0 bytes)

I couldn’t build xsimdwrap. I got some error.

By the way, why don’t you merge all these packages into a single one? It is kind of confusing :smile: I had to add a bunch of libraries manually to get this working.

julia> @benchmark cos.($a)
BenchmarkTools.Trial: 
  memory estimate:  15.75 KiB
  allocs estimate:  1
  --------------
  minimum time:     14.445 μs (0.00% GC)
  median time:      14.776 μs (0.00% GC)
  mean time:        15.341 μs (0.00% GC)
  maximum time:     102.015 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark cos_sleef($b)
BenchmarkTools.Trial: 
  memory estimate:  15.75 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.982 μs (0.00% GC)
  median time:      2.217 μs (0.00% GC)
  mean time:        3.817 μs (8.40% GC)
  maximum time:     188.719 μs (97.31% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark cos_sleef!($a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.805 μs (0.00% GC)
  median time:      1.858 μs (0.00% GC)
  mean time:        1.875 μs (0.00% GC)
  maximum time:     5.114 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

You forgot to delete the reference to a in eachindex, making it type unstable: for i in eachindex(a, inp) should be for i in eachindex(inp). Mind rerunning the benchmark?

Viral just merged the PR, but I ran into problems installing it (I don’t have MKL.jl installed).

I’m pretty sure my gfortran uses VML, so I’ll test that instead. Compiling

module trig
  use ISO_C_BINDING
  implicit none
contains  
  subroutine vsin(a, b, N) bind(C, name = "vsin")
    integer(c_int64_t), intent(in) :: N
    real(C_double), intent(in), dimension(N) :: b
    real(C_double), intent(out), dimension(N) :: a
    a = sin(b)
  end subroutine vsin
  subroutine vcos(a, b, N) bind(C, name = "vcos")
    integer(c_int64_t), intent(in) :: N
    real(C_double), intent(in), dimension(N) :: b
    real(C_double), intent(out), dimension(N) :: a
    a = cos(b)
  end subroutine vcos
end module trig

with

gfortran -S -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC /home/chriselrod/Documents/progwork/fortran/trig.f90 -o vtrig.s

Yields assembly containing the following loop:

.L4:
	movq	%rbx, %r12
	salq	$6, %r12
	vmovupd	(%r14,%r12), %zmm0
	incq	%rbx
	call	_ZGVeN8v_cos@PLT
	vmovupd	%zmm0, 0(%r13,%r12)
	cmpq	%rbx, %r15
	jne	.L4

Note it is using cos on a zmm (512) bit register.
Producing a shared library with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC /home/chriselrod/Documents/progwork/fortran/trig.f90 -o libvtrig.so

and now

const TRIGFORTRAN = "/home/chriselrod/Documents/progwork/fortran/libvtrig.so";
function cos_gfortran!(a, b)
    ccall(
        (:vcos, TRIGFORTRAN),
        Cvoid, (Ref{Float64},Ref{Float64},Ref{Int}),
        a, b, Ref(length(a))
    )
    a
end
cos_gfortran(a) = cos_gfortran!(similar(a), a)

This yields even better times than I got with SLEEF:

julia> a = randn(100, 20); # no need to make it so large

julia> b = similar(a);

julia> @benchmark cos_gfortran($a)
BenchmarkTools.Trial: 
  memory estimate:  15.75 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.562 μs (0.00% GC)
  median time:      1.811 μs (0.00% GC)
  mean time:        2.251 μs (10.21% GC)
  maximum time:     97.660 μs (92.97% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark cos_gfortran!($b, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.434 μs (0.00% GC)
  median time:      1.463 μs (0.00% GC)
  mean time:        1.499 μs (0.00% GC)
  maximum time:     4.733 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Putting them all together with btimes:

julia> @btime @. $b = cos($a);
  13.181 μs (0 allocations: 0 bytes)

julia> @btime cos_gfortran!($b, $a); # VML?
  1.467 μs (0 allocations: 0 bytes)

julia> @btime cos_sleef!($b, $a);
  2.048 μs (0 allocations: 0 bytes)

julia> @btime cos_xsimd!($b, $a);
  2.504 μs (0 allocations: 0 bytes)

So gcc is the clear winner here. This probably requires recent versions of the compiler.
Given the roughly 9x magnitude in speed difference, I figure I should point out:

julia> all(cos_gfortran(a) .≈ cos.(a))
true

julia> all(cos_gfortran(1000a) .≈ cos.(1000 .* a))
true

Using the same size inputs as sw72:

julia> a = randn(2000,2000);

julia> b = similar(a);

julia> @btime @. $b = cos($a);
  35.750 ms (0 allocations: 0 bytes)

julia> @btime cos_gfortran!($b, $a);
  4.360 ms (0 allocations: 0 bytes)

julia> @btime cos_sleef!($b, $a);
  5.304 ms (0 allocations: 0 bytes)

julia> @btime cos_xsimd!($b, $a);
  6.901 ms (0 allocations: 0 bytes)

julia> all(cos_gfortran(a) .≈ cos.(a))
true

What was the error installing xsimdwrap? You need git and a C++ compiler (it defaults to trying g++, but clang++ should work too).

I think you can answer why xsimdwrap is separated from the others ;).
I did just rebuild it on two different computers, but not everyone gets to enjoy the easy life of running Linux. Such as myself at work. :frowning:

The libraries all do different things, although VectorizationBase could probably be combined with SIMDPirates.
While developing libraries, splitting them up also cuts down on re-precompile times. Eg, VectorizationBase only precompiles when I actually change it, because it doesn’t depend on any of my other libraries.
I plan on registering them all within the next few months, so that it’ll just require ] add ${NAME}. Some of them, in particular LoopVectorization, are going to see some major rework before that happens.

Also, it would be really cool to get gcc’s implementation of these vector functions ported directly into Julia. I wouldn’t mind a GPL’d library.

5 Likes

Oh I missed it. It doesn’t affect the benchmark result though.

@btime cos.(a);
  63.143 ms (2 allocations: 30.52 MiB)
@btime VML.cos(a);
  20.947 ms (2 allocations: 30.52 MiB)
@btime cos_sleef(a)
  24.701 ms (2 allocations: 30.52 MiB)
@btime cos_sleef!(out,a)
  12.283 ms (0 allocations: 0 bytes)
@btime cos_sleef_onearg!(a)
  12.245 ms (0 allocations: 0 bytes)

You should have MKL installed.

I will create an issue there.

We need to integrate some vectorized library into Julia. I am investigating different libraries to see which one is more suitable (from ease of use, performance, license, and compatibility perspective).

1 Like

I seriously doubt

@benchmark similar($a)

will take 12 ms for a 2000x2000 matrix.

julia> @btime cos.($a);
  36.720 ms (2 allocations: 30.52 MiB)

julia> @btime cos_gfortran($a);
  4.438 ms (2 allocations: 30.52 MiB)

julia> @btime cos_sleef($a);
  4.607 ms (2 allocations: 30.52 MiB)

julia> @btime cos_xsimd($a);
  7.084 ms (2 allocations: 30.52 MiB)

I see negligible difference with @btime between the allocating and in place/non-allocating versions at these sizes (note that @btime reports the minimum time; @benchmark will show the median and especially the right-skewed mean runtimes are higher).

You aren’t showing excess allocations though, so it doesn’t look like a type instability; if I saw results like that on my own computer (where I could investigate), I’d try to find out what is going on. That is bothersome, since you shouldn’t see a performance difference like that – the only difference between the two should be the price of similar(a).

I agree. FWIW, SLEEFPirates is the only one any of my other libraries depend on, because musm’s port is pure Julia, meaning there are no installation issues. It is also often the fastest in loops, because it can inline and therefore it can hoist a lot of constant memory loads out of the loop (eg, all the polynomial coefficients). Obviously for cos on my computer it was much slower than gcc's vector library, but some other functions like exp and log are better optimized.

It seems like ideally LLVM would be able to handle that (just like it’s the gcc backend handling vectorization for them), and the @fastmath version of these functions would then allow vectorization. But I can’t seem to find much work on that front since 2016.

2 Likes

Thanks for all the ideas @Amin_Yahyaabadi and @Elrod

I hope something like this gets integrated into Julia - as a newbie coming from languages like Mathematica and Matlab it seems weird to have to load up a special library to get good performance on basic trig functions. Ultimately it would be nice to use the simple syntax cos.(a) and have it be as fast as Mathematica’s Cos[a] or Matlab’s cos(a).

I have created a couple of issue to track this.


Unfortunately, the progress of the main Julia repository is slow in this case.



@Elrod, How com gfortran is so fast?
Wouldn’t it use GCC’s libm as well?

Could it be that the latest libm of GCC do vectorization and caught up with Intel SVML?

By the way, it can’t use VML as VML is for arrays and not elements. It might use SVML but I’m not sure if you have it on your system.

Sorry, conflated the two libraries as one
It is vectorized (acting on 512 bit zmm registers).

Here is the GPLed source of glibc. That specific link is for avx512 sincos, named svml_d_sincos8_core_avx512.S.

The source file names for vectorized math functions are of the form “svml_{datatype}{func}{vector width}{arch}.S”, and I recall reading on Phoronix that Intel contributed a lot of SIMD math code.
Meaning that SVML, or parts of it, have been open sourced and contributed into glibc itself. Hence, why you don’t need to do any special linking.

For what it’s worth, I benchmarked vs clang with fveclib=SVML, linking the SVML I downloaded alongside MKL, and found it to be slower than gcc.
That may be for other reasons than the implementation of the math functions, but I doubt unrolling decisions (gcc doesn’t, clang does here – IMO gcc makes the right call, since I don’t know what the benefit is supposed to be, given the lack of loop dependencies).

I haven’t looked at it too closely, so lots of explanations are still on the table. Maybe Clang defaults to more accurate versions.

EDIT:
Maybe someone could wrap some of the ASM in llvmcalls, and create a GPL-ed Julia library. It would then still basically be pure-Julia / not require any extra dependencies, although CpuId.jl would probably be worthwhile to figure out which of the functions are valid on the given architecture.

Also, VML works!

julia> using VML, BenchmarkTools

julia> b = randn(100, 20); a = similar(b);

julia> @benchmark VML.cos!($a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.397 μs (0.00% GC)
  median time:      1.408 μs (0.00% GC)
  mean time:        1.449 μs (0.00% GC)
  maximum time:     3.563 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> const TRIGFORTRAN = "/home/chriselrod/Documents/progwork/fortran/libvtrig.so";

julia> function cos_gfortran!(a, b)
           ccall(
               (:vcos, TRIGFORTRAN),
               Cvoid, (Ref{Float64},Ref{Float64},Ref{Int}),
               a, b, Ref(length(a))
           )
           a
       end
cos_gfortran! (generic function with 1 method)

julia> cos_gfortran(a) = cos_gfortran!(similar(a), a)
cos_gfortran (generic function with 1 method)

julia> @benchmark cos_gfortran!($b, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.375 μs (0.00% GC)
  median time:      1.393 μs (0.00% GC)
  mean time:        1.432 μs (0.00% GC)
  maximum time:     4.393 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

About as fast as compiling with gcc.

1 Like