Cosine seems slow

Sorry, I conflated the two libraries.
It is vectorized (acting on 512 bit zmm registers).

Here is the LGPL-licensed source of glibc. That specific link is for the AVX-512 sincos, named svml_d_sincos8_core_avx512.S.

The source file names for vectorized math functions are of the form “svml_{datatype}_{func}{vector width}_core_{arch}.S”, and I recall reading on Phoronix that Intel contributed a lot of SIMD math code.
That means SVML, or parts of it, has been open-sourced and contributed to glibc itself, which is why you don’t need to do any special linking.

For what it’s worth, I benchmarked against clang with -fveclib=SVML, linking the SVML I downloaded alongside MKL, and found it to be slower than gcc.
That may be for reasons other than the implementation of the math functions, but I doubt the unrolling decisions explain it (gcc doesn’t unroll here, clang does; IMO gcc makes the right call, since I don’t know what the benefit is supposed to be, given the lack of loop-carried dependencies).
https://github.com/JuliaMath/VML.jl/issues/22#issuecomment-558876268

I haven’t looked at it too closely, so lots of explanations are still on the table. Maybe Clang defaults to more accurate versions.

EDIT:
Maybe someone could wrap some of the ASM in llvmcalls, and create a GPL-ed Julia library. It would then still basically be pure-Julia / not require any extra dependencies, although CpuId.jl would probably be worthwhile to figure out which of the functions are valid on the given architecture.
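
To make the "which functions are valid on the given architecture" part concrete, here is a rough sketch of picking a function name by instruction set. As far as I know, the symbols glibc's libmvec exports follow the x86-64 vector function ABI mangling (`_ZGV`, an ISA letter, `N`, the lane count, then one `v` per vector argument). `libmvec_symbol` and the two tables are names I made up for illustration; the `arch` argument is assumed to come from a feature query such as CpuId.jl, and the sketch only covers single-argument functions like cos (sincos and pow have different argument signatures).

```julia
# ISA letters used in the x86-64 vector function ABI name mangling.
const ISA_LETTER = Dict(:sse => 'b', :avx => 'c', :avx2 => 'd', :avx512 => 'e')
# Vector register width in bits for each instruction set.
const ISA_BITS = Dict(:sse => 128, :avx => 256, :avx2 => 256, :avx512 => 512)

# Hypothetical helper: build the mangled name of a one-argument vector
# math function for the given instruction set and element type.
function libmvec_symbol(func::AbstractString, arch::Symbol, ::Type{T}) where {T<:Union{Float32,Float64}}
    lanes = ISA_BITS[arch] ÷ (8 * sizeof(T))   # elements per register
    suffix = T === Float32 ? "f" : ""          # C naming: cosf, sinf, ...
    return string("_ZGV", ISA_LETTER[arch], "N", lanes, "v_", func, suffix)
end

libmvec_symbol("cos", :avx512, Float64)  # "_ZGVeN8v_cos"
libmvec_symbol("cos", :avx2, Float64)    # "_ZGVdN4v_cos"
```

This only builds the name; actually calling the function still needs the arguments passed in vector registers, which is where llvmcall would come in.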

Also, VML works!

julia> using VML, BenchmarkTools

julia> b = randn(100, 20); a = similar(b);

julia> @benchmark VML.cos!($a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.397 μs (0.00% GC)
  median time:      1.408 μs (0.00% GC)
  mean time:        1.449 μs (0.00% GC)
  maximum time:     3.563 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> const TRIGFORTRAN = "/home/chriselrod/Documents/progwork/fortran/libvtrig.so";

julia> function cos_gfortran!(a, b)
           ccall(
               (:vcos, TRIGFORTRAN),
               Cvoid, (Ref{Float64},Ref{Float64},Ref{Int}),
               a, b, Ref(length(a))
           )
           a
       end
cos_gfortran! (generic function with 1 method)

julia> cos_gfortran(a) = cos_gfortran!(similar(a), a)
cos_gfortran (generic function with 1 method)

julia> @benchmark cos_gfortran!($b, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.375 μs (0.00% GC)
  median time:      1.393 μs (0.00% GC)
  mean time:        1.432 μs (0.00% GC)
  maximum time:     4.393 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

About as fast as compiling with gcc.
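
Since accuracy differences are one of the possible explanations, it is probably worth checking that the implementations agree before reading too much into the timings. A minimal sketch, using a plain Julia loop as the reference; `cos_loop!` and `maxabsdiff` are names I made up here, not part of VML or Base.

```julia
# Plain Julia reference loop; every iteration is independent of the others.
function cos_loop!(a, b)
    @inbounds for i in eachindex(a, b)
        a[i] = cos(b[i])
    end
    return a
end

# Largest absolute difference between two result arrays.
maxabsdiff(x, y) = maximum(abs.(x .- y))

b = randn(100, 20)
ref = cos_loop!(similar(b), b)

# Compare any other implementation against `ref` the same way.
maxabsdiff(ref, cos.(b))
```

Swapping `cos.(b)` for `VML.cos!(similar(b), b)` or `cos_gfortran!(similar(b), b)` gives the comparison for the two versions benchmarked above. I'd expect small differences in the last bits rather than exact equality.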
