Trig functions very slow

You can see how I implemented convert here, though it has been a few years, so I don’t remember exactly why I did it that way.
I vaguely recall that there was nothing in SIMD.jl that was useful to me at the time, which is why I never used it.

In Julia 1.0, typealias a b is written const a = b.
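
For example (Vec8d is just a hypothetical alias name here), the old spelling and its 1.0 replacement:

# Pre-1.0 syntax (no longer valid):
# typealias Vec8d NTuple{8, Base.VecElement{Float64}}

# Julia 1.0 syntax:
const Vec8d = NTuple{8, Base.VecElement{Float64}}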

However, I note that what I implemented was type piracy (or at least came dangerously close to it).
I think there is an equivalent using single-field structs, though; see the sketch below.
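
A minimal sketch of what I mean (MyVec is a hypothetical name); the methods dispatch on a type you own, so nothing is pirated:

struct MyVec{N,T}
    data::NTuple{N,Base.VecElement{T}}
end

# The method is added to Base.:+ but dispatches on a type defined here,
# not on NTuple itself, so this is not type piracy:
function Base.:+(a::MyVec{N,T}, b::MyVec{N,T}) where {N,T}
    MyVec{N,T}(ntuple(i -> Base.VecElement(a.data[i].value + b.data[i].value), Val(N)))
end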

Probably worth building Julia with LLVM assertions enabled if you get segfaults.

BenchmarkTools creates a new function using gensym (to get a unique symbol), which is what gives it this kind of name.
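
For example, gensym produces names of the following form (the exact number differs every time):

julia> gensym(:benchmark)
Symbol("##benchmark#365")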

I make a lot of use of SIMD.jl in some of my libraries. While you can unsafe_load and unsafe_store! to get vectors of the desired types (e.g., __m256d), SIMD.jl already provides these as convenience functions.
Despite what the README says (that masked load/store operations are not yet supported), it does in fact support masked load/store.

Therefore, using its data type provides a lot of convenience.
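
For example, a minimal sketch of the load/store convenience functions (signatures as I remember them from this era of SIMD.jl, so they may have changed; the masked variants take an additional mask argument):

using SIMD

xs = rand(8)
v = vload(Vec{4,Float64}, xs, 1)   # load xs[1:4] into a Vec{4,Float64}
vstore(v + v, xs, 5)               # store the doubled vector back into xs[5:8]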

SIMD.jl’s data type is a single-field struct, which is why all the segfaults I’m seeing with it bewilder me:

const VE = Base.VecElement

export Vec
struct Vec{N,T<:ScalarTypes} <: DenseArray{T,1}   # <: Number
    elts::NTuple{N,VE{T}}
    @inline Vec{N,T}(elts::NTuple{N, VE{T}}) where {N,T} = new{N,T}(elts)
end

If v is an instance of Vec{4,Float64}, then v.elts is an instance of your __m256d.
However, passing v.elts segfaults, while passing the bare NTuple (not wrapped in the single-field struct) does not.
Similarly, relying on their convert method:

# Convert vectors to tuples
@generated function Base.convert(::Type{NTuple{N,R}}, v::Vec{N,T}) where {N,R,T}
    quote
        $(Expr(:meta, :inline))
        tuple($([:(R(v.elts[$i].value)) for i in 1:N]...))
    end
end

also segfaults.
I doubt the problem is with SIMD.jl, as I see no reason either of those approaches should behave any differently, or why directly passing the single-field struct should cause problems.

Does making a debug build of Julia turn on the LLVM assertions?
If so, I’ll do that tonight and report back.

FWIW, I’ve not seen any segfaults while benchmarking things like this (following yuyichao’s suggestion):

function xlog_ha(v1::Vec{4,Float64})
    # Pass the NTuple{4,VecElement{Float64}} field directly to the C function,
    # then wrap the result back up in a Vec.
    Vec{4,Float64}(ccall((:Sleef_logd4_u10,libsleef),
        NTuple{4, VE{Float64}}, (NTuple{4, VE{Float64}},), v1.elts))
end

Interesting. I did try that and got segfaults.

It is possible that the issue is local, and others can’t reproduce.

I hadn’t bothered testing with vectors of length 4. It turns out that works!

julia> using SLEEFwrap

julia> using SIMD, BenchmarkTools

julia> randvec(::Val{N}) where N = ntuple(i -> Core.VecElement(rand()), Val(N))
randvec (generic function with 1 method)

julia> SIMD.Vec(x::NTuple{N,SIMD.VE{T}}) where {N,T} = Vec{N,T}(x)

julia> a4 = randvec(Val(4))
(VecElement{Float64}(0.012204276549697468), VecElement{Float64}(0.5756570832359544), VecElement{Float64}(0.28790436098749095), VecElement{Float64}(0.8984593014651698))

julia> va4 = Vec(a4)
4-element Vec{4,Float64}:
Float64⟨0.012204276549697468, 0.5756570832359544, 0.28790436098749095, 0.8984593014651698⟩

julia> a8 = randvec(Val(8))
(VecElement{Float64}(0.1177649743833935), VecElement{Float64}(0.209220421632466), VecElement{Float64}(0.9002632949197731), VecElement{Float64}(0.7208552870686167), VecElement{Float64}(0.8028709546942032), VecElement{Float64}(0.36273577375585764), VecElement{Float64}(0.4345504170952268), VecElement{Float64}(0.38651607365457386))

julia> va8 = Vec(a8)
8-element Vec{8,Float64}:
Float64⟨0.1177649743833935, 0.209220421632466, 0.9002632949197731, 0.7208552870686167, 0.8028709546942032, 0.36273577375585764, 0.4345504170952268, 0.38651607365457386⟩

julia> @btime SLEEFwrap.log($a4)
  13.488 ns (0 allocations: 0 bytes)
(VecElement{Float64}(-4.405968851806169), VecElement{Float64}(-0.5522431371766718), VecElement{Float64}(-1.2451269339014779), VecElement{Float64}(-0.10707386987215738))

julia> @btime SLEEFwrap.log($va4)
  13.638 ns (0 allocations: 0 bytes)
4-element Vec{4,Float64}:
Float64⟨-4.405968851806169, -0.5522431371766718, -1.2451269339014779, -0.10707386987215738⟩

julia> @btime SLEEFwrap.log($a8)
  10.845 ns (-4611372873275237343 allocations: 0 bytes)
(VecElement{Float64}(-2.1390643831869203), VecElement{Float64}(-1.564366933809658), VecElement{Float64}(-0.10506800853134832), VecElement{Float64}(-0.3273168732590404), VecElement{Float64}(-0.21956128194192048), VecElement{Float64}(-1.0140806057515694), VecElement{Float64}(-0.8334433062665966), VecElement{Float64}(-0.9505818240973518))

julia> @btime SLEEFwrap.log($va8)

signal (11): Segmentation fault
in expression starting at no file:0
#_run#14 at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:336
unknown function (ip: 0x7f550811dd8b)
jl_fptr_trampoline at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
inner at ./none:0
jl_fptr_trampoline at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
jl_apply at /home/chriselrod/Documents/languages/julia-ob/src/julia.h:1559 [inlined]
jl_f__apply at /home/chriselrod/Documents/languages/julia-ob/src/builtins.c:556
jl_f__apply_latest at /home/chriselrod/Documents/languages/julia-ob/src/builtins.c:594
#invokelatest#1 at ./essentials.jl:701 [inlined]
#invokelatest at ./none:0 [inlined]
#run_result#16 at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:32 [inlined]
#run_result at ./none:0 [inlined]
#run#18 at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:46
jl_fptr_trampoline at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
#run at ./none:0 [inlined]
#run at ./none:0 [inlined]
#warmup#21 at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:79 [inlined]
warmup at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:79
jl_fptr_trampoline at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
do_call at /home/chriselrod/Documents/languages/julia-ob/src/interpreter.c:323
eval_value at /home/chriselrod/Documents/languages/julia-ob/src/interpreter.c:411
eval_stmt_value at /home/chriselrod/Documents/languages/julia-ob/src/interpreter.c:362 [inlined]
eval_body at /home/chriselrod/Documents/languages/julia-ob/src/interpreter.c:750
jl_interpret_toplevel_thunk_callback at /home/chriselrod/Documents/languages/julia-ob/src/interpreter.c:880
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7f54ce03945f)
unknown function (ip: 0x9)
jl_interpret_toplevel_thunk at /home/chriselrod/Documents/languages/julia-ob/src/interpreter.c:889
jl_toplevel_eval_flex at /home/chriselrod/Documents/languages/julia-ob/src/toplevel.c:818
jl_toplevel_eval_in at /home/chriselrod/Documents/languages/julia-ob/src/builtins.c:622
eval at ./boot.jl:319
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
eval_user_input at /home/chriselrod/Documents/languages/julia-ob/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:85
run_backend at /home/chriselrod/.julia/packages/Revise/xO23U/src/Revise.jl:766
#58 at ./task.jl:259
jl_fptr_trampoline at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
jl_apply at /home/chriselrod/Documents/languages/julia-ob/src/julia.h:1559 [inlined]
start_task at /home/chriselrod/Documents/languages/julia-ob/src/task.c:271
unknown function (ip: 0xffffffffffffffff)
Allocations: 50471263 (Pool: 50461103; Big: 10160); GC: 150
Segmentation fault (core dumped)

On a computer with AVX-512, it is definitely worth taking advantage of: approximately 2x the speed is really worthwhile.

Just started making the debug build of Julia. Will update once that’s compiled.

EDIT:
The debug build didn’t do the trick; same output:

julia> @btime SLEEFwrap.log($va8)

signal (11): Segmentation fault
in expression starting at no file:0
#_run#10 at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:336
unknown function (ip: 0x7f6fd03c07cb)
jl_fptr_trampoline at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
jl_apply at /home/chriselrod/Documents/languages/julia-ob/src/julia.h:1559
jl_invoke at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:56
inner at ./none:0

I was hoping the debug build of Julia would turn on LLVM assertions, but I noticed it did not recompile LLVM.
Searching through the Makefile, I don’t see anything I can define in Make.user to turn them on.

This is what I got from gdb:

Thread 1 "julia-debug" received signal SIGSEGV, Segmentation fault.
0x00007fffd060202e in julia_#_run#7_-532980714 (verbose=<optimized out>, pad=<optimized out>, kwargs=..., b=..., p=...)
    at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:336
336	            return sort!(trial), return_val
(gdb) bt
#0  0x00007fffd060202e in julia_#_run#7_-532980714 (verbose=<optimized out>, pad=<optimized out>, kwargs=..., b=..., p=...)
    at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:336
#1  0x00007fffd060212c in jfptr_#_run#7_-532980713 ()
#2  0x00007ffff78ca69c in jl_fptr_trampoline (m=0x7fffdcc4db90, args=0x7fffffff95a0, nargs=7)
    at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
#3  0x00007ffff78cb814 in jl_apply_generic (args=0x7fffffff95a0, nargs=7) at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
#4  0x00007ffff78c3c39 in jl_apply (args=0x7fffffff95a0, nargs=7) at /home/chriselrod/Documents/languages/julia-ob/src/julia.h:1559
#5  0x00007ffff78c4358 in jl_invoke (meth=0x7fffdcc4db90, args=0x7fffffff95a0, nargs=7) at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:56
#6  0x00007fffd06016f2 in #_run () at iterators.jl:169
#7  japi1_inner_-532980715 () at essentials.jl:700
#8  0x00007ffff78ca738 in jl_fptr_args (m=0x7fffdd7c0f10, args=0x7fffffff9750, nargs=1) at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1853
#9  0x00007ffff78ca69c in jl_fptr_trampoline (m=0x7fffdd7c0f10, args=0x7fffffff9750, nargs=1)
    at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
#10 0x00007ffff78cb814 in jl_apply_generic (args=0x7fffffff9750, nargs=1) at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
#11 0x00007ffff78d7c9b in jl_apply (args=0x7fffffff9750, nargs=1) at /home/chriselrod/Documents/languages/julia-ob/src/julia.h:1559
#12 0x00007ffff78dad89 in jl_f__apply (F=0x0, args=0x7fffffff9928, nargs=1) at /home/chriselrod/Documents/languages/julia-ob/src/builtins.c:556
#13 0x00007ffff78dafbc in jl_f__apply_latest (F=0x0, args=0x7fffffff9928, nargs=1) at /home/chriselrod/Documents/languages/julia-ob/src/builtins.c:594
#14 0x00007fffd05ffe97 in #invokelatest#1 () at essentials.jl:701
#15 0x00007fffd05ffe97 in #invokelatest ()
#16 #run_result#16 () at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:32
#17 0x00007fffd05ffe97 in #run_result ()
#18 japi1_#run#18_-532980727 (kwargs=..., b=..., p=...) at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:46
#19 0x00007ffff78ca738 in jl_fptr_args (m=0x7fffdd38e990, args=0x7fffffff9b48, nargs=5) at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1853

Line 336 in context.

It’s also worth pointing out that I don’t get segfaults when I’m not benchmarking, but I haven’t tested in a large program.

I am strongly considering just making a SIMDPirates.jl, copying SIMD.jl but using NTuple{N,VE{T}}s directly rather than the single-field struct wrapper Vec type. Alternatively, maybe a little more work will let me avoid this unnecessary copying:

julia> @code_native SLEEFwrap.exp(xvec)
	.text
; Function exp {
; Location: SLEEFwrap.jl:91
	pushq	%rbx
	movq	%rdi, %rbx
; Function cconvert; {
; Location: essentials.jl:355
; Function convert; {
; Location: SIMD.jl:137
; Function macro expansion; {
; Location: SIMD.jl:139
	vmovups	(%rsi), %zmm0
;}}}
	movabsq	$Sleef_expd8_u10, %rax
	callq	*%rax
	vmovaps	%zmm0, (%rbx)
	movq	%rbx, %rax
	popq	%rbx
	vzeroupper
	retq
	nopw	%cs:(%rax,%rax)
;}

julia> @code_native SLEEFwrap.exp(xv)
	.text
; Function exp {
; Location: SLEEFwrap.jl:96
	pushq	%rax
	movabsq	$Sleef_expd8_u10, %rax
	callq	*%rax
	popq	%rax
	retq
	nop
;}

In the first version, the data is copied through memory around the call (the vmovups load and the vmovaps store), while the second compiles down to nothing but the call itself. No need for SIMDPirates to actually pirate, though; all uses of +, fma, etc. will simply have to be qualified, as in the rough sketch below.
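
A rough sketch of what that could look like (vadd and vmul are hypothetical names; everything operates on the raw tuples, and no methods are added to functions or types the package doesn’t own):

module SIMDPirates

const VE = Base.VecElement

# Arithmetic on NTuple{N,VecElement{T}} directly, under package-specific names,
# so nothing is pirated.
@inline vadd(a::NTuple{N,VE{T}}, b::NTuple{N,VE{T}}) where {N,T} =
    ntuple(i -> VE(a[i].value + b[i].value), Val(N))

@inline vmul(a::NTuple{N,VE{T}}, b::NTuple{N,VE{T}}) where {N,T} =
    ntuple(i -> VE(a[i].value * b[i].value), Val(N))

end # module

# Callers qualify every operation:
# c = SIMDPirates.vadd(a, b)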

May I point out that

julia> sin(2π) 
-2.4492935982947064e-16

julia> sin(38π)
-1.1759085194360944e-14

In that context, differences on the order of 1e-16 or less are not significant (2π cannot be represented exactly as a Float64, so sin of it should not be exactly zero anyway).

cf.

julia> sinpi(2)
0.0

julia> sinpi(38)
0.0

If you know you have multiples of π and this kind of precision is important for your use case, use sinpi and friends.
