How to speed up LoopVectorized code - 3x slower than C++ code

using StaticArrays, Parameters, LinearAlgebra, BenchmarkTools, StructArrays, LoopVectorization

const T = Float32
const Point = SVector{3, T}

@with_kw struct Ray @deftype Point
    origin = zeros(Point)
    direction = Point(0, 1, 0)
    # @assert norm(direction) ≈ 1
end

struct Sphere
    centre::Point
    radius::T
end

@with_kw struct Scene{T}
    Sphere::T = []
end

function StructArrays.staticschema(::Type{Point})
    # Define the desired names and eltypes of the "fields"
    return NamedTuple{(:x, :y, :z), fieldtypes(Point)...}
end

StructArrays.component(m::Point, key::Symbol) = getproperty(m, key)
StructArrays.createinstance(::Type{Point}, args...) = Point(args)

function findSceneIntersection(ray, hittable_list, tmin::T, tmax::T)
    besti::Int32 = 0

    @turbo for i in eachindex(hittable_list.Sphere)
        cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
        coy = hittable_list.Sphere.centre.y[i] - ray.origin.y 
        coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

        neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
        c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

        quarter_discriminant = neg_half_b^2 - c
        sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0

        root = neg_half_b - sqrtd
        root2 = neg_half_b + sqrtd

        t = ifelse(quarter_discriminant < 0, tmax, 
                ifelse(tmin < root < tmax, root, 
                    ifelse(tmin < root2 < tmax, root2, tmax)))

        newMinT = t < tmax
        tmax = ifelse(newMinT, t, tmax)
        besti = ifelse(newMinT, i, besti)
    end    

    return tmax, besti
end

N = 300
spheres = Sphere.([Point(1, 1, 1) * v for v in range(-100, 100, N)], range(5, 50, N))
scene = Scene(StructArray(spheres, unwrap = T -> !(T<:Real)));
ray = Ray()

@benchmark findSceneIntersection($ray, $scene, $(T(1e-4)), $(T(Inf)))

The C++ code for running the benchmark is Renderer/simdBenchmarking.cpp at SIMDTesting · Zentrik/Renderer (github.com), and the findSceneIntersection function is in Renderer/hittable_list.hpp at SIMDTesting · Zentrik/Renderer (github.com).
Right now the Julia findSceneIntersection function takes about 190 ns, compared to the C++ code which takes 60 ns.
One difference is that the C++ code has an if quarter_discriminant > 0, which gives a ~2x speed improvement in C++, as branching lets us skip expensive code. I'm not sure how to do that with LoopVectorization though.
I want to write a function like this, where I add an if and change how t and newMinT are computed:

function findSceneIntersection(ray, hittable_list, tmin::T, tmax::T)
    besti::Int32 = 0

    @turbo for i in eachindex(hittable_list.Sphere)
        cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
        coy = hittable_list.Sphere.centre.y[i] - ray.origin.y 
        coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

        neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
        c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

        quarter_discriminant = neg_half_b^2 - c

        if quarter_discriminant > 0
            sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0
    
            root = neg_half_b - sqrtd
            root2 = neg_half_b + sqrtd

            t = ifelse(root > tmin, root, root2)

            newMinT = tmin < t < tmax
            tmax = ifelse(newMinT, t, tmax)
            besti = ifelse(newMinT, i, besti)
        end
    end    

    return tmax, besti
end

Afaik, chained comparisons create branches. Perhaps LV does not like that? What happens if you do (tmin < root) & (root < tmax) etc?
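
For instance, as a tiny scalar illustration (the function names here are just placeholders of mine):

# Chained comparison: lowers to tmin < root && root < tmax, i.e. short-circuiting branches
inrange_chained(tmin, root, tmax) = tmin < root < tmax

# Unchained version: both comparisons are evaluated and combined with a bitwise &, no branch
inrange_branchless(tmin, root, tmax) = (tmin < root) & (root < tmax)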


As far as I can tell from @code_llvm, the original code doesn’t create a branch. Making the changes you suggested doesn’t make a speed difference.
What I want is some way to create a branch but I don’t know if LoopVectorization even supports that.


I’m not sure that StructArrays is helping here. It’s useful when you want to perform operations field-wise on an array-of-structs stored as a struct-of-arrays, but less so when you want to operate struct-wise like in your example, where you access the x, y, z, and radius fields all concurrently. As it is, you’re looping through 4 different vectors in parallel. There’s still a chance that StructArrays plays better with vectorization, but maybe not.

I’m also not sure this loop benefits massively from LoopVectorization. There’s enough going on (and some of it somewhat vectorizable already) that it’s at least worth trying without @turbo. If the C++ is much faster with the conditional execution (which precludes vectorization), it’s likely that the Julia version will be too.

At the cost of some extra storage and a second loop, you might be able to vectorize the computation of neg_half_b and c and then run a separate non-vectorized loop to compute tmax, besti from those results (with the branch on quarter_discriminant). I can’t say whether this will be better than doing it all in one pass, however.
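
Roughly, I mean something like this (an untested sketch; the scratch arrays quarter_disc and neg_half_bs are my own additions, preallocated to length N, and the setup from your first post is assumed):

function findSceneIntersection_twopass!(quarter_disc, neg_half_bs, ray, hittable_list, tmin::T, tmax::T)
    # Pass 1: vectorized computation of neg_half_b and the quarter discriminant
    @turbo for i in eachindex(hittable_list.Sphere)
        cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
        coy = hittable_list.Sphere.centre.y[i] - ray.origin.y
        coz = hittable_list.Sphere.centre.z[i] - ray.origin.z
        neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
        c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2
        neg_half_bs[i] = neg_half_b
        quarter_disc[i] = neg_half_b^2 - c
    end

    # Pass 2: scalar loop with a real branch on the discriminant
    besti = 0
    for i in eachindex(quarter_disc)
        qd = quarter_disc[i]
        if qd > 0
            sqrtd = sqrt(qd)
            root = neg_half_bs[i] - sqrtd
            t = root > tmin ? root : neg_half_bs[i] + sqrtd
            if tmin < t < tmax
                tmax = t
                besti = i
            end
        end
    end
    return tmax, besti
end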

It appears that you could use @fastmath for most/all of the math here without adverse consequences. It will likely save a couple of instructions in the linear algebra and be more accurate rather than less (thanks to fma).
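
As a self-contained toy demonstration of the fma point (the function names are mine):

# Without fastmath, a * b + c is a separate multiply and add;
# with it, LLVM is allowed to contract the two into a single fma.
f_plain(a, b, c) = a * b + c
f_fast(a, b, c)  = @fastmath a * b + c
# @code_native f_fast(1.0f0, 2.0f0, 3.0f0)  # should show a vfmadd instruction on x86 CPUs with FMA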

Not using StructArray didn’t seem to make a speed difference, but it meant I couldn’t use @turbo which gives a significant speed boost.

The C++ code is vectorised using intrinsics; I’m using a library which handles that, perhaps I should have made that clearer. The if doesn’t prevent vectorisation at all: the assembly is roughly the same as without the if, except there is a conditional jump to skip over some expensive computation. See Compiler Explorer; if you search for sqrtps you should find the relevant code, and you’ll notice all the instructions are vectorized and there’s a jump on the line preceding the sqrt.

Removing @turbo slowed the code down to 1.5 microseconds; @code_llvm shows that only 2-wide vectors are used, compared to 8-wide with @turbo. Adding the branch improves that significantly, down to 690 ns, but it’s still much slower than @turbo.

Running two separate loops as you suggested comes in at 280 ns, slower than just the one vectorized loop.

My understanding is that @turbo already applies @fastmath; I can see fmas in the LLVM IR when just using @turbo, and adding @fastmath didn’t make a difference.


If you feel like implementing it similarly to the C++ code you could try GitHub - eschnett/SIMD.jl: Explicit SIMD vector operations for Julia. Not sure about horizontal_find_first but the rest should be available.
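
For flavour, here is a toy 8-wide minimum reduction in that explicit style (my own example, not from your code; it assumes the array length is a multiple of 8):

using SIMD

function vmin(xs::Vector{Float32})
    acc = Vec{8, Float32}(Inf32)
    lane = VecRange{8}(0)
    @inbounds for i in 1:8:length(xs)
        acc = min(acc, xs[lane + i])   # 8-wide load and elementwise min
    end
    return minimum(acc)                # horizontal reduction of the accumulator
end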

Yeah, I was hoping I could avoid having to do it explicitly using SIMD.jl or VectorizationBase.jl.

Just FYI: when writing the names of macros, you should wrap them in backticks: `@turbo`. Otherwise you run the risk of pinging users named “turbo”.


If you remove different parts of the for-loop body, you can eventually find out which part Julia is doing much worse in, right?

I think most of the ray blah blah doesn’t matter; the ray parameters are constant, so basically you just have a loop over 3 equal-length arrays doing some calculation and a min reduction.

If we can have more nearly identical code in Julia and C++, it would be easier to compare the asm.


I don’t know SIMD details at all, but I think this is the C++ section that Zentrik is referring to (I removed less relevant lines for brevity):

        for (int i = 0; i < (int)radius.size(); i++)
        {
... some other vectors....
            Vec8fb isDiscriminantPositive = quarter_discriminant > Vec8f(0.0f);

            // if ray hits any of the 8 spheres
            if (horizontal_or(isDiscriminantPositive)) // Branching gives 2x speedup using sse (i.e. Vec4f but with Aras' code)
            {
...expensive computation...
            }
            curId += Vec8ui(Vec8f::size()); // easy way to keep track of which chunk we are on
        }

So the conditional branch is occurring between vectors, in this case deciding whether to skip code for each vector. Elementwise branches cannot be changed to vectorwise branches in general; e.g. if you’re choosing A or B elementwise, you don’t want A-B-A-B-A-B-A-B to be turned into AAAA-BBBB. I don’t know enough about LoopVectorization to know if it can potentially turn some elementwise branching into vectorwise branching (incidentally illustrated in this section of the Automatic vectorization Wikipedia page: Reducing_vectorization_overhead_in_the_presence_of_control_flow).
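
Very roughly, in SIMD.jl-ish code (I’m making up the names v, mask, t purely for illustration; not benchmarked), I think the two patterns look like:

using SIMD

v    = Vec{8, Float32}((1f0, 2f0, 3f0, -1f0, 5f0, -2f0, 7f0, 8f0))
mask = v > 0
t    = Vec{8, Float32}(Inf32)

# Elementwise "branch": both sides are evaluated, then blended lane by lane
t_blend = vifelse(mask, sqrt(v), t)   # NaN lanes from sqrt(<0) are discarded by the blend

# Vectorwise branch: the expensive work is skipped only when no lane needs it
if any(mask)
    t = vifelse(mask, sqrt(v), t)
end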

My own question is: is besti = ifelse(newMinT, i, besti) automatically vectorizable and does it have the same result as the manually vectorized C++ code? Your C++ code has a whole section of its own after the loop to select the particular result out of an 8-length vector, and I’m curious if LoopVectorization can do that automatically and correctly.
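
For concreteness, a toy scalar illustration (my own made-up numbers and names) of the post-loop step the C++ does explicitly, reducing the per-lane minima and their indices to a single result:

hitT_lanes  = Float32[5, 3, 9, 7, 2, 8, 6, 4]        # per-lane best t after the loop
besti_lanes = Int32[1, 10, 19, 28, 37, 46, 55, 64]   # per-lane best sphere index
lane  = argmin(hitT_lanes)       # lane holding the overall minimum t
tmax  = hitT_lanes[lane]         # 2.0f0
besti = besti_lanes[lane]        # Int32(37)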

#include <benchmark/benchmark.h>
#include <vector>

#include "version2/vectorclass.h"

constexpr const float infinity = (std::numeric_limits<float>::max)(); // idk how well infinity plays with fast math

class hittable_list {
    public:
        std::vector<Vec8f> radius;

        hittable_list() {}

        void add(float r)
        {
            if (radius.empty() || radius.back()[Vec8f::size() - 1] != 0) {
                radius.push_back(Vec8f(r, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
            } else {
                int i = horizontal_find_first(radius.back() == Vec8f(0.0f));

                radius.back().insert(i, r);
            }
        }

        float hit(float t_max) const
        {
            Vec8f hitT(t_max);

            for (int i = 0; i < (int)radius.size(); i++)
            {
                Vec8f quarter_discriminant = radius[i] - radius[i] * radius[i];
                Vec8fb isDiscriminantPositive = quarter_discriminant > Vec8f(0.0f);

                if (horizontal_or(isDiscriminantPositive))
                {
                    Vec8f t = sqrt(sqrt(quarter_discriminant));
                    t = select(sqrt(t) > 0, t, sqrt(t));
                    Vec8fb msk = isDiscriminantPositive & (t < hitT);

                    hitT = select(msk, t, hitT);
                }
            }

            return horizontal_min(hitT);
        }
};

hittable_list setup() {
    hittable_list scene; 

    int N = 296;

    for (int i = 0; i < N; i++) {
        float r = 0 + i * (2 - 0) / ((float)N - 1.);
        scene.add(r);
    }

    return scene;
}

template <class ...Args>
static void runbench(benchmark::State& state, Args&&... args) {
    auto args_tuple = std::make_tuple(std::move(args)...);
    hittable_list scene = std::get<0>(args_tuple);

    for (auto _ : state) {
        scene.hit(infinity);
    }
}

BENCHMARK_CAPTURE(runbench, scene interesction, setup());
BENCHMARK_MAIN();

using BenchmarkTools, LoopVectorization

const T = Float32

function findSceneIntersection(radius, tmax::T)
    @turbo for i in eachindex(radius)
        quarter_discriminant = radius[i] - radius[i] * radius[i]
        t = ifelse(quarter_discriminant > 0, sqrt(sqrt(quarter_discriminant)), tmax)
        t = ifelse(sqrt(t) > 0, t, sqrt(t))
        tmax =  ifelse(t < tmax, t, tmax)
    end    

    return tmax
end

N = 296
radius = T.(LinRange(0, 2, N))

findSceneIntersection(radius, T(Inf))
@benchmark findSceneIntersection($radius, $(T(Inf)))

@code_llvm debuginfo=:none findSceneIntersection(radius, T(Inf))
@code_native debuginfo=:none syntax=:intel findSceneIntersection(radius, T(Inf))

On my slower machine, the C++ code takes 175 ns whilst the Julia code takes 290 ns. The multiple sqrt calls and the if are just there to exaggerate the difference between the C++ and Julia code, so you can remove them if need be.
These should have simpler assembly.

Here's a link to the version2 library, vectorclass/version2: Vector class library, latest version (github.com); you just need to clone it and then give the correct path. Here's the link to the benchmark library for the C++ code, google/benchmark: A microbenchmark support library (github.com).

I’ll look more closely into where the rest of the performance difference between the original codes are tomorrow.

The Julia code is giving the same results. I would have to look at the LLVM IR in more detail to see how exactly it's doing it, but I believe LoopVectorization is automatically selecting the minimum result out of the vector.


Trying to run the Julia code on 1.8.5, I get an error about sqrt of negative values. It seems like a reasonable error since ifelse evaluates all arguments, so the sqrt of negatives is also evaluated.
How is this running on your setup?

@fastmath makes sqrt return 0 if you pass negative values, using @turbo has the same effect.


It should return NaN.

LoopVectorization currently doesn’t support actual branches, but you can hide them.

using LoopVectorization, VectorizationBase

@inline function maybecompute(
  quarter_discriminant::Vec{W,T},
  tmax::Vec{W,T}
) where {W,T}
  m = quarter_discriminant > 0
  !VectorizationBase.vany(m) && return tmax
  t = ifelse(m, sqrt(sqrt(quarter_discriminant)), tmax)
  return ifelse(sqrt(t) > 0, t, sqrt(t))
end
@inline function maybecompute(x::VecUnroll, y::VecUnroll)
  VecUnroll(
    VectorizationBase.fmap(
      maybecompute,
      VectorizationBase.data(x),
      VectorizationBase.data(y)
    )
  )
end
function findSceneIntersection_maybecompute(radius, tmax::T)
  @turbo for i in eachindex(radius)
    quarter_discriminant = radius[i] - radius[i] * radius[i]
    t = maybecompute(quarter_discriminant, tmax)
    tmax = ifelse(t < tmax, t, tmax)
  end
  return tmax
end

This is still slower than what I got from the C++ version, where clang took 75ns and gcc took a suspicious 17.3 ns.

Running /home/chriselrod/Documents/progwork/cxx/Repos/Renderer/c++/bench2gcc
Run on (36 X 4190.78 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x18)
  L1 Instruction 32 KiB (x18)
  L2 Unified 1024 KiB (x18)
  L3 Unified 25344 KiB (x1)
Load Average: 0.96, 1.46, 2.82
----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
runbench/scene interesction       17.3 ns         17.3 ns     40945414
julia> @benchmark findSceneIntersection_maybecompute($radius, $(T(Inf)))
BenchmarkTools.Trial: 10000 samples with 961 evaluations.
 Range (min … max):  85.970 ns … 106.797 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     86.086 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   86.290 ns ±   0.680 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁█▇▂                                       ▁ ▁▁▁▁▁          ▂
  ▅████▇▅▄▄▁▁▁▁▃▁▃▁▁▃▃▁▁▁▁▁▃▄▃▁▃▃▁▁▁▁▁▃▁▅▅▆▆▇███████████▇█▇▇▇▆ █
  86 ns         Histogram: log(frequency) by time      88.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The original LoopVectorization.jl version took 155ns on my computer.


Perhaps gcc/clang understood they can elide:

The original C++ code I was using to benchmark was optimising away some stuff, so after fixing that the timings were about 76 ns for gcc and 80 ns for clang.
This Julia code, using a fork of SIMD.jl (and the same boilerplate as in the first post), matched clang when running on 1.9.0-rc2:

const N = 8

@generated function getBits(mask::SIMD.Vec{N, Bool}) where N #This reverses the bits
   s = """
   %mask = trunc <$N x i8> %0 to <$N x i1>
   %res = bitcast <$N x i1> %mask to i$N
   ret i$N %res
   """
   return :(
       $(Expr(:meta, :inline));
       Base.llvmcall($s, UInt8, Tuple{SIMD.LVec{N, Bool}}, mask.data)
   )
end

function hor_min(x::SIMD.Vec{8, T}) where T
   @fastmath a = shufflevector(x, Val((4, 5, 6, 7, :undef, :undef, :undef, :undef))) # high half
   @fastmath b = min(a, x)
   @fastmath a = shufflevector(b, Val((2, 3, :undef, :undef, :undef, :undef, :undef, :undef)))
   @fastmath b = min(a, b)
   @fastmath a = shufflevector(b, Val((1, :undef, :undef, :undef, :undef, :undef, :undef, :undef)))
   @fastmath b = min(a, b)
   return @inbounds b[1]
end

@fastmath function findSceneIntersection(r, hittable_list, tmin, tmax)
   hitT = SIMD.Vec{N, T}(tmax)
   laneIndices = SIMD.Vec{N, Int32}(Int32.((1, 2, 3, 4, 5, 6, 7, 8)))
   minIndex = SIMD.Vec{N, Int32}(0)

   @inbounds @fastmath for lane in LoopVecRange{N}(hittable_list.Sphere, unsafe=true)
       cox = hittable_list.Sphere.centre.x[lane] - r.origin.x
       coy = hittable_list.Sphere.centre.y[lane] - r.origin.y
       coz = hittable_list.Sphere.centre.z[lane] - r.origin.z

       neg_half_b = r.direction.x * cox + r.direction.y * coy 
       neg_half_b += r.direction.z * coz

       c = cox*cox + coy*coy 
       c += coz*coz 
       c -= hittable_list.Sphere.radius[lane] * hittable_list.Sphere.radius[lane]

       quarter_discriminant = neg_half_b^2 - c
       isDiscriminantPositive = quarter_discriminant > 0

       if any(isDiscriminantPositive)
           @fastmath sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0
   
           root = neg_half_b - sqrtd
           root2 = neg_half_b + sqrtd

           t = vifelse(root > tmin, root, root2)

           newMinT = isDiscriminantPositive & (tmin < t) & (t < hitT)
           hitT = vifelse(newMinT, t, hitT)
           minIndex = vifelse(newMinT, laneIndices, minIndex)
       end

       laneIndices += N
   end

   minHitT = hor_min(hitT)

   if minHitT < tmax
       @inbounds i = minIndex[trailing_zeros(getBits(hitT == minHitT)) + 1]

       return minHitT, i
   else 
       return minHitT, 0
   end
end

The version using LoopVectorization.jl took about 100-110ns I believe

@inline function maybecompute(
    neg_half_b::VectorizationBase.Vec{W,T},
    quarter_discriminant::VectorizationBase.Vec{W,T},
    tmax::VectorizationBase.Vec{W,T},
    tmin::VectorizationBase.Vec{W,T}
) where {W,T}
    m = quarter_discriminant > 0
    !VectorizationBase.vany(m) && return tmax

    sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give NaN

    root = neg_half_b - sqrtd
    root2 = neg_half_b + sqrtd

    t = ifelse(root > tmin, root, root2)
    t = ifelse(m & (tmin < t), t, tmax)

    return t
end

@inline function maybecompute(x::VecUnroll, y::VecUnroll, z::VecUnroll, tmin::VectorizationBase.Vec{W,T}) where {W, T}
  VecUnroll(
    VectorizationBase.fmap(
      maybecompute,
      VectorizationBase.data(x),
      VectorizationBase.data(y),
      VectorizationBase.data(z),
      tmin,
    )
  )
end

function findSceneIntersection_maybecompute(ray, hittable_list, tmin::T, tmax::T)
    besti::Int32 = 0
    x = VectorizationBase.Vec{8,T}(tmin)

    @turbo for i in eachindex(hittable_list.Sphere)
        cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
        coy = hittable_list.Sphere.centre.y[i] - ray.origin.y 
        coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

        neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
        c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

        quarter_discriminant = neg_half_b^2 - c
        t = maybecompute(neg_half_b, quarter_discriminant, tmax, x)
        
        newMinT = t < tmax
        tmax = ifelse(newMinT, t, tmax)
        besti = ifelse(newMinT, i, besti)
    end    

    return tmax, besti
end

Have a runnable example? I’m curious why LV would be that much slower. I don’t think maybecompute should be that bad.

using StaticArrays, LinearAlgebra, BenchmarkTools, StructArrays, LoopVectorization, VectorizationBase

const T = Float32
const Point = SVector{3, T}

@kwdef struct Ray
    origin::Point = zeros(Point)
    direction::Point = Point(0, 1, 0)
end

struct Sphere
    centre::Point
    radius::T
end

@kwdef struct Scene{T}
    Sphere::T = []
end

function StructArrays.staticschema(::Type{Point})
    # Define the desired names and eltypes of the "fields"
    return NamedTuple{(:x, :y, :z), fieldtypes(Point)...}
end

StructArrays.component(m::Point, key::Symbol) = getproperty(m, key)
StructArrays.createinstance(::Type{Point}, args...) = Point(args)

N = 296
spheres = Sphere.([Point(1, 1, 1) * v for v in range(-100, 100, N)], range(5, 50, N))
scene = Scene(StructArray(spheres, unwrap = T -> !(T<:Real)));
ray = Ray()

@inline function maybecompute(
    neg_half_b::VectorizationBase.Vec{W,T},
    quarter_discriminant::VectorizationBase.Vec{W,T},
    tmax::VectorizationBase.Vec{W,T},
    tmin::VectorizationBase.Vec{W,T}
) where {W,T}
    m = quarter_discriminant > 0
    !VectorizationBase.vany(m) && return tmax

    sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give NaN

    root = neg_half_b - sqrtd
    root2 = neg_half_b + sqrtd

    t = ifelse(root > tmin, root, root2)
    t = ifelse(m & (tmin < t), t, tmax)

    return t
end

@inline function maybecompute(x::VecUnroll, y::VecUnroll, z::VecUnroll, tmin::VectorizationBase.Vec{W,T}) where {W, T}
  VecUnroll(
    VectorizationBase.fmap(
      maybecompute,
      VectorizationBase.data(x),
      VectorizationBase.data(y),
      VectorizationBase.data(z),
      tmin,
    )
  )
end

function findSceneIntersection_maybecompute(ray, hittable_list, tmin::T, tmax::T)
    besti::Int32 = 0
    x = VectorizationBase.Vec{8,T}(tmin)

    @turbo for i in eachindex(hittable_list.Sphere)
        cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
        coy = hittable_list.Sphere.centre.y[i] - ray.origin.y 
        coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

        neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
        c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

        quarter_discriminant = neg_half_b^2 - c
        t = maybecompute(neg_half_b, quarter_discriminant, tmax, x)
        
        newMinT = t < tmax
        tmax = ifelse(newMinT, t, tmax)
        besti = ifelse(newMinT, i, besti)
    end    

    return tmax, besti
end

findSceneIntersection_maybecompute(ray, scene, T(1e-4), T(Inf))
@benchmark findSceneIntersection_maybecompute($ray, $scene, $(T(1e-4)), $(T(Inf)))

using SIMD

@generated function getBits(mask::SIMD.Vec{N, Bool}) where N #This reverses the bits
    s = """
    %mask = trunc <$N x i8> %0 to <$N x i1>
    %res = bitcast <$N x i1> %mask to i$N
    ret i$N %res
    """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, UInt8, Tuple{SIMD.LVec{N, Bool}}, mask.data)
    )
end
 
function hor_min(x::SIMD.Vec{8, T}) where T
    @fastmath a = shufflevector(x, Val((4, 5, 6, 7, :undef, :undef, :undef, :undef))) # high half
    @fastmath b = min(a, x)
    @fastmath a = shufflevector(b, Val((2, 3, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    @fastmath a = shufflevector(b, Val((1, :undef, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    return @inbounds b[1]
end

SIMD.Intrinsics.add(x::NTuple{8, VecElement{Int32}}, y::NTuple{8, VecElement{Int32}}, ::SIMD.Intrinsics.FastMathFlags{128}) = SIMD.Intrinsics.add(x, y)

@fastmath function findSceneIntersection(r, hittable_list, tmin, tmax)
    N = 8
    hitT = SIMD.Vec{N, T}(tmax)
    laneIndices = SIMD.Vec{N, Int32}(Int32.((1, 2, 3, 4, 5, 6, 7, 8)))
    minIndex = SIMD.Vec{N, Int32}(0)

    lane = VecRange{N}(1)
    @inbounds @fastmath while lane.i <= length(hittable_list.Sphere)
        cox = hittable_list.Sphere.centre.x[lane] - r.origin.x
        coy = hittable_list.Sphere.centre.y[lane] - r.origin.y
        coz = hittable_list.Sphere.centre.z[lane] - r.origin.z

        neg_half_b = r.direction.x * cox + r.direction.y * coy 
        neg_half_b += r.direction.z * coz

        c = cox*cox + coy*coy 
        c += coz*coz 
        c -= hittable_list.Sphere.radius[lane] * hittable_list.Sphere.radius[lane]

        quarter_discriminant = neg_half_b^2 - c
        isDiscriminantPositive = quarter_discriminant > 0

        if any(isDiscriminantPositive)
            @fastmath sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0

            root = neg_half_b - sqrtd
            root2 = neg_half_b + sqrtd

            t = vifelse(root > tmin, root, root2)

            newMinT = isDiscriminantPositive & (tmin < t) & (t < hitT)
            hitT = vifelse(newMinT, t, hitT)
            minIndex = vifelse(newMinT, laneIndices, minIndex)
        end

        laneIndices += N
        lane += N
    end

    minHitT = hor_min(hitT)

    if minHitT < tmax
        @inbounds i = minIndex[trailing_zeros(getBits(hitT == minHitT)) + 1]

        return minHitT, i
    else 
        return minHitT, Int32(0)
    end
end

findSceneIntersection(ray, scene, T(1e-4), T(Inf))
@benchmark findSceneIntersection($ray, $scene, $(T(1e-4)), $(T(Inf)))

The median/mean time for findSceneIntersection_maybecompute was 105 ns, versus 82 ns for findSceneIntersection. I don't really see why findSceneIntersection_maybecompute is slower; maybe the reduction from the vector of besti to a scalar is slow, though I would be surprised if that accounted for 20 ns. There's also the fact that findSceneIntersection_maybecompute has an extra ifelse; I'm a bit confused by what's going on in the LLVM IR for that:

L194:                                             ; preds = %L185, %L127
  %value_phi24 = phi <8 x float> [ %res.i320, %L185 ], [ %value_phi1454, %L127 ]
  %m.i317 = fcmp reassoc nsz arcp contract ogt <8 x float> %res.i336, zeroinitializer
  %res.i316 = sext <8 x i1> %m.i317 to <8 x i32>
  %res.i314 = bitcast <8 x i32> %res.i316 to <8 x float>
  %41 = call i32 @llvm.x86.avx.vtestz.ps.256(<8 x float> %res.i314, <8 x float> %res.i314)
  %.not428 = icmp eq i32 %41, 0
  br i1 %.not428, label %L206, label %L219

L206:                                             ; preds = %L194
  %res.i313 = call fast <8 x float> @llvm.sqrt.v8f32(<8 x float> %res.i336)
  %res.i312 = fsub nsz contract <8 x float> %res.i363, %res.i313
  %res.i311 = fadd nsz contract <8 x float> %res.i363, %res.i313
  %m.i309 = fcmp reassoc nsz arcp contract ogt <8 x float> %res.i312, %29
  %res.i308 = select reassoc nsz arcp contract <8 x i1> %m.i309, <8 x float> %res.i312, <8 x float> %res.i311 
  %m.i305 = fcmp reassoc nsz arcp contract olt <8 x float> %29, %res.i308
  %combinedmask1.i304429 = and <8 x i1> %m.i305, %m.i317
  %res.i303 = select reassoc nsz arcp contract <8 x i1> %combinedmask1.i304429, <8 x float> %res.i308, <8 x float> %value_phi2456
  br label %L219

L219:                                             ; preds = %L206, %L194
  %value_phi26 = phi <8 x float> [ %res.i303, %L206 ], [ %value_phi2456, %L194 ]
  %m.i300 = fcmp reassoc nsz arcp contract olt <8 x float> %value_phi24, %value_phi1454
  %m.i298 = fcmp reassoc nsz arcp contract olt <8 x float> %value_phi26, %value_phi2456
  %res.i297 = select reassoc nsz arcp contract <8 x i1> %m.i300, <8 x float> %value_phi24, <8 x float> %value_phi1454
  %res.i295 = select reassoc nsz arcp contract <8 x i1> %m.i298, <8 x float> %value_phi26, <8 x float> %value_phi2456
  %42 = trunc i64 %value_phi453 to i32
  %ie.i290 = insertelement <8 x i32> undef, i32 %42, i64 0
  %v.i291 = shufflevector <8 x i32> %ie.i290, <8 x i32> undef, <8 x i32> zeroinitializer
  %res.i292 = add nsw <8 x i32> %v.i291, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %43 = add i32 %42, 8
  %ie.i287 = insertelement <8 x i32> undef, i32 %43, i64 0
  %v.i288 = shufflevector <8 x i32> %ie.i287, <8 x i32> undef, <8 x i32> zeroinitializer
  %res.i289 = add nsw <8 x i32> %v.i288, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %res.i286 = select <8 x i1> %m.i300, <8 x i32> %res.i292, <8 x i32> %value_phi20458
  %res.i284 = select <8 x i1> %m.i298, <8 x i32> %res.i289, <8 x i32> %value_phi21459
  %res.i282 = add nuw nsw i64 %value_phi453, 16
  %.not460 = icmp sgt i64 %res.i282, %res.i281
  br i1 %.not460, label %L234, label %L127

L234:                                             ; preds = %L219
  %m.i279 = fcmp reassoc nsz arcp contract olt <8 x float> %res.i297, %res.i295
  %res.i278 = select reassoc nsz arcp contract <8 x i1> %m.i279, <8 x float> %res.i297, <8 x float> %res.i295 
  %res.i272 = select <8 x i1> %m.i279, <8 x i32> %res.i286, <8 x i32> %res.i284
  br label %L242

The L219 block seems to contain a lot of extra instructions that the SIMD version doesn't have.

This is off-topic, but this code is quite confusing. First you use T as an alias for Float32, but then later T is a type parameter which is unrelated to the previous T (also, there is an entrenched convention for using T as a type parameter). Similarly, Sphere is both a type and a field inside a different type. It may work, but one starts to wonder if things work as intended.
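
E.g., one way to disambiguate the names (a sketch only; F and spheres are names I picked, not from the original code):

using StaticArrays

const F = Float32            # element-type alias, leaving T free for type parameters
const Point = SVector{3, F}

struct Sphere
    centre::Point
    radius::F
end

Base.@kwdef struct Scene{S}  # parameter S no longer shadows the element-type alias
    spheres::S = Sphere[]    # field name no longer collides with the Sphere type
end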
