# How to speed up LoopVectorized code - 3x slower than C++ code

``````using StaticArrays, Parameters, LinearAlgebra, BenchmarkTools, StructArrays, LoopVectorization

const T = Float32
const Point = SVector{3, T}

@with_kw struct Ray @deftype Point
origin = zeros(Point)
direction = Point(0, 1, 0)
# @assert norm(direction) ≈ 1
end

struct Sphere
centre::Point
end

@with_kw struct Scene{T}
Sphere::T = []
end

function StructArrays.staticschema(::Type{Point})
# Define the desired names and eltypes of the "fields"
return NamedTuple{(:x, :y, :z), fieldtypes(Point)...}
end

StructArrays.component(m::Point, key::Symbol) = getproperty(m, key)
StructArrays.createinstance(::Type{Point}, args...) = Point(args)

function findSceneIntersection(ray, hittable_list, tmin::T, tmax::T)
besti::Int32 = 0

@turbo for i in eachindex(hittable_list.Sphere)
cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
coy = hittable_list.Sphere.centre.y[i] - ray.origin.y
coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

quarter_discriminant = neg_half_b^2 - c
sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0

root = neg_half_b - sqrtd
root2 = neg_half_b + sqrtd

t = ifelse(quarter_discriminant < 0, tmax,
ifelse(tmin < root < tmax, root,
ifelse(tmin < root2 < tmax, root2, tmax)))

newMinT = t < tmax
tmax = ifelse(newMinT, t, tmax)
besti = ifelse(newMinT, i, besti)
end

return tmax, besti
end

N = 300
spheres = Sphere.([Point(1, 1, 1) * v for v in range(-100, 100, N)], range(5, 50, N))
scene = Scene(StructArray(spheres, unwrap = T -> !(T<:Real)));
ray = Ray()

@benchmark findSceneIntersection(\$ray, \$scene, \$(T(1e-4)), \$(T(Inf)))
``````

The C++ code for running the benchmark is Renderer/simdBenchmarking.cpp at SIMDTesting · Zentrik/Renderer (github.com) and the `findSceneIntersection` function is Renderer/hittable_list.hpp at SIMDTesting · Zentrik/Renderer (github.com).
Right now the `findSceneIntersection` function takes about 190ns compared to C++ code which takes 60ns.
One difference is the C++ code has an `if quarter_discriminant > 0` which gives a ~2x speed improvement in C++, as branching let’s us skip expensive code. I’m not sure how to do that with LoopVectorization though.
I want to write a function like this, where I add an `if` and changed `t` and `newMinT`:

``````function findSceneIntersection(ray, hittable_list, tmin::T, tmax::T)
besti::Int32 = 0

@turbo for i in eachindex(hittable_list.Sphere)
cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
coy = hittable_list.Sphere.centre.y[i] - ray.origin.y
coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

quarter_discriminant = neg_half_b^2 - c

if quarter_discriminant > 0
sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0

root = neg_half_b - sqrtd
root2 = neg_half_b + sqrtd

t = ifelse(root > tmin, root, root2)

newMinT = tmin < t < tmax
tmax = ifelse(newMinT, t, tmax)
besti = ifelse(newMinT, i, besti)
end
end

return tmax, besti
end
``````
1 Like

Afaik, chained comparisons create branches. Perhaps LV does not like that? What happens if you do `(tmin < root) & (root < tmax)` etc?

2 Likes

As far as I can tell from @code_llvm, the original code doesn’t create a branch. Making the changes you suggested doesn’t make a speed difference.
What I want is some way to create a branch but I don’t know if LoopVectorization even supports that.

1 Like

I’m not sure that `StructArrays` is helping here. They’re useful when you want to perform operations field-wise on a struct-array/array-of-struct, but less so when you want to operate struct-wise like in your example where you access the `x,y,z,radius` fields all concurrently. As it is, you’re looping through 4 different vectors in parallel. There’s still a chance the `StructArrays` play better with vectorization, but maybe not.

I’m also not sure this loop benefits massively from `LoopVectorization`. There’s enough going on (and some of it somewhat vectorizable already) that it’s at least worth trying without `@turbo`. If the C++ is much faster with the conditional execution (which precludes vectorization), it’s likely that the Julia will be too.

At the cost of some extra storage and a second loop, you might be able to vectorize the computation of of `neg_half_b` and `c` and then run a separate non-vectorized loop to compute `tmax,besti` from those results (with the branch on `quarter_discriminant`). I can’t say whether this will be better or not than doing it all in one pass, however.

It appears that you could use `@fastmath` for most/all of the math here without adverse consequences. It will likely save a couple of instructions in the linear algebra and be more accurate rather than less (thanks to `fma`).

Not using StructArray didn’t seem to make a speed difference, but it meant I couldn’t use `@turbo` which gives a significant speed boost.

The C++ code is vectorised using intrinsics, I’m using a library which handles that, perhaps I should have made that clearer. The `if` doesn’t prevent vectorisation at all, the assembly is roughly the same as without the `if` except there is a conditional jump to skip over some expensive computation. See Compiler Explorer, if you search for `sqrtps` you should find the relevant code, you’ll notice all the instructions are vectorized and there’s a jump on the line preceding the sqrt.

Removing `@turbo` slowed the code down to 1.5 microseconds, `@code_llvm` shows that only 2 wide vectors are used compared to 8 wide using `@turbo`. Adding the branch significantly improves it down to 690ns, but still much slower than `@turbo`.

Running two separate loops as you suggested is slower than just one vectorized loop at 280ns.

My understanding is that `@turbo` adds `@fastmath`, I can see fmas in the llvm when just using `@turbo` and adding `@fastmath` didn’t make a difference.

2 Likes

If you feel like implementing it similarly to the C++ code you could try GitHub - eschnett/SIMD.jl: Explicit SIMD vector operations for Julia. Not sure about `horizontal_find_first` but the rest should be available.

Yh, I was hoping I could avoid having to do it explicitly using SIMD.jl or VectorizationBase.jl.

Just FYI: when writing the names of macros, you should wrap them in backticks: ``@turbo`` Otherwise you run the risk of pinging users named “turbo”.

2 Likes

if you remove different parts of the for-loop content you can eventually find out which part Julia is doing much worse right?

I think most of the ray blah blah doesn’t matter, the ray parameters are constant, basically you just have a looper over 3 equal-length array and doing some calculation and `min` reduction

if we can have more identical code on Julia and C++ it would be easier to compare asm

2 Likes

I don’t know SIMD details at all, but I think this is the C++ section that Zentrik is referring to (I removed less relevant lines for brevity):

``````        for (int i = 0; i < (int)radius.size(); i++)
{
... some other vectors....
Vec8fb isDiscriminantPositive = quarter_discriminant > Vec8f(0.0f);

// if ray hits any of the 8 spheres
if (horizontal_or(isDiscriminantPositive)) // Branching gives 2x speedup using sse (i.e. Vec4f but with Aras' code)
{
...expensive computation...
}
curId += Vec8ui(Vec8f::size()); // easy way to keep track of which chunk we are on
}
``````

So the conditional branch is occurring between vectors, in this case deciding whether to skip code for each vector. Elementwise branches cannot be changed to vectorwise branches in general e.g. if you’re choosing A or B elementwise, you don’t want a A-B-A-B-A-B-A-B to be turned to AAAA-BBBB. I don’t know enough about LoopVectorization to know if it can potentially turn some elementwise branching to vectorwise branching (incidentally illustrated in this section of the Automatic vectorization Wikipedia page: Reducing_vectorization_overhead_in_the_presence_of_control_flow).

My own question is: is `besti = ifelse(newMinT, i, besti)` automatically vectorizable and does it have the same result as the manually vectorized C++ code? Your C++ code has a whole section of its own after the loop to select the particular result out of an 8-length vector, and I’m curious if LoopVectorization can do that automatically and correctly.

1 Like
``````#include <benchmark/benchmark.h>
#include <vector>

#include "version2/vectorclass.h"

constexpr const float infinity = (std::numeric_limits<float>::max)(); // idk how well infinity plays with fast math

class hittable_list {
public:

hittable_list() {}

{
radius.push_back(Vec8f(r, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
} else {
int i = horizontal_find_first(radius.back() == Vec8f(0.0f));

}
}

float hit(float t_max) const
{
Vec8f hitT(t_max);

for (int i = 0; i < (int)radius.size(); i++)
{
Vec8fb isDiscriminantPositive = quarter_discriminant > Vec8f(0.0f);

if (horizontal_or(isDiscriminantPositive))
{
Vec8f t = sqrt(sqrt(quarter_discriminant));
t = select(sqrt(t) > 0, t, sqrt(t));
Vec8fb msk = isDiscriminantPositive & (t < hitT);

hitT = select(msk, t, hitT);
}
}

return horizontal_min(hitT);;
}
};

hittable_list setup() {
hittable_list scene;

int N = 296;

for (int i = 0; i < N; i++) {
float r = 0 + i * (2 - 0) / ((float)N - 1.);
}

return scene;
}

template <class ...Args>
static void runbench(benchmark::State& state, Args&&... args) {
auto args_tuple = std::make_tuple(std::move(args)...);
hittable_list scene = std::get<0>(args_tuple);

for (auto _ : state) {
scene.hit(infinity);
}
}

BENCHMARK_CAPTURE(runbench, scene interesction, setup());
BENCHMARK_MAIN();
``````
``````using BenchmarkTools, LoopVectorization

const T = Float32

t = ifelse(quarter_discriminant > 0, sqrt(sqrt(quarter_discriminant)), tmax)
t = ifelse(sqrt(t) > 0, t, sqrt(t))
tmax =  ifelse(t < tmax, t, tmax)
end

return tmax
end

N = 296

``````

On my slower machine, the c++ code takes 175ns whilst the julia code takes 290ns, the multiple `sqrt` and `if` are just to exaggerate the difference between c++ and julia code, so you can remove if need be.
These should have simpler assembly.

Heres a link to the version2 library, vectorclass/version2: Vector class library, latest version (github.com), you just need to clone it and then give the correct path. Here’s the link to the benchmark library for the C++ code, google/benchmark: A microbenchmark support library (github.com).

I’ll look more closely into where the rest of the performance difference between the original codes are tomorrow.

The julia code is giving the same results, I would have to look at the llvm in more detail to see how exactly its doing it but I believe LoopVectorization is automatically selecting the minimum result out of the vector.

1 Like

Trying to run the Julia code on 1.8.5, I get an error about `sqrt` of negative values. It seems like a reasonable error since `ifelse` evaluates all arguments, so the `sqrt` of negatives is also evaluated.
How is this running on your setup?

`@fastmath` makes `sqrt` return 0 if you pass negative values, using `@turbo` has the same effect.

1 Like

It should return `NaN`.

LoopVectorization currently doesn’t support actual branches, but you can hide them.

``````using LoopVectorization, VectorizationBase

@inline function maybecompute(
quarter_discriminant::Vec{W,T},
tmax::Vec{W,T}
) where {W,T}
m = quarter_discriminant > 0
!VectorizationBase.vany(m) && return tmax
t = ifelse(m, sqrt(sqrt(quarter_discriminant)), tmax)
return ifelse(sqrt(t) > 0, t, sqrt(t))
end
@inline function maybecompute(x::VecUnroll, y::VecUnroll)
VecUnroll(
VectorizationBase.fmap(
maybecompute,
VectorizationBase.data(x),
VectorizationBase.data(y)
)
)
end
t = maybecompute(quarter_discriminant, tmax)
tmax = ifelse(t < tmax, t, tmax)
end
return tmax
end
``````

This is still slower than what I got from the C++ version, where clang took 75ns and gcc took a suspicious 17.3 ns.

``````Running /home/chriselrod/Documents/progwork/cxx/Repos/Renderer/c++/bench2gcc
Run on (36 X 4190.78 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x18)
L1 Instruction 32 KiB (x18)
L2 Unified 1024 KiB (x18)
L3 Unified 25344 KiB (x1)
----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
runbench/scene interesction       17.3 ns         17.3 ns     40945414
``````
``````julia> @benchmark findSceneIntersection_maybecompute(\$radius, \$(T(Inf)))
BenchmarkTools.Trial: 10000 samples with 961 evaluations.
Range (min … max):  85.970 ns … 106.797 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     86.086 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   86.290 ns ±   0.680 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▁█▇▂                                       ▁ ▁▁▁▁▁          ▂
▅████▇▅▄▄▁▁▁▁▃▁▃▁▁▃▃▁▁▁▁▁▃▄▃▁▃▃▁▁▁▁▁▃▁▅▅▆▆▇███████████▇█▇▇▇▆ █
86 ns         Histogram: log(frequency) by time      88.6 ns <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

The original LoopVectorization.jl version took 155ns on my computer.

2 Likes

Perhaps gcc/clang understood they can elide:

The original c++ code I was using to benchmark was optimising away some stuff, so fixing that the timings were about 76ns for gcc and 80ns for clang.
This julia code using a fork of SIMD.jl (using same boilerplate as in the first post) matched clang when running on 1.9.,0-rc2

``````const N = 8

@generated function getBits(mask::SIMD.Vec{N, Bool}) where N #This reverses the bits
s = """
%mask = trunc <\$N x i8> %0 to <\$N x i1>
%res = bitcast <\$N x i1> %mask to i\$N
ret i\$N %res
"""
return :(
\$(Expr(:meta, :inline));
)
end

function hor_min(x::SIMD.Vec{8, T}) where T
@fastmath a = shufflevector(x, Val((4, 5, 6, 7, :undef, :undef, :undef, :undef))) # high half
@fastmath b = min(a, x)
@fastmath a = shufflevector(b, Val((2, 3, :undef, :undef, :undef, :undef, :undef, :undef)))
@fastmath b = min(a, b)
@fastmath a = shufflevector(b, Val((1, :undef, :undef, :undef, :undef, :undef, :undef, :undef)))
@fastmath b = min(a, b)
return @inbounds b[1]
end

@fastmath function findSceneIntersection(r, hittable_list, tmin, tmax)
hitT = SIMD.Vec{N, T}(tmax)
laneIndices = SIMD.Vec{N, Int32}(Int32.((1, 2, 3, 4, 5, 6, 7, 8)))
minIndex = SIMD.Vec{N, Int32}(0)

@inbounds @fastmath for lane in LoopVecRange{N}(hittable_list.Sphere, unsafe=true)
cox = hittable_list.Sphere.centre.x[lane] - r.origin.x
coy = hittable_list.Sphere.centre.y[lane] - r.origin.y
coz = hittable_list.Sphere.centre.z[lane] - r.origin.z

neg_half_b = r.direction.x * cox + r.direction.y * coy
neg_half_b += r.direction.z * coz

c = cox*cox + coy*coy
c += coz*coz

quarter_discriminant = neg_half_b^2 - c
isDiscriminantPositive = quarter_discriminant > 0

if any(isDiscriminantPositive)
@fastmath sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0

root = neg_half_b - sqrtd
root2 = neg_half_b + sqrtd

t = vifelse(root > tmin, root, root2)

newMinT = isDiscriminantPositive & (tmin < t) & (t < hitT)
hitT = vifelse(newMinT, t, hitT)
minIndex = vifelse(newMinT, laneIndices, minIndex)
end

laneIndices += N
end

minHitT = hor_min(hitT)

if minHitT < tmax
@inbounds i = minIndex[trailing_zeros(getBits(hitT == minHitT)) + 1]

return minHitT, i
else
return minHitT, 0
end
end
``````

The version using LoopVectorization.jl took about 100-110ns I believe

``````@inline function maybecompute(
neg_half_b::VectorizationBase.Vec{W,T},
quarter_discriminant::VectorizationBase.Vec{W,T},
tmax::VectorizationBase.Vec{W,T},
tmin::VectorizationBase.Vec{W,T}
) where {W,T}
m = quarter_discriminant > 0
!VectorizationBase.vany(m) && return tmax

sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give NaN

root = neg_half_b - sqrtd
root2 = neg_half_b + sqrtd

t = ifelse(root > tmin, root, root2)
t = ifelse(m & (tmin < t), t, tmax)

return t
end

@inline function maybecompute(x::VecUnroll, y::VecUnroll, z::VecUnroll, tmin::VectorizationBase.Vec{W,T}) where {W, T}
VecUnroll(
VectorizationBase.fmap(
maybecompute,
VectorizationBase.data(x),
VectorizationBase.data(y),
VectorizationBase.data(z),
tmin,
)
)
end

function findSceneIntersection_maybecompute(ray, hittable_list, tmin::T, tmax::T)
besti::Int32 = 0
x = VectorizationBase.Vec{8,T}(tmin)

@turbo for i in eachindex(hittable_list.Sphere)
cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
coy = hittable_list.Sphere.centre.y[i] - ray.origin.y
coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

quarter_discriminant = neg_half_b^2 - c
t = maybecompute(neg_half_b, quarter_discriminant, tmax, x)

newMinT = t < tmax
tmax = ifelse(newMinT, t, tmax)
besti = ifelse(newMinT, i, besti)
end

return tmax, besti
end
``````
1 Like

Have a runable example? I’m curious why LV would be that much slower. I don’t think `maybecompute` should be that bad.

``````using StaticArrays, LinearAlgebra, BenchmarkTools, StructArrays, LoopVectorization, VectorizationBase

const T = Float32
const Point = SVector{3, T}

@kwdef struct Ray
origin::Point = zeros(Point)
direction::Point = Point(0, 1, 0)
end

struct Sphere
centre::Point
end

@kwdef struct Scene{T}
Sphere::T = []
end

function StructArrays.staticschema(::Type{Point})
# Define the desired names and eltypes of the "fields"
return NamedTuple{(:x, :y, :z), fieldtypes(Point)...}
end

StructArrays.component(m::Point, key::Symbol) = getproperty(m, key)
StructArrays.createinstance(::Type{Point}, args...) = Point(args)

N = 296
spheres = Sphere.([Point(1, 1, 1) * v for v in range(-100, 100, N)], range(5, 50, N))
scene = Scene(StructArray(spheres, unwrap = T -> !(T<:Real)));
ray = Ray()

@inline function maybecompute(
neg_half_b::VectorizationBase.Vec{W,T},
quarter_discriminant::VectorizationBase.Vec{W,T},
tmax::VectorizationBase.Vec{W,T},
tmin::VectorizationBase.Vec{W,T}
) where {W,T}
m = quarter_discriminant > 0
!VectorizationBase.vany(m) && return tmax

sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give NaN

root = neg_half_b - sqrtd
root2 = neg_half_b + sqrtd

t = ifelse(root > tmin, root, root2)
t = ifelse(m & (tmin < t), t, tmax)

return t
end

@inline function maybecompute(x::VecUnroll, y::VecUnroll, z::VecUnroll, tmin::VectorizationBase.Vec{W,T}) where {W, T}
VecUnroll(
VectorizationBase.fmap(
maybecompute,
VectorizationBase.data(x),
VectorizationBase.data(y),
VectorizationBase.data(z),
tmin,
)
)
end

function findSceneIntersection_maybecompute(ray, hittable_list, tmin::T, tmax::T)
besti::Int32 = 0
x = VectorizationBase.Vec{8,T}(tmin)

@turbo for i in eachindex(hittable_list.Sphere)
cox = hittable_list.Sphere.centre.x[i] - ray.origin.x
coy = hittable_list.Sphere.centre.y[i] - ray.origin.y
coz = hittable_list.Sphere.centre.z[i] - ray.origin.z

neg_half_b = ray.direction.x * cox + ray.direction.y * coy + ray.direction.z * coz
c = cox^2 + coy^2 + coz^2 - hittable_list.Sphere.radius[i]^2

quarter_discriminant = neg_half_b^2 - c
t = maybecompute(neg_half_b, quarter_discriminant, tmax, x)

newMinT = t < tmax
tmax = ifelse(newMinT, t, tmax)
besti = ifelse(newMinT, i, besti)
end

return tmax, besti
end

findSceneIntersection_maybecompute(ray, scene, T(1e-4), T(Inf))
@benchmark findSceneIntersection_maybecompute(\$ray, \$scene, \$(T(1e-4)), \$(T(Inf)))

using SIMD

@generated function getBits(mask::SIMD.Vec{N, Bool}) where N #This reverses the bits
s = """
%mask = trunc <\$N x i8> %0 to <\$N x i1>
%res = bitcast <\$N x i1> %mask to i\$N
ret i\$N %res
"""
return :(
\$(Expr(:meta, :inline));
)
end

function hor_min(x::SIMD.Vec{8, T}) where T
@fastmath a = shufflevector(x, Val((4, 5, 6, 7, :undef, :undef, :undef, :undef))) # high half
@fastmath b = min(a, x)
@fastmath a = shufflevector(b, Val((2, 3, :undef, :undef, :undef, :undef, :undef, :undef)))
@fastmath b = min(a, b)
@fastmath a = shufflevector(b, Val((1, :undef, :undef, :undef, :undef, :undef, :undef, :undef)))
@fastmath b = min(a, b)
return @inbounds b[1]
end

@fastmath function findSceneIntersection(r, hittable_list, tmin, tmax)
N = 8
hitT = SIMD.Vec{N, T}(tmax)
laneIndices = SIMD.Vec{N, Int32}(Int32.((1, 2, 3, 4, 5, 6, 7, 8)))
minIndex = SIMD.Vec{N, Int32}(0)

lane = VecRange{N}(1)
@inbounds @fastmath while lane.i <= length(hittable_list.Sphere)
cox = hittable_list.Sphere.centre.x[lane] - r.origin.x
coy = hittable_list.Sphere.centre.y[lane] - r.origin.y
coz = hittable_list.Sphere.centre.z[lane] - r.origin.z

neg_half_b = r.direction.x * cox + r.direction.y * coy
neg_half_b += r.direction.z * coz

c = cox*cox + coy*coy
c += coz*coz

quarter_discriminant = neg_half_b^2 - c
isDiscriminantPositive = quarter_discriminant > 0

if any(isDiscriminantPositive)
@fastmath sqrtd = sqrt(quarter_discriminant) # When using fastmath, negative values just give 0

root = neg_half_b - sqrtd
root2 = neg_half_b + sqrtd

t = vifelse(root > tmin, root, root2)

newMinT = isDiscriminantPositive & (tmin < t) & (t < hitT)
hitT = vifelse(newMinT, t, hitT)
minIndex = vifelse(newMinT, laneIndices, minIndex)
end

laneIndices += N
lane += N
end

minHitT = hor_min(hitT)

if minHitT < tmax
@inbounds i = minIndex[trailing_zeros(getBits(hitT == minHitT)) + 1]

return minHitT, i
else
return minHitT, Int32(0)
end
end

findSceneIntersection(ray, scene, T(1e-4), T(Inf))
@benchmark findSceneIntersection(\$ray, \$scene, \$(T(1e-4)), \$(T(Inf)))
``````

The median/mean time for `findSceneIntersection_maybecompute` was 105ns and 82ns for `findSceneIntersection`. I don’t really see why `findSceneIntersection_maybecompute` is slower, maybe the reduction from the vector of `besti` to a scalar is slow though I would be surprised if it was 20ns slower. There’s also that `findSceneIntersection_maybecompute` has an extra `ifelse`, I’m a bit confused by what’s going on in the llvm ir for that

``````L194:                                             ; preds = %L185, %L127
%value_phi24 = phi <8 x float> [ %res.i320, %L185 ], [ %value_phi1454, %L127 ]
%m.i317 = fcmp reassoc nsz arcp contract ogt <8 x float> %res.i336, zeroinitializer
%res.i316 = sext <8 x i1> %m.i317 to <8 x i32>
%res.i314 = bitcast <8 x i32> %res.i316 to <8 x float>
%41 = call i32 @llvm.x86.avx.vtestz.ps.256(<8 x float> %res.i314, <8 x float> %res.i314)
%.not428 = icmp eq i32 %41, 0
br i1 %.not428, label %L206, label %L219

L206:                                             ; preds = %L194
%res.i313 = call fast <8 x float> @llvm.sqrt.v8f32(<8 x float> %res.i336)
%res.i312 = fsub nsz contract <8 x float> %res.i363, %res.i313
%res.i311 = fadd nsz contract <8 x float> %res.i363, %res.i313
%m.i309 = fcmp reassoc nsz arcp contract ogt <8 x float> %res.i312, %29
%res.i308 = select reassoc nsz arcp contract <8 x i1> %m.i309, <8 x float> %res.i312, <8 x float> %res.i311
%m.i305 = fcmp reassoc nsz arcp contract olt <8 x float> %29, %res.i308
%combinedmask1.i304429 = and <8 x i1> %m.i305, %m.i317
%res.i303 = select reassoc nsz arcp contract <8 x i1> %combinedmask1.i304429, <8 x float> %res.i308, <8 x float> %value_phi2456
br label %L219

L219:                                             ; preds = %L206, %L194
%value_phi26 = phi <8 x float> [ %res.i303, %L206 ], [ %value_phi2456, %L194 ]
%m.i300 = fcmp reassoc nsz arcp contract olt <8 x float> %value_phi24, %value_phi1454
%m.i298 = fcmp reassoc nsz arcp contract olt <8 x float> %value_phi26, %value_phi2456
%res.i297 = select reassoc nsz arcp contract <8 x i1> %m.i300, <8 x float> %value_phi24, <8 x float> %value_phi1454
%res.i295 = select reassoc nsz arcp contract <8 x i1> %m.i298, <8 x float> %value_phi26, <8 x float> %value_phi2456
%42 = trunc i64 %value_phi453 to i32
%ie.i290 = insertelement <8 x i32> undef, i32 %42, i64 0
%v.i291 = shufflevector <8 x i32> %ie.i290, <8 x i32> undef, <8 x i32> zeroinitializer
%res.i292 = add nsw <8 x i32> %v.i291, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%43 = add i32 %42, 8
%ie.i287 = insertelement <8 x i32> undef, i32 %43, i64 0
%v.i288 = shufflevector <8 x i32> %ie.i287, <8 x i32> undef, <8 x i32> zeroinitializer
%res.i289 = add nsw <8 x i32> %v.i288, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%res.i286 = select <8 x i1> %m.i300, <8 x i32> %res.i292, <8 x i32> %value_phi20458
%res.i284 = select <8 x i1> %m.i298, <8 x i32> %res.i289, <8 x i32> %value_phi21459
%res.i282 = add nuw nsw i64 %value_phi453, 16
%.not460 = icmp sgt i64 %res.i282, %res.i281
br i1 %.not460, label %L234, label %L127

L234:                                             ; preds = %L219
%m.i279 = fcmp reassoc nsz arcp contract olt <8 x float> %res.i297, %res.i295
%res.i278 = select reassoc nsz arcp contract <8 x i1> %m.i279, <8 x float> %res.i297, <8 x float> %res.i295
%res.i272 = select <8 x i1> %m.i279, <8 x i32> %res.i286, <8 x i32> %res.i284
br label %L242
``````

The block L219 seems like a lot of extra instructions which the SIMD version doesn’t include.

This is off-topic, but this code is quite confusing. First you use `T` as an alias for `Float32`, but then later `T` is a type parameter which is unrelated to the previous `T` (also, there is an entrenched convention for using `T` as a type parameter). Similarly, `Sphere` is both a type and a field inside a different type. It may work, but one starts to wonder if things work as intended.

2 Likes