Consider a toy algorithm that finds the maximum of a vector, then removes it and alters the other entries of the vector in a way that depends on whether they are located at a lower or higher index.
(N.B. I posted a simpler version of this example a few days ago but made a very silly mistake in the benchmark.)
In principle, I expect that using two for loops (one for the indices before the maximum, and one for those after) to update the entries of the vector individually should be faster than creating a new vector from scratch (using collect and a ternary operator to check indices). However, this is not the case in the following example:
import Base.isless
# User-facing struct that holds the instance data
struct MyDataFrame
    m::Integer
    xs::Vector{<:Real}
    ys::Vector{<:Real}
    xys::Vector{<:Real}  # = xs .* ys
    omxs::Vector{<:Real} # = 1 .- xs
    function MyDataFrame(xs::Vector{<:Real}, ys::Vector{<:Real})
        m = length(xs)
        return new(m, xs, ys, xs .* ys, 1 .- xs)
    end
end
# Struct used internally in manipulatedata()
struct DataPoint
    j::Int
    x::Real
    y::Real
    xy::Real  # = x * y
    omx::Real # = 1 - x
end
# Overload isless() so that DataPoints sort on xy
isless(c1::DataPoint, c2::DataPoint) = isless(c1.xy, c2.xy)
function manipulatedata(mdf::MyDataFrame, coll::Bool = true)
    # Convert the user-supplied MyDataFrame into a vector of DataPoints
    datapoints = DataPoint[DataPoint(j, mdf.xs[j], mdf.ys[j], mdf.xys[j], mdf.omxs[j]) for j in 1:mdf.m]
    for _ in 1:mdf.m
        dp_best, idx_best = findmax(datapoints)
        if coll # This way is faster
            deleteat!(datapoints, idx_best)
            datapoints[:] = collect(
                DataPoint(
                    c.j,
                    c.x,
                    c.j < dp_best.j ? c.y * dp_best.omx : c.y - dp_best.xy,
                    c.j < dp_best.j ? c.xy * dp_best.omx : c.xy - c.x * dp_best.xy,
                    c.omx
                )
                for c in datapoints
            )
        else # This way is slower
            for i in 1:idx_best-1
                datapoints[i] =
                    DataPoint(
                        datapoints[i].j,
                        datapoints[i].x,
                        datapoints[i].y * dp_best.omx,
                        datapoints[i].xy * dp_best.omx,
                        datapoints[i].omx
                    )
            end
            for i in idx_best+1:length(datapoints)
                datapoints[i] =
                    DataPoint(
                        datapoints[i].j,
                        datapoints[i].x,
                        datapoints[i].y - dp_best.xy,
                        datapoints[i].xy - datapoints[i].x * dp_best.xy,
                        datapoints[i].omx
                    )
            end
            deleteat!(datapoints, idx_best)
        end
    end
    return nothing
end
function timer(m)
    rectlist = MyDataFrame(rand(m), rand(m))
    @time r1 = manipulatedata(rectlist, true)
    @time r2 = manipulatedata(rectlist, false)
    @assert r1 == r2
end
timer(2000)
Results:
1.387089 seconds (17.99 M allocations: 416.908 MiB, 6.05% gc time, 9.47% compilation time)
4.455283 seconds (49.99 M allocations: 1.667 GiB, 9.34% gc time)
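(These timings come from plain @time, so the first line also includes compilation. A BenchmarkTools version of the same comparison, which would exclude compilation, would look something like the snippet below; the numbers above were not obtained this way.)

using BenchmarkTools
mdf = MyDataFrame(rand(2000), rand(2000))
@btime manipulatedata($mdf, true)   # collect-based version
@btime manipulatedata($mdf, false)  # two-for-loop version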
Why might this be?
Note that all of the structs defined in the example above appear to be necessary to obtain this unusual performance result. If, for example, instead of taking a MyDataFrame as the argument to manipulatedata() you take the vectors x and y (and modify the line datapoints = ... accordingly), then the for loop becomes faster, as expected; see the sketch below.
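Concretely, that variant would look roughly like this (the name manipulatedata_vectors and the concrete Float64 element type are just for illustration; only the signature and the datapoints = ... line change):

# Variant that takes the raw vectors instead of a MyDataFrame
function manipulatedata_vectors(xs::Vector{Float64}, ys::Vector{Float64}, coll::Bool = true)
    m = length(xs)
    datapoints = DataPoint[DataPoint(j, xs[j], ys[j], xs[j] * ys[j], 1 - xs[j]) for j in 1:m]
    # ... the findmax / update / deleteat! loop is identical to manipulatedata() above
    return nothing
end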
Also, the for loop becomes faster if the eltype of the vector is a primitive like Float64 rather than the custom struct DataPoint (with its custom overload of isless()).
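A minimal sketch of such a reduced test (with a made-up update rule, just to show the shape of the code; the function name is a placeholder, not part of the real application) could look like this:

# Reduced comparison on a plain Vector{Float64}, with no DataPoint struct involved
function manipulatefloats(v::Vector{Float64}, coll::Bool = true)
    vals = copy(v)
    for _ in 1:length(v)
        val_best, idx_best = findmax(vals)
        if coll
            deleteat!(vals, idx_best)
            # Rebuild the whole vector with collect and a ternary on the index
            vals[:] = collect(i < idx_best ? x * val_best : x - val_best
                              for (i, x) in enumerate(vals))
        else
            # Update the entries before and after the maximum in two explicit loops
            for i in 1:idx_best-1
                vals[i] *= val_best
            end
            for i in idx_best+1:length(vals)
                vals[i] -= val_best
            end
            deleteat!(vals, idx_best)
        end
    end
    return nothing
end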
(The j field of DataPoint may appear superfluous, but it is essential in my real application because it serves as a consistent label that each data point retains even as the indices of datapoints vary.)