It is not. C++ has different semantics from Julia.
C++ automatically inserts function calls at key points (e.g., when a function receives an argument by value, or when an object leaves scope).
Julia does not.
Thus, by simply defining a few methods in C++, you can automate the memory allocation behavior in any way you like.
But to replicate that behavior in Julia, these function calls must be inserted manually everywhere, so the only practical options in Julia are fully manual memory management or relying on the GC.
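To make the contrast concrete, here is roughly what the manual side looks like in Julia (a sketch only; Buffer, acquire and release! are made-up names standing in for whatever a C++ destructor would clean up automatically):

# Hypothetical resource type; release! plays the role of a C++ destructor,
# but nothing calls it for you.
mutable struct Buffer
    ptr::Ptr{Cvoid}
end

acquire(nbytes) = Buffer(Libc.malloc(nbytes))   # explicit allocation
release!(b::Buffer) = Libc.free(b.ptr)          # explicit deallocation

function use_buffer(nbytes)
    buf = acquire(nbytes)
    try
        # ... work with buf.ptr ...
        return nbytes
    finally
        release!(buf)   # in C++ this call would be inserted automatically at scope exit
    end
end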
For me it’s 7.7 ms, i.e. 0.169365/0.007689 = 22x faster! Since the original C++ code was “almost 9 times faster”, I assume the C++ is now about 2.4x slower, so I suggest the thread be marked as solved with your solution! I still think it could maybe be made faster with an alternative vector type (also in C++), and now I’m wondering what’s wrong with the C++ code… but I’m not going to spend time on that!
I’m not sure; maybe this was just an exercise for the new Julia user, who forked some C++ code from a Python project. I think it’s fair game to use code people have helped with on Discourse (they should give you credit), and if it’s better than the Julia clustering packages, hopefully we see it merged into some Julia clustering package.
This is why I mentioned bitstypes. In Julia you can use manual memory management, or, if calling C or C++, use it indirectly through that code, and then there’s no issue with the GC, since it’s bypassed (for vectors of e.g. integers).
I don’t know how Julia’s GC scans stack frames, but I believe it can and does do so conservatively (at least in some cases, to call C code), so it’s ok if some integers look like pointers.
https://groups.google.com/g/julia-users/c/JeuJ-QUuLJA/m/xNTgRR31ImcJ
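Concretely, for a vector of a bitstype the GC can be taken out of the picture entirely (a minimal sketch, only for illustration; own = false tells Julia not to free the memory itself):

n = 10^6
p = Libc.malloc(n * sizeof(Int64))                      # memory the GC never tracks
v = unsafe_wrap(Array, Ptr{Int64}(p), n; own = false)   # a normal Vector{Int64} view of it
fill!(v, 0)
# ... use v; no GC pressure from this allocation ...
Libc.free(p)                                            # manual free; don't touch v afterwards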
Ah, that one I didn’t notice. That’s clearly where a major share of the cycles is going.
So, @Adegasel, whereas in many languages type-annotating function arguments is important, that’s not the case in Julia. Unless you have functions with the same name but different argument types, you need not annotate with types. On the other hand, type annotation in structs is extremely important for performance.
The Array{Float32} is an abstract type: it can hold a vector (an Array{Float32, 1}), a matrix (an Array{Float32, 2}), or a higher-dimensional array. All the compiler knows when calc is compiled is that cc holds some kind of arrays of Float32. It has to figure out at runtime, every time cc.cumsum/cc.cumsum2 is accessed, the dimension of the array, and work out the right indexing method for that kind of array.
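To make that concrete (struct and field names here are just for illustration, not from the thread’s code):

struct SlowCC                    # abstract field type: dimensionality unknown to the compiler
    cumsum::Array{Float32}       # could be 1-D, 2-D, ... resolved only at runtime
end

struct FastCC                    # concrete field type: Vector{Float32} == Array{Float32, 1}
    cumsum::Vector{Float32}
end

# Or keep it generic while staying concrete for each instance:
struct GenericCC{A<:AbstractArray{Float32}}
    cumsum::A
end

The last, parametric form is what you’d reach for if the struct should accept more than one array type without paying the abstract-field penalty.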
There’s a bunch of side talk about GCs and other more or less random stuff unrelated to the problem at hand here. As almost always, the solution is just to write completely normal Julia code that follows the performance guide as shown in C++ code much faster than Julia how can I optimize it? - #20 by pabloferz. I would suggest the other discussion gets moved to its own thread since it is quite distracting and irrelevant.
Removing all use of push! and nearly all allocations in the inner loops shaves this down to 3 ms for me, from 5 ms for the version of @pabloferz.
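The general pattern is the usual one (a sketch with made-up keep/candidates, not the actual code): replace push! into a freshly grown vector with a preallocated buffer and a fill counter that can be reused across calls:

candidates = Int32.(1:100)
keep(col) = isodd(col)          # hypothetical filter, just for illustration

# push!-style: allocates and grows inside the hot loop
cols = Int32[]
for col in candidates
    keep(col) && push!(cols, col)
end

# preallocated-buffer style: the buffer is allocated once and reused
cols_buf = Vector{Int32}(undef, length(candidates))
ncols = 0
for col in candidates
    if keep(col)
        ncols += 1
        cols_buf[ncols] = col
    end
end
used = view(cols_buf, 1:ncols)  # the filled prefix, without copying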
KMeans1D
Edit: somehow this wasn’t the right code. But the one below is.
And using a Vector{Int32} instead of a Dict gets it down to 2.1 ms. So probably >5x faster than the C++?
Kmeans1d no Dict
module Kmeans1D

function smawk!(result, rows, cols, allocs, lookup, k)
    length(rows) == 0 && return result
    (; cols_alloc, odd_rows, col_idx_lookup) = allocs[1]

    ## REDUCE
    ncols = 0
    @inbounds for col in cols
        while true
            ncols == 0 && break
            row = rows[ncols]
            if lookup(k, row, col) >= lookup(k, row, cols_alloc[ncols])
                break
            end
            ncols -= 1
        end
        if ncols < length(rows)
            ncols += 1
            cols_alloc[ncols] = col
        end
    end
    _cols = view(cols_alloc, Base.OneTo(ncols))

    # Call recursively on odd-indexed rows
    @inbounds for i in 2:2:length(rows)
        odd_rows[i >> 1] = rows[i]
    end
    smawk!(result, odd_rows, _cols, view(allocs, 2:lastindex(allocs)), lookup, k)

    @inbounds for idx in 1:ncols
        col_idx_lookup[_cols[idx]] = idx
    end

    ## INTERPOLATE
    # Fill-in even-indexed rows
    start = 1
    @inbounds for r in 1:2:length(rows)
        row = rows[r]
        stop = length(_cols) - 1
        if r < (length(rows) - 1)
            stop = col_idx_lookup[result[rows[r + 1]]]
        end
        argmin = _cols[start]
        min = lookup(k, row, argmin)
        for c in start+1:stop+1
            value = lookup(k, row, _cols[c])
            if (c == start) || (value < min)
                argmin = _cols[c]
                min = value
            end
        end
        result[row] = argmin
        start = stop
    end

    return result
end
struct CostCalculator
    cumsum::Vector{Float32}
    cumsum2::Vector{Float32}

    function CostCalculator(array, n::Integer)
        cumsum = zeros(n+1)
        cumsum2 = zeros(n+1)
        @inbounds for i in 1:n
            x = array[i]
            cumsum[i+1] = x + cumsum[i]
            cumsum2[i+1] = x * x + cumsum2[i]
        end
        @assert length(cumsum) == length(cumsum2) == n + 1
        return new(cumsum, cumsum2)
    end
end

function calc(cc::CostCalculator, i, j)
    if j < i
        return zero(eltype(cc.cumsum))
    end
    @inbounds begin
        mu = (cc.cumsum[j + 1] - cc.cumsum[i]) / (j - i + 1)
        result = cc.cumsum2[j + 1] - cc.cumsum2[i]
        result += (j - i + 1) * (mu * mu)
        result -= (2 * mu) * (cc.cumsum[j + 1] - cc.cumsum[i])
    end
    return result
end

struct LookUp
    calculator::CostCalculator
    D::Matrix{Float32}
end

function (lu::LookUp)(k, i, j)
    col = min(i, j - 1)
    if col == 0
        col = size(lu.D, 2) + col
    end
    return @inbounds lu.D[k - 1, col] + calc(lu.calculator, j, i)
end
function cluster(array, k)
    n = length(array)
    return cluster(array, n, min(k, n))
end

function cluster(array, n, k)
    # Sort input array and save info for de-sorting
    sort_idx = sortperm(array)
    undo_sort_lookup = Vector{Int32}(undef, n)
    sorted_array = Vector{Float32}(undef, n)
    @inbounds for i in 1:n
        sorted_array[i] = array[sort_idx[i]]
        undo_sort_lookup[sort_idx[i]] = i
    end

    # Set D and T using dynamic programming algorithm
    cost_calculator = CostCalculator(sorted_array, n)
    D = Matrix{Float32}(undef, k, n)
    T = Matrix{Int32}(undef, k, n)
    lookup = LookUp(cost_calculator, D)
    @inbounds for i in 1:n
        D[1, i] = calc(cost_calculator, 1, i)
        T[1, i] = 1
    end

    row_argmins = Vector{Int32}(undef, n)
    rows = Int32.(1:n)
    cols = Int32.(1:n)
    allocs = NamedTuple{(:odd_rows, :cols_alloc, :col_idx_lookup), Tuple{Vector{Int32}, Vector{Int32}, Vector{Int32}}}[]
    l = length(rows)
    while true
        l == 0 && break
        odd_rows = Vector{Int32}(2:2:l)
        cols_alloc = zeros(Int32, l)
        col_idx_lookup = Vector{Int32}(undef, length(rows))
        push!(allocs, (; odd_rows, cols_alloc, col_idx_lookup))
        l ÷= 2
    end

    for k_ in 2:k
        smawk!(row_argmins, rows, cols, allocs, lookup, k_)
        @inbounds for i in 1:n
            argmin = row_argmins[i]
            min = lookup(k_, i, argmin)
            D[k_, i] = min
            T[k_, i] = argmin
        end
    end

    # Extract cluster assignments by backtracking
    centroids = zeros(k)
    sorted_clusters = Vector{Int32}(undef, n)
    t = n + 1
    k_ = k
    n_ = n
    @inbounds while t > 1
        t_ = t
        t = T[k_, n_]
        centroid = 0.0
        for i in t:t_-1
            sorted_clusters[i] = k_
            centroid += (sorted_array[i] - centroid) / (i - t + 1)
        end
        centroids[k_] = centroid
        k_ -= 1
        n_ = t - 1
    end

    clusters = Vector{Int32}(undef, n)
    @inbounds for i in 1:n
        clusters[i] = sorted_clusters[undo_sort_lookup[i]]
    end
    return centroids, clusters
end

end # module
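For reference, calling it looks something like this (random Float32 data; the sizes and k = 4 are just an example):

x = rand(Float32, 100_000)
centroids, clusters = Kmeans1D.cluster(x, 4)
# centroids[i] is the mean of cluster i; clusters[j] ∈ 1:4 is the cluster of x[j]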
I’m seeing lots of @inbounds for i in 1:n without any type restriction on the array type, which seems unsafe.
Sure, feel free to add a one-based indexing check at the start, or a conversion step? But mostly those loops are over Vectors defined in the code, independent of the original array type.
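For the record, the one-based-indexing check is a one-liner; a sketch (the wrapper name checked_cluster is made up):

# Hypothetical wrapper; rejects OffsetArray-style inputs before any @inbounds loop runs.
function checked_cluster(array, k)
    Base.require_one_based_indexing(array)
    return Kmeans1D.cluster(array, k)
end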
I don’t see how you can comfortably add @inbounds to something like this. I would not recommend @inbounds at all, except in tightly controlled situations.
This mostly isn’t my code; I just stuck with what @pabloferz did, mostly putting @inbounds everywhere (otherwise how would you see the comparative improvement from removing allocations?). And these actually are tightly controlled conditions: all of those objects are allocated and assigned to internally.
Everyone is free to post their own versions here too
(You can remove them all for a time of 2.8 ms, which is still pretty good. But the algorithm seems fine, because there are no errors when doing this.)
Just commenting on your comment on the comment about the code.
With triply nested indexing, you need a pretty steady hand to ensure nothing goes out of bounds. It seems risky with regular Arrays, let alone general indexable collections.
They’re not general indexable collections; they’re basically all internally allocated Vectors and ranges, except at the outermost level at the start.
I don’t understand why you’re doubling down on this, instead of acknowledging that this is a bad use of @inbounds even with regular Arrays. It’s a triply nested index! There’s a zillion ways to mess that up. I would never use @inbounds in my own code for that.
Because this is a performance comparison with C++?
(And like I said, feel free to post your version; I even timed it and said it was fine, above. I’m just trying to make this run fast and have a little fun here.)
I don’t have my own version to post, and even if I did, posting it wouldn’t help warn against dangerous practices.
I do think there’s a reckless culture surrounding the use of @inbounds, and that’s what I’m reacting to. But since you seem to fundamentally disagree, I’ll give up the back-and-forth.
I think the original idea with using it is to make this a fair fight with C++, which doesn’t do bounds checking.
It’s great that you encourage not using @inbounds, and I fully agree it should mostly not be used. But in a performance comparison with C++, not using @inbounds is giving ourselves an unnecessary handicap. Here, it’s 25% slower.
There is no need to collect these ranges, as far as I can see, is there?
Good catch, that function isn’t actually used anymore, only the in-place version where I don’t use collect. I’ll edit the code.
…unless you use eachindex, which you don’t seem to use at all.
This isn’t actually my code, again… I’m just removing the allocations in the code higher up to make it faster, not fixing everything.
In some cases we are looping over a step range, so we can’t use eachindex. But in others, eachindex can totally be used; please post a version with that…
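For one place where it clearly applies, here is the de-sorting loop from cluster as its own little function (a sketch; desort! is just an illustrative name):

# eachindex variant of the de-sorting loop near the end of cluster.
function desort!(clusters, sorted_clusters, undo_sort_lookup)
    # eachindex(clusters, undo_sort_lookup) checks the two arrays share axes,
    # so indexing them with i is safe even under @inbounds; the indirect
    # sorted_clusters[undo_sort_lookup[i]] access is the part that still needs care.
    @inbounds for i in eachindex(clusters, undo_sort_lookup)
        clusters[i] = sorted_clusters[undo_sort_lookup[i]]
    end
    return clusters
end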