C++ code much faster than Julia: how can I optimize it?

I’m trying to adapt this C++ code to Julia (here is my translation). However, the original C++ code is much faster than my Julia version. How can I optimize it?

Here is one important point:

    odd_rows = []

This creates a Vector{Any}, which means it can contain anything, and that is slow. Use odd_rows = eltype(rows)[] instead.

The same goes for

    col_idx_lookup = Dict()

This should be Dict{Int,Int}() as far as I can see in the code. (I’m not sure if using a dict here is the best alternative, instead of a plain vector, for instance.)

(The same ideas apply to _cols = [] and to the Dict() that is returned when length(rows) == 0, at the beginning.)

Fixing these type instabilities will probably speed up the code a lot.

Edit: these are examples of what the abstract containers section of the performance tips warns about.
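For example (a small sketch, with rows standing in for the vector from the code):

    rows = Int32[1, 2, 3, 4]

    odd_rows_slow = []               # Vector{Any}: element type unknown, slow
    odd_rows      = eltype(rows)[]   # Vector{Int32}: concrete element type

    lookup_slow = Dict()             # Dict{Any, Any}: slow
    lookup      = Dict{Int, Int}()   # concrete key and value types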

4 Likes

In build_cumsum and the odd_rows loop you are using push! instead of just preallocating and using setindex!.

Don’t use push! if you don’t have to. In both cases you know the length of the final array.
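For odd_rows, that could look something like this sketch (assuming rows is a Vector{Int32}, as elsewhere in the code):

    # The number of odd rows is known in advance, so preallocate
    odd_rows = Vector{eltype(rows)}(undef, length(rows) ÷ 2)
    for (idx, i) in enumerate(2:2:length(rows))
        odd_rows[idx] = rows[i]   # setindex! instead of push!
    end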

I also imagine you have type instabilities. Better to break these big functions into smaller ones with consistent return types. And use @code_warntype or Cthulhu.@descend to make sure everything is stable.

Thanks! I will check it out. However, the main issue seems to be the way in which this matrix C is defined in the original C++ code:

for (ulong k_ = 1; k_ < k; ++k_) {
    auto C = [&D, &k_, &cost_calculator](ulong i, ulong j) -> double {
        ulong col = i < j - 1 ? i : j - 1;
        return D.get(k_ - 1, col) + cost_calculator.calc(j, i);
    };
I have only been able to define C using an ad-hoc method, which takes much longer (benchmark screenshot omitted), using:

function C_builder(C, D, k, n, cost_calculator)
    for i in 1:n
        for j in 1:n
            col = min(i, j - 1)
            if col < 1
                col = n + col
            end
            C[i, j] = D[k-1, col] + calc(cost_calculator, j, i)
        end
    end
    return C
end

Is there maybe an equivalent to the “auto” initializer in Julia?
That is, can I capture D, k_, and cost_calculator the way the C++ lambda does? Something like this anonymous function (untested):
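    C = (i, j) -> begin
        col = min(i, j - 1)
        col < 1 && (col = n + col)
        D[k_ - 1, col] + calc(cost_calculator, j, i)
    end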

A small hint for getting help: post code inline here between code fences ```

Like:

julia_code(here)

That way indentation and highlighting are nice. Your code could also be about half as long; you will get more help if it’s easier to edit, fits in a comment here, and runs as an MWE.


4 Likes

Without running the code (you will also get better advice if you post a minimal running example here), I would bet that the main difference is not there. Avoiding abstract containers and global variables is the first thing to take care of when writing performant Julia code.

That loop specifically may be slightly faster if you run over the i indices in the inner loop (Julia arrays are column-major, so the first index should vary fastest), but that is a minor improvement relative to the other suggestions, whose relevance needs to be tested in practice.
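That is, something like this sketch of the C_builder loop above, with the nesting swapped:

    for j in 1:n
        for i in 1:n   # i innermost: C[i, j] is then written in memory order
            col = min(i, j - 1)
            col < 1 && (col = n + col)
            C[i, j] = D[k-1, col] + calc(cost_calculator, j, i)
        end
    end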

2 Likes

Thanks for the advice; I have made some modifications accordingly (here). However, I am still getting considerably slower execution times in Julia than with the C++ version (wrapped with Python). In a test available here,

X = rand(Float32, 1000)
k = 32
@btime Kmeans1D.cluster_julia(X,k)

I am getting an execution time of 59.010 ms, while the original version is almost 9 times faster (benchmark screenshot omitted).

How could I further optimize my Julia code? I am quite new to Julia, so any comment would be much appreciated 🙂

I think it might be a good idea to review the performance tips in the documentation. For example, it recommends against the use of global variables. In your code, you have the following:

global D = Matrix{Float32}(undef, k, n)

global cost_calculator = CostCalculator()

This is typically bad form in any language, but it also has the downside of poor performance in Julia, as the compiler cannot reliably infer the type of global variables.
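At a minimum, marking them const fixes their types for the compiler; a sketch (this assumes k and n are already defined at that point, and passing them as function arguments is better still):

    const cost_calculator = CostCalculator()
    const D = Matrix{Float32}(undef, k, n)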

Your calc function might also return different types depending on a condition: the early-return branch returns an integer, while the rest of the code, depending on inputs, returns a float.

function calc(cc::CostCalculator, i, j)
    if j<i
        return 0
    end
    mu = (cc.cumsum[j + 1] - cc.cumsum[i]) / (j - i + 1)
    result = cc.cumsum2[j + 1] - cc.cumsum2[i]
    result += (j - i + 1) * (mu * mu)
    result -= (2 * mu) * (cc.cumsum[j + 1] - cc.cumsum[i])

    return result
end

Although the compiler seems to deal with these cases better than in the past, it might be worth avoiding if possible.
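A type-stable sketch of the fix, returning a zero of the vectors’ element type:

    function calc(cc::CostCalculator, i, j)
        if j < i
            return zero(eltype(cc.cumsum))   # a Float32 zero, matching the other branch
        end
        mu = (cc.cumsum[j + 1] - cc.cumsum[i]) / (j - i + 1)
        result = cc.cumsum2[j + 1] - cc.cumsum2[i]
        result += (j - i + 1) * (mu * mu)
        result -= (2 * mu) * (cc.cumsum[j + 1] - cc.cumsum[i])
        return result
    end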

3 Likes

You want to profile your code:

julia> using ProfileView
julia> @profview cluster(array, n, k, cost_calculator, D)

Doing that with your code (or rather, with my slight changes, and running with bounds checks off), I found this line to be the slowest by far:

result = smawk(odd_rows, _cols, lookup, k, n)

but since your code is recursive, that’s neither very helpful nor unexpected. Eyeballing the flamegraph, the next slowest line (at only about 1/10 of the time) seems to be:

if lookup(k, rows[index], col) >= lookup(k, rows[index], _cols[end])

and then:

return D[k-1, col] + calc(cost_calculator, j, i)

from the function C_function, and since lookup is the C_function passed in, maybe the problem is there. Do C++ compilers do anything fancy with recursion that Julia’s compiler does not yet do?

2 Likes

I see in the C++ code that rows and cols are passed by const reference, as opposed to result, which is passed as a pointer:

void _smawk(
        const vector<ulong>& rows,
        const vector<ulong>& cols,
        const function<T(ulong, ulong)>& lookup,
        vector<ulong>* result) {
    // Recursion base case
    if (rows.size() == 0) return;

You can’t specify that in Julia (at the syntax level), but I find it likely that this means rows and cols are small (and you want them stack-allocated), so you may want to use StaticArrays.jl for those, to get that done.

I understand C++ can have vectors (and fixed-size arrays) on the stack; Julia can too, but only fixed-size ones for now. I recall @elrod having some advanced Julia code, inspired by LLVM, for hybrid fixed/heap vector code. So I’m not sure you can do this without it, unless there’s some (small) maximum size.

You can’t push! to an SVector or MVector from that package, and I’m not up to speed on it (or the packages building on it), but it seems SizedVector isn’t for that either. What you want is a type that stores a maximum capacity, plus a count of the elements actually used (on the stack), so that you can push! and pop!. The LLVM code I mentioned is like that, except that above some maximum it heap-allocates instead. I believe that may be the default in C++, one of its few advantages. There’s nothing inherently missing in Julia that would prevent defining such a type, and if I recall it has already been done. But it’s not done for the default arrays in Julia (yet), nor, it seems, in that package or any I know of.
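A toy sketch of the idea (illustrative only, built on StaticArrays, with no heap fallback):

    using StaticArrays

    # Fixed-capacity buffer plus a count of the elements actually in use
    mutable struct SmallVec{N,T}
        buf::MVector{N,T}
        len::Int
    end
    SmallVec{N,T}() where {N,T} = SmallVec{N,T}(MVector{N,T}(undef), 0)

    function Base.push!(v::SmallVec, x)
        v.len < length(v.buf) || error("capacity exceeded (a real SmallVector would spill to the heap)")
        v.len += 1
        v.buf[v.len] = x
        return v
    end

    Base.pop!(v::SmallVec) = (x = v.buf[v.len]; v.len -= 1; x)
    Base.length(v::SmallVec) = v.len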

I tried the @tailrec macro, but it seems ineffective, since TCO doesn’t apply here (nor then in C++). So, to answer my own question, I don’t think C++ does anything clever per se for recursion either, except the stack allocation you instruct it to do.

The C_function definitely has problems. It calls calc, which is type-unstable (which can incur allocations), and it uses two global variables, D and cost_calculator. This means that in addition to a handful of loads and some arithmetic inside calc, the C_function must figure out the type of D and find out which getindex instance to call, then figure out the type of cost_calculator and look up the right calc instance. Then it must figure out whether calc returned an Int or a Float32. All of this happens each time C_function is called in the innermost loop.

All this easily adds some hundreds of CPU cycles to the ~30 actually doing the work inside calc, making C_function an order of magnitude or so slower.

So, the return 0 in calc should be replaced by return 0f0, or return Float32(0), or return zero(eltype(cc.cumsum)).

And D and cost_calculator should not be globals, and if they are, they should be constant. Or, at the very least, they should be type-annotated when defined (and/or when used), i.e. something like

@eval cost_calculator::CostCalculator = CostCalculator()

(@eval is done at top level, so it will become global, with a fixed type). Similarly with D, i.e.

@eval D::Matrix{Float32} = Matrix{Float32}(undef,$k,$n)

(The $ is there because k and n are not defined at top-level, where the @eval is run. The $ interpolates the current value).
I do, however, not know whether these annotations will be sufficient to avoid bad things in C_function, and it’s definitely not how things should be done, but it might work.
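The cleaner alternative is to pass D and cost_calculator in explicitly, for instance as fields of a callable struct; a sketch (names assumed from your code):

    struct CFunction
        D::Matrix{Float32}
        cost_calculator::CostCalculator
    end

    function (cf::CFunction)(k, i, j)
        col = min(i, j - 1)
        col < 1 && (col = size(cf.D, 2) + col)
        return cf.D[k - 1, col] + calc(cf.cost_calculator, j, i)
    end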

Note Julia’s profiler

can profile into Julia’s core code and even (optionally) into C and Fortran libraries.

and the flamegraphs by default only show Julia code, but you can do:

julia> using ProfileView, Profile
julia> Profile.clear();
julia> @profile cluster(array, n, k, cost_calculator, D);
julia> ProfileView.view(C = true)

and then I get a similar story: smawk takes a lot of time, but calls “none, C_function: line 0”.

There is no line 0, and I assume that (a large fraction of the time) is malloc being called indirectly by C_function. Just above it, with about as much time, is “C_function: line 6”, so I assume this line is responsible:

return D[k-1, col] + calc(cost_calculator, j, i)

I missed that this line references the global D. I had taken it out of the code and passed D in, so it wasn’t a problem there, but it would still have been at the other location.

There’s also the new memory-allocation profiler, which I haven’t yet tried on your (or any) code. I believe the allocations are the source of your problem; there are quite many of them. Stack allocations are much better: yes, they are still “allocations”, but they aren’t counted, since they are (usually) not really problematic. Is there a good way in C/C++ to count the number (and amount in MB) of heap allocations?

1 Like

Julia uses pass by sharing.
In C++, you’d generally pass arguments by reference unless they’re very small, or you need to copy.

Not that advanced. Just create a StrideArray (similar to a StaticArrays.MArray), GC.@preserve it over its lifetime, and then convert to a PtrArray before passing it to any non-inlined functions.
This would also work when starting with an MArray instead of a StrideArray.
The idea is that they’ll be stack allocated if they don’t escape.

Oh, this is the llvm::SmallVector. Julia’s semantics do not allow defining such an object as efficiently or ergonomically as is possible in C++.
While a C++ package can define this (like LLVM and Boost do), that is not possible at the package level in Julia without substantial overhead, which would defeat the point.

C++ lifetimes are well defined, and destructors statically called at the end.
finalizers + Julia’s GC are slow, in comparison.

1 Like

Possibly problematic are those Any I see, and also, in grey (the colors don’t show here), Vector{Int32} and Dict{Int32, Int32}. I don’t know what grey means (vs. the usual cyan), nor where to locate those parts in the source code:

julia> @code_typed smawk(cost_calculator, D, [], [], C_function, k, n)
[..]
48 ──        invoke Base.rehash!(%150::Dict{Int32, Int32}, 1500::Int64)::Dict{Int32, Int32}
[..]
│     %178 = Base.arrayref(false, %6, %176)::Int32
│            invoke Base.setindex!(%150::Dict{Int32, Int32}, %176::Int64, %178::Int32)::Any
[..]
108 ─ %335 = π (%219, Int64)
│            Base.arrayset(false, %149, %332, %335)::Vector{Int32}
└────        goto #110
109 ─        Base.setindex!(%149, %332, %219)::Any

I recommend Cthulhu for interactive exploration.
A couple options to point out:

  • w turns warnings for type instabilities on/off (default: off).
  • o turns optimizations on/off (default: on).
  • d turns debug info, like line numbers, on/off (default: off).

The defaults (no warnings, optimizations on, no debug) are like code_typed. Turning warnings on and optimizations off is like code_warntype.
I almost always want warnings on, and tend to prefer debug info to be on as well.

It also lets you descend into any functions of interest. So if you see a type instability originates from a call to foo, you can select foo and look inside. Keep digging in this way, and you’ll find the root of the problem.
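For example, starting from the call shown above:

    julia> using Cthulhu
    julia> @descend smawk(cost_calculator, D, [], [], C_function, k, n)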

2 Likes

I think that’s already advanced (for many people). I know of GC.@preserve but have never used it (is there something similar in C++? I suppose not, it being a non-GC language; isn’t this only needed for threaded code, which C++ does have?). I’m not sure most Julia people know of it, or ever use PtrArray.

C++ has no need for GC.@preserve, not having a GC. That a language without a GC gives users much more control over memory management should be no surprise.

Still, it’s possible to preallocate in Julia and even use a rudimentary pointer-bump allocator. SimpleChains.jl does this, and achieves 800x better performance than Flux.jl for a few use cases.

PtrArray is defined here. I wouldn’t necessarily recommend it to most users. Most code bases are full of much larger, juicier, and lower hanging fruit.

1 Like

I looked into both; it wasn’t too obvious from their docs what they are for (well, for strided arrays, so only implicitly documented). Though I found:

StrideArray and PtrArray are mutable array types with optional static sizing provided by StrideArrays.jl.

For me:

julia> SV = StrideVector{UInt32}
StrideArray{UInt32, D, T, 1, C, B, R, X, O, A} where {D, T, C, B, R, X, O, A}

julia> push!(SV, 2)
ERROR: MethodError: no method matching push!(::Type{StrideArray{UInt32, D, T, 1, C, B, R, X, O, A} where {D, T, C, B, R, X, O, A}}, ::Int64)

I just guess the “optional static sizing” is done differently. Could it (also) be made more user-friendly, e.g. supporting push!? I have a feeling this package was made more for arrays than for vectors, so vectors are an afterthought?

Exactly what I had in mind. Is it still possible to do, or reuse, in Julia, even non-ergonomically? Anyway, I recalled this from LLVM, so it’s not in std::vector, and then not the reason C++ is faster here; but with it, C++ could be even faster than Julia here…

It’s not obvious to me why you can’t use it from Julia (at least when it’s a vector of UInt32 or other bits types):
https://llvm.org/doxygen/SmallVector_8h_source.html#l01181

nor the difference between it and llvm::ScalableVectorType:
https://llvm.org/doxygen/classllvm_1_1VectorType.html

If it’s just a problem of calling C++ (Cxx.jl), then it seems to me the same could be done in C, and Julia could call that.

Here’s a more faithful translation of the C++ code:

KMeans1D.jl
module Kmeans1D


function smawk(nrows, ncols, lookup, k)
    result = Vector{Int32}(undef, nrows)
    rows = collect(Int32.(1:nrows))
    cols = collect(Int32.(1:ncols))
    smawk!(result, rows, cols, lookup, k)
end

function smawk!(result, rows, cols, lookup, k)
    length(rows) == 0 && return result

    ## REDUCE
    _cols = Int32[]
    for col in cols
        @inbounds while true
            length(_cols) == 0 && break
            row = rows[length(_cols)]
            if lookup(k, row, col) >= lookup(k, row, _cols[end])
                break
            end
            pop!(_cols)
        end
        if length(_cols) < length(rows)
            push!(_cols, col)
        end
    end

    # Call recursively on odd-indexed rows
    odd_rows = Int32[]
    @inbounds for i in 2:2:length(rows)
        push!(odd_rows, rows[i])
    end

    smawk!(result, odd_rows, _cols, lookup, k)

    col_idx_lookup = Dict{Int32, Int32}()
    @inbounds for idx in 1:length(_cols)
        col_idx_lookup[_cols[idx]] = idx
    end

    ## INTERPOLATE

    # Fill-in even-indexed rows
    start = 1
    @inbounds for r in 1:2:length(rows)
        row = rows[r]
        stop = length(_cols) - 1
        if r < (length(rows) - 1)
            stop = col_idx_lookup[result[rows[r + 1]]]
        end
        argmin = _cols[start]
        min = lookup(k, row, argmin)
        for c in start+1:stop+1
            value = lookup(k, row, _cols[c])
            if (c == start) || (value < min)
                argmin = _cols[c]
                min = value
            end
        end
        result[row] = argmin
        start = stop
    end

    return result
end

struct CostCalculator
    cumsum::Vector{Float32}
    cumsum2::Vector{Float32}

    function CostCalculator(array, n::Integer)
        cumsum = Float32[0.0]
        cumsum2 = Float32[0.0]
        @inbounds for i in 1:n
            x = array[i]
            push!(cumsum, x + cumsum[i])
            push!(cumsum2, x * x + cumsum2[i])
        end
        return new(cumsum, cumsum2)
    end
end

function calc(cc::CostCalculator, i, j)
    if j < i
        return zero(eltype(cc.cumsum))
    end

    mu = (cc.cumsum[j + 1] - cc.cumsum[i]) / (j - i + 1)
    result = cc.cumsum2[j + 1] - cc.cumsum2[i]
    result += (j - i + 1) * (mu * mu)
    result -= (2 * mu) * (cc.cumsum[j + 1] - cc.cumsum[i])

    return result
end


struct LookUp
    calculator::CostCalculator
    D::Matrix{Float32}
end

function (lu::LookUp)(k, i, j)
    col = min(i, j - 1)
    if col == 0
        col = size(lu.D, 2) + col
    end
    return lu.D[k - 1, col] + calc(lu.calculator, j, i)
end

function cluster(array, k)
    n = length(array)
    return cluster(array, n, min(k, n))
end

function cluster(array, n, k)
    # Sort input array and save info for de-sorting
    sort_idx = sortperm(array)
    undo_sort_lookup = Vector{Int32}(undef, n)
    sorted_array = Vector{Float32}(undef, n)

    @inbounds for i in 1:n
        sorted_array[i] = array[sort_idx[i]]
        undo_sort_lookup[sort_idx[i]] = i
    end

    #Set D and T using dynamic programming algorithm
    cost_calculator = CostCalculator(sorted_array, n)
    D = Matrix{Float32}(undef, k, n)
    T = Matrix{Int32}(undef, k, n)
    lookup = LookUp(cost_calculator, D)

    @inbounds for i in 1:n
        D[1, i] = calc(cost_calculator, 1, i)
        T[1, i] = 1
    end

    for k_ in 2:k
        row_argmins = smawk(n, n, lookup, k_)
        @inbounds for i in 1:n
            argmin = row_argmins[i]
            min = lookup(k_, i, argmin)
            D[k_, i] = min
            T[k_, i] = argmin
        end
    end

    #Extract cluster assignments by backtracking
    centroids = zeros(k)
    sorted_clusters = Vector{Int32}(undef, n)
    t = n + 1
    k_ = k
    n_ = n

    @inbounds while t > 1
        t_ = t
        t = T[k_, n_]
        centroid = 0.0
        for i in t:t_-1
            sorted_clusters[i] = k_
            centroid += (sorted_array[i] - centroid) / (i - t + 1)
        end
        centroids[k_] = centroid
        k_ -= 1
        n_ = t - 1
    end

    clusters = Vector{Int32}(undef, n)
    @inbounds for i in 1:n
        clusters[i] = sorted_clusters[undo_sort_lookup[i]]
    end

    return centroids, clusters
end


end  # module

As others have mentioned, there are issues regarding the type instability of calc and the use of some globals, but the main issue is that your CostCalculator has abstract fields of type Array{Float32}, as opposed to Vector{Float32} (AKA Array{Float32, 1}), so Julia cannot really optimize the code paths involving CostCalculator.

With the code above I get around 5 ms.
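For reference, the timing can be reproduced with the benchmark from earlier in the thread (assuming BenchmarkTools is installed; the $ interpolation avoids benchmarking global-variable access):

    julia> using BenchmarkTools
    julia> X = rand(Float32, 1000); k = 32;
    julia> @btime Kmeans1D.cluster($X, $k);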

9 Likes