Is there a faster alternative to `hcat(xx...)`?

Hello,

I have this piece of code in a function:

  triangles = Matrix{Int}(undef, 3, 0)
  for i in 1:ntriangles
    a, b, c = triangle = ∂₂₀(i)
    nkeep = tokeep[a] + tokeep[b] + tokeep[c]
    if nkeep > 0
      push!(counts, nkeep)
      triangles = hcat(triangles, triangle)
      push!(connec, connect((a, b, c)))
    end
  end

And this function is horribly slow (I have a similar function in R and it is faster). I believe this is due to this piece of code, because there’s almost nothing other than that in the function. I suspected this is due to the repetitive call triangles = hcat(triangles, triangle). So I replaced this piece of code with:

  xx = Vector{Vector{Int}}(undef, 0)
  for i in 1:ntriangles
    a, b, c = triangle = ∂₂₀(i)
    nkeep = tokeep[a] + tokeep[b] + tokeep[c]
    if nkeep > 0
      push!(counts, nkeep)
      push!(xx, triangle)
      push!(connec, connect((a, b, c)))
    end
  end
  triangles = hcat(xx...)

But this is still slow.

Is it known that hcat(xx...) is slow? Is there a faster alternative?

Otherwise, how can one identify which part of a function is slow?

You could use reduce(hcat,xx)' (see How to convert Vector of Vectors to Matrix - #6 by rafael.guerra ), or in Julia 1.9 you can use stack.

To improve speed further, you could maybe post a minimal working example so that people can run the code. (Not visible here, but just a guess: Make sure that you don’t use globals by accident.)

There might also be a way to write the code differently, like

function filter_triangles(triangles, tokeep)
    counts = map(t -> tokeep[t[1]] + tokeep[t[2]] + tokeep[t[3]], triangles)
    keep = counts .> 0
    new_triangles = triangles[keep]
    return (counts = counts[keep], 
            triangles = new_triangles, 
            connec = connect.(new_triangles))  
end

(This uses more memory, but depending on your setting, it might even be faster.)


EDIT: As @fatteneder pointed out, I removed my wrong comment that Int is an abstract type…

The stack function will do this efficiently, see https://github.com/JuliaLang/julia/pull/43334. I think it will be available in Julia v1.9.

You nay want to represent each triangle, not as a Vector, but as a tuple of three values.

Minor correction: Int is not abstract, but Integer is:

julia> isabstracttype(Int)
false

julia> isabstracttype(Integer)
true
3 Likes

Thanks to all of you. Everything you proposed was instructive.

I’ve done numerous attempts to conclude that the slowness is elsewhere!!!

I have

values = [fn(vert) for vert in verts]

where verts is a large vector of vectors of length 3 and fn is LinearAlgebra.norm. And this is the slow part of the function!

But why this code is so slow? It is slower than R. I tried to get rid of the square roots: fn(v) = v[1] * v[1] + v[2] * v[2] + v[3] * v[3] but this is still slow.

Maybe values is not being inferred correctly? Did you try to put the type of the element in front of the [ of the generator? Like values = Int[fn(vert) for vert in verts] (note: I do not know if your array is of Ints, just use the element type there).

I tried to isolate the under investigation part of your code, to make the results clearer.
As you can see your approach allocates 10^4 times more and is 600 to 2*10^3 slower than the other two in comparison.

julia> ntriangles=[[rand(-100:100),rand(-100:100),rand(-100:100)] for _ in 1:10^5]
100000-element Vector{Vector{Int64}}:
 [-14, -33, 29]
 [90, 39, 16]

 ⋮
 
 [63, -86, -31]
 [-67, 84, -76]
 [-5, -4, -98]

julia> @time begin
       triangles = Matrix{Int}(undef, 3, 0)
       for t in ntriangles
           triangles = hcat(triangles, t)
       end
       end
 41.366821 seconds (398.81 k allocations: 111.771 GiB, 23.27% gc time)

julia> @time begin
       triangles = Matrix{Int}(undef, 3, 10^5)
       for i in eachindex(ntriangles)
           triangles[:,i] = ntriangles[i]
       end
       end
  0.070287 seconds (498.47 k allocations: 11.421 MiB)

julia> @time begin
       triangles = Matrix{Int}(undef, 3, 10^5)
       for i in eachindex(ntriangles)
           copyto!(triangles, 3*(i-1)+1, ntriangles[i], 1)   
       end
       end
  0.020577 seconds (598.13 k allocations: 12.941 MiB)

anhoter faster way to do it, if you dont know the number of triangles in adavance

@time begin
    triangles = Vector{Int}()
    for t in ntriangles
        push!(triangles, t...)
    end
    reshape(triangles,3,:)
    end

Thanks!

Still wrong! The slowness is elsewhere. And it is hard to identify where. I will ask to the developers of Meshes.jl, this is the package I’m currently contrbuting to and this is some code for this package.

My new guess is that the culprit is the conversion of the topology:

topo = convert(HalfEdgeTopology, topology(tmesh))

@juliohm What do you think? Is this conversion slow? However that topology is not that big (< 300000 elements). Shame on us if we’re slower than R :wink:

There’s no such conversion in the R code, that would explain the difference of speed.

Will crible my code with some println to find the culprit.

Thank you @rocco_sprmnt21 I will incorporate your suggestions in the code!

@juliohm bingo!

  topo = convert(HalfEdgeTopology, topology(tmesh)) # this is slow
  ∂₂₀ = Boundary{2,0}(topo) # this is slow as well

Let’s continue this discussion on Github or Zulip!

Not sure how you do your benchmarking, but inserting artificial IO operations when benchmarking (e.g. with println) will almost always falsify your benchmark.

Unless you already do, I would recommend using ProfileView or BenchmarkTools to pinpoint the problem.

[quote

I know. But here the slowness is clearly apparent with this rough way of benchmarking. In French we say “il n’y a pas photo !”. That means the difference is very striking and there’s no need of further investigations.

I need to learn some tools for profiling, I don’t know ProfileView yet. Thanks!

I really you should shift your entire code to using tuples or SVectors, since the length of 3 is structurally a part the data. It can have many advantages, both in terms of bug safety and performance.

That being said, the above function should be very fast no matter what, though it would be faster with tuples.

@_stla the HalfEdgeTopology constructor sorts the vertices by default to guarantee coherent orientation. This behavior values correctness over performance, but you can disable sorting if you know a priori that the vertices are all ordered CCW or CW. Check the option sort:

https://juliageometry.github.io/Meshes.jl/stable/meshes.html#Meshes.HalfEdgeTopology

Now regarding the Boundary{2,0}(topo) being slow, I don’t know what may be happening. It just wraps the topology object and doesn’t do anything else.