Help speed up SimpleGraph creation

Here’s a MWE of what I’m attempting to do:

using DataFrames
using LightGraphs
using ProgressMeter

df = DataFrame(a = rand(1:50_000, 150_000), b = rand(1:25_000, 150_000)) |> unique!
vs = unique(vcat(df.a, df.b))
n = length(vs)
g = SimpleGraph(n)

p = Progress(n)
Threads.@threads for v in vertices(g)
    connected_vs = filter(row -> row.a == vs[v], df).b
    for cv in connected_vs
        add_edge!(g, v, findfirst(x -> x == cv, vs))
    end
    next!(p)
end

This example will take 10 - 15 minutes to complete on my machine. My real problem is larger and ETA is about 50 minutes. Surely there’s a more performant way?? :crossed_fingers:

EDIT: It looks like the filter function is a primary culprit. Changing to this yields much better results:

df = DataFrame(a = rand(1:5_000, 15_000), b = rand(1:2_500, 15_000)) |> unique!
vs = unique(vcat(df.a, df.b))
n = length(vs)
g = SimpleGraph(n)

Threads.@threads for v in vertices(g)
    for cv in df[df.a .== vs[v], :b]
        add_edge!(g, v, findfirst(x -> x == cv, vs))
    end
end

Indeed. If the data was ordered relative to the a fields, you could use just a slice instead, and that would be much faster.

1 Like

I don’t see a Base.slice, is there a package that exports a slice function?

What I meant was that you could use something like

i = 1 
for v in vertices(g)
     j = findlast(x->x==v,df.a)
     if j != nothing
         cv = df.b[i:j]
         i = j + 1
     end
end

This appears to be about ~3 times faster than using cv = df[df.a .== vs[1], :b], but I noticed
that this alternative appears to be fast enough, isn’t it?

(I didn’t test the code to guarantee that it exactly returns the same thing you want).

1 Like

I see, thanks! Yes, just changing the find function took care of it.