Here’s a MWE of what I’m attempting to do:
using DataFrames
using LightGraphs
using ProgressMeter
df = DataFrame(a = rand(1:50_000, 150_000), b = rand(1:25_000, 150_000)) |> unique!
vs = unique(vcat(df.a, df.b))
n = length(vs)
g = SimpleGraph(n)
p = Progress(n)
Threads.@threads for v in vertices(g)
    connected_vs = filter(row -> row.a == vs[v], df).b
    for cv in connected_vs
        add_edge!(g, v, findfirst(x -> x == cv, vs))
    end
    next!(p)
end
This example takes 10 to 15 minutes to complete on my machine. My real problem is larger, with an ETA of about 50 minutes. Surely there’s a more performant way?
EDIT: It looks like the filter function is a primary culprit. Changing to this yields much better results:
df = DataFrame(a = rand(1:5_000, 15_000), b = rand(1:2_500, 15_000)) |> unique!
vs = unique(vcat(df.a, df.b))
n = length(vs)
g = SimpleGraph(n)
Threads.@threads for v in vertices(g)
    for cv in df[df.a .== vs[v], :b]
        add_edge!(g, v, findfirst(x -> x == cv, vs))
    end
end
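
A further possible step, as a rough sketch rather than a tested solution: both versions above still scan df.a once per vertex and run a linear findfirst over vs for every edge. Building a Dict from each label to its position in vs and then walking the rows once should avoid both scans. The name index_of is made up for illustration. The loop is deliberately single-threaded, on the assumption that add_edge! mutates shared adjacency lists and is not safe to call from several threads at once.

```julia
using DataFrames
using LightGraphs

df = DataFrame(a = rand(1:5_000, 15_000), b = rand(1:2_500, 15_000)) |> unique!
vs = unique(vcat(df.a, df.b))

# Hypothetical helper: map each vertex label to its position in `vs`,
# so each lookup is a hash lookup instead of a linear `findfirst` scan.
index_of = Dict(label => i for (i, label) in enumerate(vs))

g = SimpleGraph(length(vs))

# One pass over the rows. Each row (a, b) becomes one edge between the
# positions of a and b in `vs`, which is the same edge set the per-vertex
# loops above produce.
for (a, b) in zip(df.a, df.b)
    add_edge!(g, index_of[a], index_of[b])
end
```

With this shape the work is roughly proportional to the number of rows rather than rows times vertices, so threading may no longer be needed for data of this size.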