Identifying nodes in a Graph


I have data on users that I’d like to turn into a bipartite network connecting them through their consumption decisions. Since the data is rather big, I’m trying to find an efficient way to create the network (so any tips in that direction are appreciated). Basically, I (think I) need a simple weighted graph with edges weighted by the number of times a particular item was consumed. In the end I would like to extract e.g. the centrality and add it to the original DataFrame. So my main question is how best to preserve the “identifiability” of each node in the network. One way would be to use a MetaGraph and add an ID to each node. Alternatively if the nodes are added iteratively I could keep track of ID => ith node.

I am just getting started with network analysis so please correct me on anything I am saying. Any help is appreciated. Thanks!


I should add an example of what I am doing so far:

using DataFrames, Dates, Plots, Random
using LightGraphs, SimpleWeightedGraphs, GraphRecipes

customers = [randstring(3) for _ in 1:15]
df = DataFrame(customer = vcat(customers, customers),
               item = rand(["apple", "bread", "banana"], 30),
               date = Date(2020, 1, 1) .+ Day.(rand(1:100, 30)))

dfg = combine(nrow => :weight,  groupby(df, [:customer, :item]))
verts = sort([unique(df.customer)..., unique(df.item)...])
G = SimpleWeightedDiGraph(length(verts))
labels = Dict() # for plot
for row in eachrow(dfg)
        # verts is sorted, so a binary search gives each label's vertex number
        s = searchsortedfirst(verts, row.customer)
        d = searchsortedfirst(verts, row.item)
        w = row.weight
        add_edge!(G, SimpleWeightedEdge(s, d, w))
        labels[(s,d)] = w
end

centr = eigenvector_centrality(G)
dfn = DataFrame(vertex = verts, centrality = centr)
plot(verts, centr,
                seriestype = :scatter,
                legend = false,
                xrotation = 60,
                xticks = :all)
graphplot(G, names = verts, edgelabel = labels, arrow = true)

One way would be to use a MetaGraph and add an ID to each node.

Yes. That’s probably the easiest way to do it. It will handle your weights field as well. Try out the new MetaGraphsNext package (cc @bramtayl) for some extra fun and speed.

Alternatively if the nodes are added iteratively I could keep track of ID => ith node.

You could store the forward mapping (vertex number -> ID) in a vector, and the reverse mapping (ID -> vertex number) in a dictionary.
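
A minimal sketch of that two-way mapping, using only Base Julia. The helper name `build_index` and the toy data are made up for illustration and are not part of any package; the placeholder `scores` vector stands in for whatever per-vertex result (e.g. a centrality) you compute later.

```julia
# Hypothetical helper (not from any package): builds both lookup directions
# for a collection of node IDs.
function build_index(ids)
    id_of = unique(ids)  # forward mapping: vertex number -> ID
    # reverse mapping: ID -> vertex number
    vertex_of = Dict(id => i for (i, id) in enumerate(id_of))
    return id_of, vertex_of
end

# Toy data, assumed for illustration only
customers = ["abc", "xyz", "abc", "q1w"]
items = ["apple", "bread", "apple", "banana"]

id_of, vertex_of = build_index(vcat(customers, items))

# Edge endpoints as vertex numbers, ready for a constructor that takes
# source and destination vectors
src = [vertex_of[c] for c in customers]
dst = [vertex_of[i] for i in items]

# Any per-vertex result is ordered like id_of, so it maps straight back
# to the original IDs (placeholder values here, not a real centrality)
scores = collect(1.0:length(id_of))
score_by_id = Dict(zip(id_of, scores))
```

One caveat with this sketch: customers and items share a single ID space, so a customer ID that happens to equal an item name would collapse into one vertex. Prefixing or tagging the two kinds of IDs would avoid that.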


Thank you for the reply! So far I have improved it to the following, which seems to be reasonably fast:

using DataFrames, Base.Threads, SimpleWeightedGraphs

function makebigraph(df, customer, item)
    # One row per (customer, item) pair; weight = number of purchases
    dfg = combine(nrow => :weight, groupby(df, [customer, item]))
    # Forward mapping: vertex number -> ID
    verts = [unique(df[!, customer])..., unique(df[!, item])...]
    # Reverse mapping: ID -> vertex number
    vdict = Dict(verts .=> 1:length(verts))
    src = Vector{Int}(undef, nrow(dfg))
    dst = Vector{Int}(undef, nrow(dfg))
    @threads for iter in 1:nrow(dfg)
        src[iter] = vdict[dfg[iter, customer]]
        dst[iter] = vdict[dfg[iter, item]]
    end
    G = SimpleWeightedDiGraph(src, dst, dfg[!, :weight])
    return G, verts
end

Can MetaGraphs also be created from three vectors (source, destination, weight)? That seems to be much faster, partly because it can be multithreaded. This is what I came up with for MetaGraphsNext, which works but does not seem very elegant (and takes ~5x as long):

using DataFrames
using LightGraphs, MetaGraphsNext

function makemetagraph(df, customer, item)
    dfg = combine(nrow => :weight, groupby(df, [customer, item]))
    # Index by the column arguments rather than hard-coding df.customer / df.item
    verts = [unique(df[!, customer])..., unique(df[!, item])...]
    G = MetaGraph(DiGraph(), EdgeMeta = Int64, defaultweight = 0, weightfunction = identity)
    for v in verts
        G[Symbol(v)] = nothing  # add a vertex labelled Symbol(v), no vertex metadata
    end
    for r in eachrow(dfg)
        G[Symbol(r[customer]), Symbol(r[item])] = r[:weight]  # edge metadata is the weight
    end
    return G
end

mm = makemetagraph(df, :customer, :item)
gg, verts = makebigraph(df, :customer, :item)
centrG = closeness_centrality(gg)
centrM = closeness_centrality(mm)
centrM == centrG #true

Any tips on improving this?

@danielw2904 what about something like this?

function makemetagraph(df, customer, item)
    G = MetaGraph(
        DiGraph(),  # directed base graph, as in the version above
        EdgeMeta = Int64,
        defaultweight = 0,
        weightfunction = identity,
    )
    for group in groupby(df, [customer, item])
        customer_id = Symbol(group[1, customer])
        item_name = Symbol(group[1, item])
        G[customer_id] = nothing
        G[item_name] = nothing
        # Group size = number of times this customer bought this item
        G[customer_id, item_name] = size(group, 1)
    end
    return G
end

That is much nicer. Thank you!