`varinfo()` on a large SimpleGraph or DataFrame is prohibitively slow

The only time I use varinfo is when I have a large object in memory and I want to know how much space it is taking. Unfortunately, in many scenarios the runtime of varinfo goes through the roof for exactly those large objects, so in practice it can't answer the one question I use it for.

Here is some benchmarking. First for a SimpleGraph and then for a DataFrame.

SimpleGraph

using Graphs, DataFrames, BenchmarkTools, Plots, Random  # SimpleGraph lives in Graphs.jl (formerly LightGraphs.jl)

# object sizes to test: 10, 100, ..., 1_000_000
n = 10 .^ (1:6)

g1 = SimpleGraph(n[1])
g2 = SimpleGraph(n[2])
g3 = SimpleGraph(n[3])
g4 = SimpleGraph(n[4])
g5 = SimpleGraph(n[5])
g6 = SimpleGraph(n[6])

t_g = zeros(6)
for i in 1:6
    regex = Regex("g$i")   # select just the variable g$i
    # minimum benchmark time, converted from nanoseconds to seconds
    t_g[i] = time(minimum(@benchmark(varinfo($regex)))) / 1e9
end

plot(
    log10.(n),
    log10.(t_g),
    legend=false,
    title="varinfo() runtime on SimpleGraph",
    xlab="log10(n)",
    ylab="log10(t_g)",
    seriestype=:line,
    markershape=:circle,
    aspectratio=1,
    ticks=-4:7,
    dpi=300
)

julia> DataFrame(n=n, t=t_g)
6×2 DataFrame
│ Row │ n       │ t           │
│     │ Int64   │ Float64     │
├─────┼─────────┼─────────────┤
│ 1   │ 10      │ 2.4053e-5   │
│ 2   │ 100     │ 5.9977e-5   │
│ 3   │ 1000    │ 0.000298084 │
│ 4   │ 10000   │ 0.00399715  │
│ 5   │ 100000  │ 0.0445316   │
│ 6   │ 1000000 │ 3.06422     │

DataFrame

df1 = DataFrame(x = [randstring(8) for _ in 1:n[1]])
df2 = DataFrame(x = [randstring(8) for _ in 1:n[2]])
df3 = DataFrame(x = [randstring(8) for _ in 1:n[3]])
df4 = DataFrame(x = [randstring(8) for _ in 1:n[4]])
df5 = DataFrame(x = [randstring(8) for _ in 1:n[5]])
df6 = DataFrame(x = [randstring(8) for _ in 1:n[6]])

t_df = zeros(6)
for i in 1:6
    regex = Regex("df$i")
    t_df[i] = time(minimum(@benchmark(varinfo($regex)))) / 1e9
end

plot(
    log10.(n),
    log10.(t_df),
    legend=false,
    title="varinfo() runtime on DataFrame",
    xlab="log10(n)",
    ylab="log10(t_df)",
    seriestype=:line,
    markershape=:circle,
    aspectratio=1,
    ticks=-4:7,
    dpi=300
)

julia> DataFrame(n=n, t=t_df)
6×2 DataFrame
│ Row │ n       │ t           │
│     │ Int64   │ Float64     │
├─────┼─────────┼─────────────┤
│ 1   │ 10      │ 1.9739e-5   │
│ 2   │ 100     │ 4.3522e-5   │
│ 3   │ 1000    │ 0.000242769 │
│ 4   │ 10000   │ 0.00209143  │
│ 5   │ 100000  │ 0.0252908   │
│ 6   │ 1000000 │ 0.987592    │

Discussion

You can see from the graphs and the tables that the runtime grows considerably faster than linearly once the objects get large (a rough calculation follows below). I didn't even attempt the n = 10^7 case because a single varinfo call already takes too long to benchmark.
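To put a number on "faster than linearly": since each step in the tables increases n by exactly one decade, the log–log slope of the last segment is just log10 of the ratio of the last two timings, and it comes out well above 1 for both object types:

# empirical scaling exponent over the last decade (n = 10^5 -> 10^6),
# using the timings from the tables above
log10(3.06422 / 0.0445316)    # ≈ 1.84  (SimpleGraph)
log10(0.987592 / 0.0252908)   # ≈ 1.59  (DataFrame)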

Is there a better way to get size estimates for large objects in memory? Maybe varinfo could be modified to use some approximations for large objects so that it runs in a reasonable amount of time?
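For a single named object, Base.summarysize gives the same byte count that varinfo reports (varinfo uses it internally), but it does the same recursive traversal, so I would expect it to be just as slow here. The only workaround I can think of is to skip the recursion and sum sizeof over the object's main storage. This is just a sketch: it assumes the fadjlist field of SimpleGraph and the single x column of the string DataFrame, and it ignores per-object headers and shared references, so it undercounts.

# Rough, non-recursive size estimates (sketch only; ignores array/object
# headers and shared references, and assumes the fields named below).
approx_graph_bytes(g) = sizeof(g.fadjlist) + sum(sizeof, g.fadjlist)   # adjacency lists
approx_df_bytes(df)   = sizeof(df.x) + sum(sizeof, df.x)               # column vector + string buffers

approx_graph_bytes(g6) / 2^20   # MiB
approx_df_bytes(df6) / 2^20     # MiB

This stays cheap because it is a plain loop over the arrays, without the bookkeeping summarysize presumably does to avoid double-counting shared references.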

Addendum

I noticed that if the DataFrame contains only numbers, e.g. DataFrame(x = rand(1_000_000)), then varinfo runs in a fraction of a second, even for very large data frames.
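A quick way to check this in the same session (df_num is just a throwaway name for the example):

df_num = DataFrame(x = rand(1_000_000))   # purely numeric column, same length as df6
@btime varinfo(r"df_num")                 # completes in a fraction of a second, per the observation above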