`varinfo()` on a large SimpleGraph or DataFrame is prohibitively slow

The only time I use varinfo is when I have a large object in memory and I want to know how much space it is taking. Unfortunately, in many scenarios the runtime of varinfo goes through the roof for exactly those large objects, so in practice it can't answer the one question I use it for.

Here is some benchmarking. First for a SimpleGraph and then for a DataFrame.

SimpleGraph

using Graphs, DataFrames, BenchmarkTools, Plots, Random  # SimpleGraph lives in Graphs.jl (formerly LightGraphs.jl)

# object sizes to test: 10, 100, ..., 1_000_000
n = 10 .^ (1:6)

g1 = SimpleGraph(n[1])
g2 = SimpleGraph(n[2])
g3 = SimpleGraph(n[3])
g4 = SimpleGraph(n[4])
g5 = SimpleGraph(n[5])
g6 = SimpleGraph(n[6])

t_g = zeros(6)
for i in 1:6
    regex = Regex("g$i")   # select just the variable g$i
    # minimum benchmark time, converted from nanoseconds to seconds
    t_g[i] = time(minimum(@benchmark(varinfo($regex)))) / 1e9
end

plot(
    log10.(n),
    log10.(t_g),
    legend=false,
    title="varinfo() runtime on SimpleGraph",
    xlab="log10(n)",
    ylab="log10(t_g)",
    seriestype=:line,
    markershape=:circle,
    aspectratio=1,
    ticks=-4:7,
    dpi=300
)

julia> DataFrame(n=n, t=t_g)
6×2 DataFrame
│ Row │ n       │ t           │
│     │ Int64   │ Float64     │
├─────┼─────────┼─────────────┤
│ 1   │ 10      │ 2.4053e-5   │
│ 2   │ 100     │ 5.9977e-5   │
│ 3   │ 1000    │ 0.000298084 │
│ 4   │ 10000   │ 0.00399715  │
│ 5   │ 100000  │ 0.0445316   │
│ 6   │ 1000000 │ 3.06422     │

DataFrame

df1 = DataFrame(x = [randstring(8) for _ in 1:n[1]])
df2 = DataFrame(x = [randstring(8) for _ in 1:n[2]])
df3 = DataFrame(x = [randstring(8) for _ in 1:n[3]])
df4 = DataFrame(x = [randstring(8) for _ in 1:n[4]])
df5 = DataFrame(x = [randstring(8) for _ in 1:n[5]])
df6 = DataFrame(x = [randstring(8) for _ in 1:n[6]])

t_df = zeros(6)
for i in 1:6
    regex = Regex("df$i")
    t_df[i] = time(minimum(@benchmark(varinfo($regex)))) / 1e9
end

plot(
    log10.(n),
    log10.(t_df),
    legend=false,
    title="varinfo() runtime on DataFrame",
    xlab="log10(n)",
    ylab="log10(t_df)",
    seriestype=:line,
    markershape=:circle,
    aspectratio=1,
    ticks=-4:7,
    dpi=300
)

julia> DataFrame(n=n, t=t_df)
6×2 DataFrame
│ Row │ n       │ t           │
│     │ Int64   │ Float64     │
├─────┼─────────┼─────────────┤
│ 1   │ 10      │ 1.9739e-5   │
│ 2   │ 100     │ 4.3522e-5   │
│ 3   │ 1000    │ 0.000242769 │
│ 4   │ 10000   │ 0.00209143  │
│ 5   │ 100000  │ 0.0252908   │
│ 6   │ 1000000 │ 0.987592    │

Discussion

You can see from the graphs and the tables that the runtime grows considerably faster than linearly once the objects get large (a rough calculation follows below). I didn't even attempt the n = 10^7 case because a single varinfo call already takes too long to benchmark.
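To put a number on "faster than linearly": since each step in the tables increases n by exactly one decade, the log–log slope of the last segment is just log10 of the ratio of the last two timings, and it comes out well above 1 for both object types:

# empirical scaling exponent over the last decade (n = 10^5 -> 10^6),
# using the timings from the tables above
log10(3.06422 / 0.0445316)    # ≈ 1.84  (SimpleGraph)
log10(0.987592 / 0.0252908)   # ≈ 1.59  (DataFrame)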

Is there a better way to get size estimates for large objects in memory? Maybe varinfo could be modified to use some approximations for large objects so that it runs in a reasonable amount of time?
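For a single named object, Base.summarysize gives the same byte count that varinfo reports (varinfo uses it internally), but it does the same recursive traversal, so I would expect it to be just as slow here. The only workaround I can think of is to skip the recursion and sum sizeof over the object's main storage. This is just a sketch: it assumes the fadjlist field of SimpleGraph and the single x column of the string DataFrame, and it ignores per-object headers and shared references, so it undercounts.

# Rough, non-recursive size estimates (sketch only; ignores array/object
# headers and shared references, and assumes the fields named below).
approx_graph_bytes(g) = sizeof(g.fadjlist) + sum(sizeof, g.fadjlist)   # adjacency lists
approx_df_bytes(df)   = sizeof(df.x) + sum(sizeof, df.x)               # column vector + string buffers

approx_graph_bytes(g6) / 2^20   # MiB
approx_df_bytes(df6) / 2^20     # MiB

This stays cheap because it is a plain loop over the arrays, without the bookkeeping summarysize presumably does to avoid double-counting shared references.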

Addendum

I noticed that if the DataFrame contains only numbers, e.g. DataFrame(x = rand(1_000_000)), then varinfo runs in a fraction of a second, even for very large data frames.
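A quick way to check this in the same session (df_num is just a throwaway name for the example):

df_num = DataFrame(x = rand(1_000_000))   # purely numeric column, same length as df6
@btime varinfo(r"df_num")                 # completes in a fraction of a second, per the observation above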