The only time I use varinfo is when I have a large object in memory and I want to know how much space it is taking. Unfortunately, in many scenarios the runtime of varinfo goes through the roof for large objects, which makes it impractical to find out how large your objects actually are.
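For reference, a minimal call looks like this (x is just a throwaway example variable):

using InteractiveUtils   # provides varinfo; loaded automatically in the REPL
x = rand(1_000)
varinfo(Main, r"^x$")    # prints a table with the name, size, and summary of x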
Here is some benchmarking, first for a SimpleGraph and then for a DataFrame.
SimpleGraph
# All packages used below (Graphs provides SimpleGraph; varinfo lives in InteractiveUtils).
using Graphs, DataFrames, Random, Plots, BenchmarkTools, InteractiveUtils

# Edgeless graphs with 10, 100, ..., 10^6 vertices.
n = 10 .^ (1:6)
g1 = SimpleGraph(n[1])
g2 = SimpleGraph(n[2])
g3 = SimpleGraph(n[3])
g4 = SimpleGraph(n[4])
g5 = SimpleGraph(n[5])
g6 = SimpleGraph(n[6])

# Minimum varinfo runtime in seconds for each graph.
t_g = zeros(6)
for i in 1:6
    regex = Regex("g$i")
    t_g[i] = time(minimum(@benchmark(varinfo($regex)))) / 1e9
end
plot(
    log10.(n),
    log10.(t_g),
    legend=false,
    title="varinfo() runtime on SimpleGraph",
    xlab="log10(n)",
    ylab="log10(t_g)",
    seriestype=:line,
    markershape=:circle,
    aspectratio=1,
    ticks=-4:7,
    dpi=300
)
julia> DataFrame(n=n, t=t_g)
6×2 DataFrame
│ Row │ n       │ t           │
│     │ Int64   │ Float64     │
├─────┼─────────┼─────────────┤
│ 1   │ 10      │ 2.4053e-5   │
│ 2   │ 100     │ 5.9977e-5   │
│ 3   │ 1000    │ 0.000298084 │
│ 4   │ 10000   │ 0.00399715  │
│ 5   │ 100000  │ 0.0445316   │
│ 6   │ 1000000 │ 3.06422     │
DataFrame
# Data frames with a single column of random 8-character strings.
df1 = DataFrame(x = [randstring(8) for _ in 1:n[1]])
df2 = DataFrame(x = [randstring(8) for _ in 1:n[2]])
df3 = DataFrame(x = [randstring(8) for _ in 1:n[3]])
df4 = DataFrame(x = [randstring(8) for _ in 1:n[4]])
df5 = DataFrame(x = [randstring(8) for _ in 1:n[5]])
df6 = DataFrame(x = [randstring(8) for _ in 1:n[6]])

# Minimum varinfo runtime in seconds for each data frame.
t_df = zeros(6)
for i in 1:6
    regex = Regex("df$i")
    t_df[i] = time(minimum(@benchmark(varinfo($regex)))) / 1e9
end
plot(
    log10.(n),
    log10.(t_df),
    legend=false,
    title="varinfo() runtime on DataFrame",
    xlab="log10(n)",
    ylab="log10(t_df)",
    seriestype=:line,
    markershape=:circle,
    aspectratio=1,
    ticks=-4:7,
    dpi=300
)
julia> DataFrame(n=n, t=t_df)
6×2 DataFrame
│ Row │ n       │ t           │
│     │ Int64   │ Float64     │
├─────┼─────────┼─────────────┤
│ 1   │ 10      │ 1.9739e-5   │
│ 2   │ 100     │ 4.3522e-5   │
│ 3   │ 1000    │ 0.000242769 │
│ 4   │ 10000   │ 0.00209143  │
│ 5   │ 100000  │ 0.0252908   │
│ 6   │ 1000000 │ 0.987592    │
Discussion
You can see from the graphs and the tables that the runtime grows faster than linearly once the objects get large. Running the benchmark for the n = 1e7 case isn’t practical because it simply takes too long.
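To put a number on “faster than linearly”, one quick check is the local slope between successive points on log-log axes, where a slope of 1 means linear scaling. A minimal sketch using n and t_g from the SimpleGraph benchmark above:

# Local scaling exponent for each 10x step in n (slope on log-log axes).
local_slopes = diff(log10.(t_g)) ./ diff(log10.(n))
# With the timings above this comes out to roughly [0.4, 0.7, 1.1, 1.0, 1.8]:
# the final 10^5 -> 10^6 step is clearly super-linear.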
Is there a better way to get size estimates for large objects in memory? Maybe varinfo could be modified to use some approximations for large objects so that it runs in a reasonable amount of time?
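One partial workaround (my suggestion, not benchmarked in the post): Base.summarysize reports the total bytes reachable from a single object, and as far as I know it is what varinfo uses under the hood for its size column. Timing it directly shows whether the size computation itself, rather than varinfo’s other bookkeeping, is the bottleneck:

# A minimal sketch: time the size computation alone on the largest graph
# from above. Base.summarysize returns the total reachable bytes of g6.
using BenchmarkTools
@btime Base.summarysize($g6)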
Addendum
I noticed that if the DataFrame contains only numbers, e.g. DataFrame(x = rand(1_000_000)), then varinfo runs in a fraction of a second, even for very large data frames.
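My guess at the reason (speculation on my part, not established by the measurements above): a Vector{Float64} stores its elements inline, so its size can be computed without visiting them, whereas every String in a string column is a separate heap object that the size computation has to walk. A small sketch to test that hypothesis:

# Compare size computation on inline vs. heap-allocated element types.
# (The explanation above is my hypothesis, not a result from the post.)
using BenchmarkTools, Random
v_num = rand(1_000_000)                       # Float64 data stored inline
v_str = [randstring(8) for _ in 1:1_000_000]  # one heap object per element
@btime Base.summarysize($v_num)
@btime Base.summarysize($v_str)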