Confused by extra allocations

I’ve written a function that loops over a large table and is running slower than I expected. The source of the problem seems to be a huge amount of extra allocations. I’ve created the following minimal example that demonstrates the issue:

using DataFrames
using Tables

function f(xs)
  a = Tables.columntable(xs)
  for _ in 1:length(a.x)
  end
end
julia> d = DataFrame((x=rand(10^6)));

julia> @time f(d)
  0.109198 seconds (3.01 M allocations: 61.923 MiB, 8.61% gc time, 24.60% compilation time)

julia> @time f(d)
  0.130291 seconds (3.00 M allocations: 61.020 MiB, 13.04% gc time)

Is this a bug? Why would an empty loop over a range generate millions of allocations (the number of allocations scales with the size of the dataframe).


julia> using DataFrames, Tables

julia> const d = DataFrame((x=rand(10^6)));

julia> @time length(Tables.columntable(d).x);
  0.126591 seconds (155.26 k allocations: 9.318 MiB, 99.71% compilation time)

julia> @time length(Tables.columntable(d).x);
  0.000026 seconds (11 allocations: 400 bytes)

julia> @time for i = 1:length(Tables.columntable(d).x) end;
  0.187395 seconds (3.00 M allocations: 61.020 MiB, 67.25% gc time)

julia> @time for i = 1:length(Tables.columntable(d).x) end;
  0.049252 seconds (3.00 M allocations: 61.020 MiB, 7.41% gc time)

indeed looks weird

1 Like

It’s type unstable. The type of a cannot be inferred, so the type of the range can’t be inferred. So all those allocations will be from unboxing and boxing while iterating over the range.

To avoid this you can put the main body of the function into a second function which accepts a itself.

1 Like

Isn’t length just a number? why would it matter:

julia> @time begin L = length(Tables.columntable(d).x); for i = 1:L end end;
  0.054953 seconds (3.00 M allocations: 61.020 MiB, 13.82% gc time)

julia> @time begin L = length(Tables.columntable(d).x); for i = 1:L÷100 end end;
  0.000486 seconds (28.99 k allocations: 609.469 KiB)

julia> @time begin L = Int64(length(Tables.columntable(d).x)); for i = 1:L end end;
  0.050081 seconds (3.00 M allocations: 61.020 MiB, 9.13% gc time)

julia> @time begin L = Int64(length(Tables.columntable(d).x)); for i = 1:L÷100 end end;
  0.001008 seconds (28.99 k allocations: 609.469 KiB)

length is just a function like any other. If the input types can’t be inferred then the compiler will struggle to infer the output types.

At best the compiler might infer the return type is Integer but this is abstract and the compiler needs to produce code that works with any subtype.