Dictionary-based arrays: represent wide heterogeneous tables while enjoying the familiar Julia collection and Tables interfaces.
Use DictArrays when you need a lean table type, but the compilation overhead of type-stable solutions (Vector{NamedTuple}, NamedTuple{Vector}, StructArray) is too much.
DictArrays are similar to StructArrays and have the same interface where possible, with the defining difference that DictArrays do not encode columns in the table type. This gets rid of the prohibitive compilation overhead for wide tables with hundreds of columns or more.
Despite the inherent type instability, regular Julia data manipulation functions such as map and filter are fast for DictArrays: they have almost no overhead compared to StructArrays and are orders of magnitude faster than plain Vectors of Dicts.
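As a minimal sketch of what this looks like in practice, the snippet below builds a small two-column DictArray and runs map and filter over it. It assumes the DictArrays package is installed and uses only the keyword constructor and the calls mentioned in this README; the column names and sizes are made up for illustration.

```julia
using DictArrays

# A tiny two-column table; the names a and b are illustrative only.
da = DictArray(a=1:5, b=collect(1.0:5.0))

# Elements expose columns as properties, so ordinary anonymous functions work.
sums = map(x -> x.a + x.b, da)

# Keep only the rows where column a exceeds 2.
big = filter(x -> x.a > 2, da)
```

The same code should work unchanged on a StructArray with the same columns, which makes it easy to switch between the two when comparing compilation time.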
Compilation and runtime comparison
No compilation overhead:
# 1000 columns - almost instant
julia> da = @time DictArray(Dictionary(Symbol.(:a, 1:10^3), fill(1:1, 10^3)))
0.001211 seconds (5.50 k allocations: 313.422 KiB)
# while StructArrays start to struggle:
julia> @time StructArray(da);
7.496190 seconds (626.85 k allocations: 37.730 MiB, 0.30% gc time, 99.52% compilation time)
# DictArray compilation doesn't depend on the number of columns
# even absurd hundreds of thousands of columns are fine:
julia> @time DictArray(Dictionary(Symbol.(:a, 1:10^5), [fill(1:1, 2*10^4); fill([1.], 2*10^4); fill([:a], 2*10^4); fill(["a"], 2*10^4); fill([false], 2*10^4)]))
0.228542 seconds (878.81 k allocations: 39.484 MiB, 11.63% gc time, 52.54% compilation time)
At the same time, map is as fast as for type-stable arrays:
julia> da = DictArray(a=1:10^6, b=collect(1.0:10^6), c=fill("hello", 10^6));
# DictArray
julia> @btime map(x -> x.a + x.b, $da)
1.430 ms (300 allocations: 7.65 MiB)
# fast baseline: StructArray
# basically the same timings
julia> @btime map(x -> x.a + x.b, $(StructArray(da)))
1.314 ms (2 allocations: 7.63 MiB)
# slow baseline: plain Vector of Dictionaries
# orders of magnitude slower, many allocations
julia> @btime map(x -> x.a + x.b, $(collect(da)))
100.512 ms (1000022 allocations: 22.89 MiB)