DataFrames vs ndsparse for indexed data

With the release of DataFrames 1.0, I was curious how its performance compares with something like an indexed table when accessing data by key. The example is tabular data with a key column, where I want to return the subtable of rows whose key matches a given value.

using DataFrames
using JuliaDB
using BenchmarkTools  # provides @btime, used in the timings below
df = DataFrame(sex = ["male", "male", "female", "male", "female", "male"], age = [20, 14, 65, 34, 23, 67])
df[df.sex .== "male", :]

tbl = ndsparse(df)
tbl[("male",)]

When I benchmark this on a laptop I get the following results, but I am not sure whether they are simply an artifact of the tiny size of this example:

julia> @btime $df[$df.sex .== "male", :]
  1.450 μs (21 allocations: 1.58 KiB)
4×2 typename(DataFrame)
│ Row │ sex    │ age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ male   │ 20    │
│ 2   │ male   │ 14    │
│ 3   │ male   │ 34    │
│ 4   │ male   │ 67    │
julia> @btime $tbl[("male",)]
  6.778 μs (80 allocations: 4.44 KiB)
1-d NDSparse with 4 values (1 field named tuples):
sex    β”‚ age
───────┼────
"male" β”‚ 20
"male" β”‚ 14
"male" β”‚ 34
"male" β”‚ 67

What's the question? I would try this with larger examples to see whether the differences are meaningful.
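A minimal sketch of what such a larger test could look like on the DataFrames side. The row count `n`, the random data, and the helper name `male_rows` are assumptions made for illustration, not anything from the original post; the JuliaDB side could be timed the same way by building the table with the `ndsparse` call shown above.

```julia
using DataFrames
using BenchmarkTools
using Random

Random.seed!(1)   # fixed seed so repeated runs use the same data
n = 1_000_000     # hypothetical size; adjust to resemble your real data

# Same two columns as the toy example, filled with random values
big = DataFrame(sex = rand(["male", "female"], n), age = rand(1:90, n))

# Hypothetical helper: select matching rows with a boolean mask.
# The mask scans the whole column, so the cost grows with n.
male_rows(df) = df[df.sex .== "male", :]

@btime male_rows($big);
```

Running this at a few different values of `n` should show whether the microsecond-level gap seen on six rows still matters at realistic sizes.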

Also note that with DataFrames, you can do @view df[df.sex .== "male", :] which allocates less and is around twice as fast on my machine.
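To illustrate the difference, here is a small sketch using the same six-row table; the variable names `copied` and `viewed` are made up for this example. The boolean mask is still built in both cases, but the view does not copy the selected column data, which is presumably where the savings come from.

```julia
using DataFrames

df = DataFrame(sex = ["male", "male", "female", "male", "female", "male"],
               age = [20, 14, 65, 34, 23, 67])

copied = df[df.sex .== "male", :]        # new DataFrame with copied columns
viewed = @view df[df.sex .== "male", :]  # SubDataFrame referring to rows of df

# A view shares storage with its parent, so writes through it reach df
viewed.age[1] = 21
df.age[1]      # 21, changed through the view
copied.age[1]  # 20, the copy is unaffected
```

The trade-off is that a view stays tied to `df`, so it is the better fit when the subtable is only read, or when writing back to the parent is actually intended.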
