DataFrames vs ndsparse for indexed data

jekyllstein · May 5, 2021, 12:22am

With the release of DataFrames 1.0, I was curious how its performance for accessing data compares with something like an indexed table. The example is tabular data with a key column, where I want to return the subtable of rows whose key matches a given value.

using DataFrames
using JuliaDB
using BenchmarkTools  # provides @btime, used in the timings below
df = DataFrame(sex = ["male", "male", "female", "male", "female", "male"], age = [20, 14, 65, 34, 23, 67])
df[df.sex .== "male", :]

tbl = ndsparse(df)
tbl[("male",)]

When I benchmark this on a laptop I get the following results, but I am not sure whether the difference is simply due to the tiny size of this example:

julia> @btime $df[$df.sex .== "male", :]
  1.450 μs (21 allocations: 1.58 KiB)
4×2 typename(DataFrame)
│ Row │ sex    │ age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ male   │ 20    │
│ 2   │ male   │ 14    │
│ 3   │ male   │ 34    │
│ 4   │ male   │ 67    │
julia> @btime $tbl[("male",)]
  6.778 μs (80 allocations: 4.44 KiB)
1-d NDSparse with 4 values (1 field named tuples):
sex    │ age
───────┼────
"male" │ 20
"male" │ 14
"male" │ 34
"male" │ 67

nilshg · May 5, 2021, 9:44am

What’s the question? I would try this with larger examples to see whether the differences are meaningful.

Also note that with DataFrames you can write @view df[df.sex .== "male", :], which returns a view instead of copying the selected rows; it allocates less and is around twice as fast on my machine.
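
Here is a rough sketch of what both suggestions could look like together: the same selection on a much larger table, timed once as a copy and once as a view. The table size, the random seed, and the helper names copy_rows and view_rows are my own choices for illustration, not anything from the original post. The groupby lookup at the end is an extra assumption on my part, included as the closest DataFrames analogue to a keyed lookup that I know of. No timings are shown because they depend on the machine.

```julia
using DataFrames
using BenchmarkTools
using Random

Random.seed!(1)   # fixed seed so the table is reproducible
n = 1_000_000     # arbitrary size, chosen only for illustration

# Same two columns as the original example, filled with random values.
big = DataFrame(sex = rand(["male", "female"], n), age = rand(1:90, n))

# Boolean-mask indexing: copies the matching rows into a new DataFrame.
copy_rows(df) = df[df.sex .== "male", :]

# Same selection as a view: a SubDataFrame that references the parent table.
view_rows(df) = @view df[df.sex .== "male", :]

@btime copy_rows($big);
@btime view_rows($big);

# Group once up front, then look up a group by its key value.
gdf = groupby(big, :sex)
@btime $gdf[("male",)];
```

The ndsparse lookup from the first post can be timed the same way by building tbl from big and interpolating it into @btime. Note that the view version still has to build the Boolean mask, so any speedup comes from skipping the row copy, not from skipping the scan over the key column.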
