[DataFrames Question]: hash-based row indexing for DataFrames package

Sijun · October 15, 2019, 5:31am

Question: Any plan for hash-based indexing for DataFrame

Suppose the following table is given (actual table that I work on contains about a million records):

df = DataFrame(id=[7,1,5,3,4,6,2], val=[5,6,9,3,4,10,4])

Now I am given some arrays of Ids i.e. ids = [5,4,2] (and more of this kind) and asked to extract rows that match the ids.

For small dataset, I could do like using off-the-shelf method indexin():

df[indexin(ids, df.id), :]

But this operation should be too slow to be suitable for large dataset. (with actual dataset and problem given, it took 1.5 hours)

In Pandas however it could be performed effectively by doing as (although Pandas (and python) generally not so comfortable in many other aspects):

df = df.set_index("id")
df.loc[ids, :]

I maybe mimick the approach in Julia DataFrame as (and with this approach, it took 20 seconds):

hashidx = Dict(zip(df.id, 1:nrow(df)))
df[[hashidx[i] for i in ids], :]

I believe this kind of row indexing is required very often in practical data wrangling situations, however not supported in DataFrames package.

jling · October 15, 2019, 6:18am

As a middle ground, you can use Query:

julia> using Random, DataFrames, Query, BenchmarkTools

julia> df = DataFrame(id=shuffle(1:10^6), val=shuffle(1:10^6));

julia> want_ids = Set(rand(1:10^6, 10^5));

julia> @btime $df[indexin($want_ids, $df.id), :]
  81.002 ms (80 allocations: 67.44 MiB)

julia> @btime $df |>
       @filter(_.id in $want_ids) |>
       DataFrame
  20.035 ms (91 allocations: 4.00 MiB)

this is slower than your hash table, but way more flexible, and I agree there should be some hash-based approach maybe just I don’t know

Sijun · October 15, 2019, 6:36am

Thanks @jling
I just experimented. Still very slow for processing real life data but It’s good to have learned another good trick!

jling · October 15, 2019, 6:39am

how large exactly is your data? million rows and how many columns?

xiaodai · October 15, 2019, 6:47am

That’s a spectacularly inefficient way to index into a DataFrame. Try this

using DataFrames

df = DataFrame(
    id=rand(1:1_000, 100_000_000),
    val=rand(100_000_000))

using BenchmarkTools

filter_ds(df, ids) = begin
    index = in.(df.id, Ref(ids))
    df[index, :]
end

@benchmark df2 = filter_ds($df, $ids)

@time df2 = filter_ds(df, ids)

xiaodai · October 15, 2019, 6:54am

BTW, the idea is you can use true/false to index into the array. See

DataFrame(a = 1:3)[[true, false, true], :]

Sijun · October 15, 2019, 6:56am

Thanks @xiaodai

your suggestion shows much better speed than that of indexin or Query.jl but just a few order of magnitudes.
Still it can’t beat Dict’s hash.

Since @jling just asked about how large the data is, here it is:

This is to reproduce the following paper’s result (Churn prediction)

Paper: https://www.andrew.cmu.edu/user/lakoglu/pubs/StackOverflow-churn.pdf

Description of datasets: https://ia800107.us.archive.org/27/items/stackexchange/readme.txt

Some reduced datasets (users_reduce.csv, posts.reduce.csv) are available at:
https://drive.google.com/open?id=1Fp_7GDH_t7xfnU8aXeKrcBC54_nECOcu

length of records are

julia> nrow(users)
992110

julia> nrow(posts)
9523987

respectively.

jling · October 15, 2019, 6:56am

good point, I wasn’t thinking earlier, thanks for the source! now I wonder why Query doesn’t do this, maybe they should add a specialized filter for this use case

julia> @btime $df[in.($df.id, Ref($want_ids)),:]
  17.471 ms (33 allocations: 2.30 MiB)

xiaodai · October 15, 2019, 7:01am

It does take some time to build a hash base index. As far as I know DataFrame has a backlog of features and indexing is of them. I don’t think it’s that hard to build a IndexedDataFrame if you want to try.

Otherwise, you can try IndexedTables.jl which already has indexing features, if you must.

You can also sort the data and save it with JDF.jl, so the next time you load, it’s already sorted which beats indexing.

Sijun · October 15, 2019, 7:03am

@xiaodai, Thank you for the valuable information. I will try that out!

xiaodai · October 15, 2019, 7:10am

This is a fast (and correct) way to build a hash dict

build_dict(df) = begin    
    df_index = by(DataFrame(id = df.id, rowid = 1:size(df, 1)), :id, positions = :rowid => x->(x,))
    d = Dict(c.id => c.positions[1] for c in eachrow(df_index))    
    d
end

@time hashidx = build_dict(df)

I don’t think the below in your original post is correct

hashidx = Dict(zip(df.id, 1:nrow(df)))

Sijun · October 15, 2019, 7:21am

@xiaodai, your are right. Actually I assumed, the df.id has no duplicates. Your solution is general in that respect. Pandas’s index seems to take only the last one when there are duplicates on the index.

xiaodai · October 15, 2019, 7:31am

Generally not a fan of Query.jl as it has poor group by performance. It doesn’t scale for modern data wrangling workloads.

zgornel · October 15, 2019, 8:04am

Try NDSparse from IndexedTables

nds = ndsparse((id=[7,1,5,3,4,6,2],), (val=[5,6,9,3,4,10,4],))
idxs = [5,4,2]
foo=x->getindex(nds, x).val;
foo.(idxs)  # returns Vector{Int}

or

nds[sort(idxs)]  # returns NDSparse

Sijun · October 15, 2019, 8:53am

@zgornel, thank you!

That seems to be the closest solution among off the self methods.
In order to adapt it to my data, I needs some polishing, for example nds[ ] does not accept empty array.
But I need to know how to access column of nds. I imported JuliaDB, IndexedTables but select(nds, :val) kind of operation not working nor nds[:, :val]. I am browsing the doc…but still can’t find.

zgornel · October 15, 2019, 8:56am

Columns access:

getproperty(columns(nds), :val)

or simply

columns(nds).val

For row access,
rows(nds) provides a row iterator where each row is a NamedTuple of the form (id=..., val=...).

EDIT: Indeed, the underlying structures for both IdexedTable, NDSparse are elegant enough to allow one defining other access/selection methods. Generally, for NDSparse, low-level access is provided by
the data and index properties (i.e. nds.data, nds.index)

Sijun · October 16, 2019, 1:58am

@zgornel, it turns out that using ndsparse for the purpose of fast indexing gives no advantage. I think the caveat is that the indexer indxs must inevitably be sorted every time in the while loop. For millions of records, it is no different than indexin().

Topic		Replies	Views
Why is it so complicated to access a row in a DataFrame? General Usage dataframes	13	2003	August 25, 2023
Can indexes to DataFrame column be added to inprove selection performances? Data	2	384	December 28, 2020
Data structure for convenient access to tabular data General Usage dataframes	5	390	February 13, 2023
Any plan for functionality like Pandas loc? General Usage dataframes	15	1170	May 27, 2022
DataFrames vs ndsparse for indexed data Performance	1	342	May 5, 2021

[DataFrames Question]: hash-based row indexing for DataFrames package

Question: Any plan for hash-based indexing for DataFrame

Related topics