[DataFrames Question]: hash-based row indexing for DataFrames package

Question: Any plan for hash-based indexing for DataFrame

Suppose the following table is given (actual table that I work on contains about a million records):

df = DataFrame(id=[7,1,5,3,4,6,2], val=[5,6,9,3,4,10,4])

Now I am given some arrays of Ids i.e. ids = [5,4,2] (and more of this kind) and asked to extract rows that match the ids.

For small dataset, I could do like using off-the-shelf method indexin():

df[indexin(ids, df.id), :]

But this operation should be too slow to be suitable for large dataset. (with actual dataset and problem given, it took 1.5 hours)

In Pandas however it could be performed effectively by doing as (although Pandas (and python) generally not so comfortable in many other aspects):

df = df.set_index("id")
df.loc[ids, :]

I maybe mimick the approach in Julia DataFrame as (and with this approach, it took 20 seconds):

hashidx = Dict(zip(df.id, 1:nrow(df)))
df[[hashidx[i] for i in ids], :]

I believe this kind of row indexing is required very often in practical data wrangling situations, however not supported in DataFrames package.

As a middle ground, you can use Query:

julia> using Random, DataFrames, Query, BenchmarkTools

julia> df = DataFrame(id=shuffle(1:10^6), val=shuffle(1:10^6));

julia> want_ids = Set(rand(1:10^6, 10^5));

julia> @btime $df[indexin($want_ids, $df.id), :]
  81.002 ms (80 allocations: 67.44 MiB)

julia> @btime $df |>
       @filter(_.id in $want_ids) |>
       DataFrame
  20.035 ms (91 allocations: 4.00 MiB)

this is slower than your hash table, but way more flexible, and I agree there should be some hash-based approach maybe just I don’t know

1 Like

Thanks @jling
I just experimented. Still very slow for processing real life data but It’s good to have learned another good trick!

how large exactly is your data? million rows and how many columns?

That’s a spectacularly inefficient way to index into a DataFrame. Try this

using DataFrames

df = DataFrame(
    id=rand(1:1_000, 100_000_000),
    val=rand(100_000_000))

using BenchmarkTools

filter_ds(df, ids) = begin
    index = in.(df.id, Ref(ids))
    df[index, :]
end

@benchmark df2 = filter_ds($df, $ids)

@time df2 = filter_ds(df, ids)

BTW, the idea is you can use true/false to index into the array. See

DataFrame(a = 1:3)[[true, false, true], :]

Thanks @xiaodai

your suggestion shows much better speed than that of indexin or Query.jl but just a few order of magnitudes.
Still it can’t beat Dict’s hash.

Since @jling just asked about how large the data is, here it is:

This is to reproduce the following paper’s result (Churn prediction)

Paper: https://www.andrew.cmu.edu/user/lakoglu/pubs/StackOverflow-churn.pdf

Description of datasets: https://ia800107.us.archive.org/27/items/stackexchange/readme.txt

Some reduced datasets (users_reduce.csv, posts.reduce.csv) are available at:
https://drive.google.com/open?id=1Fp_7GDH_t7xfnU8aXeKrcBC54_nECOcu

length of records are

julia> nrow(users)
992110

julia> nrow(posts)
9523987

respectively.

1 Like

good point, I wasn’t thinking earlier, thanks for the source! now I wonder why Query doesn’t do this, maybe they should add a specialized filter for this use case

julia> @btime $df[in.($df.id, Ref($want_ids)),:]
  17.471 ms (33 allocations: 2.30 MiB)

It does take some time to build a hash base index. As far as I know DataFrame has a backlog of features and indexing is of them. I don’t think it’s that hard to build a IndexedDataFrame if you want to try.

Otherwise, you can try IndexedTables.jl which already has indexing features, if you must.

You can also sort the data and save it with JDF.jl, so the next time you load, it’s already sorted which beats indexing.

1 Like

@xiaodai, Thank you for the valuable information. I will try that out! :slight_smile:

This is a fast (and correct) way to build a hash dict

build_dict(df) = begin    
    df_index = by(DataFrame(id = df.id, rowid = 1:size(df, 1)), :id, positions = :rowid => x->(x,))
    d = Dict(c.id => c.positions[1] for c in eachrow(df_index))    
    d
end

@time hashidx = build_dict(df)

I don’t think the below in your original post is correct

hashidx = Dict(zip(df.id, 1:nrow(df)))

@xiaodai, your are right. Actually I assumed, the df.id has no duplicates. Your solution is general in that respect. Pandas’s index seems to take only the last one when there are duplicates on the index.

Generally not a fan of Query.jl as it has poor group by performance. It doesn’t scale for modern data wrangling workloads.

Try NDSparse from IndexedTables

nds = ndsparse((id=[7,1,5,3,4,6,2],), (val=[5,6,9,3,4,10,4],))
idxs = [5,4,2]
foo=x->getindex(nds, x).val;
foo.(idxs)  # returns Vector{Int}

or

nds[sort(idxs)]  # returns NDSparse

@zgornel, thank you!

That seems to be the closest solution among off the self methods.
In order to adapt it to my data, I needs some polishing, for example nds[ ] does not accept empty array.
But I need to know how to access column of nds. I imported JuliaDB, IndexedTables but select(nds, :val) kind of operation not working nor nds[:, :val]. I am browsing the doc…but still can’t find.

Columns access:

getproperty(columns(nds), :val)

or simply

columns(nds).val

For row access,
rows(nds) provides a row iterator where each row is a NamedTuple of the form (id=..., val=...).

EDIT: Indeed, the underlying structures for both IdexedTable, NDSparse are elegant enough to allow one defining other access/selection methods. Generally, for NDSparse, low-level access is provided by
the data and index properties (i.e. nds.data, nds.index)

@zgornel, it turns out that using ndsparse for the purpose of fast indexing gives no advantage. I think the caveat is that the indexer indxs must inevitably be sorted every time in the while loop. For millions of records, it is no different than indexin().

1 Like