Question: Any plan for hash-based indexing for DataFrame
Suppose the following table is given (actual table that I work on contains about a million records):
df = DataFrame(id=[7,1,5,3,4,6,2], val=[5,6,9,3,4,10,4])
Now I am given some arrays of Ids i.e. ids = [5,4,2] (and more of this kind) and asked to extract rows that match the ids.
For small dataset, I could do like using off-the-shelf method indexin()
:
df[indexin(ids, df.id), :]
But this operation should be too slow to be suitable for large dataset. (with actual dataset and problem given, it took 1.5 hours)
In Pandas however it could be performed effectively by doing as (although Pandas (and python) generally not so comfortable in many other aspects):
df = df.set_index("id")
df.loc[ids, :]
I maybe mimick the approach in Julia DataFrame as (and with this approach, it took 20 seconds):
hashidx = Dict(zip(df.id, 1:nrow(df)))
df[[hashidx[i] for i in ids], :]
I believe this kind of row indexing is required very often in practical data wrangling situations, however not supported in DataFrames package.