Which are efficient data structures for querying data by name?

sylvaticus · October 6, 2017, 7:32am

Hello,
while I guess a standard Array{T,n} is the most efficient data structure for positional access, I often need to access data by key.
I am currently using DataFrames to access this data, using either boolean selection or @where(:x .== a, :y .== b, ..).
I did notice, however, that with the DataFrame approach all values are tested, at least for the first condition (in this respect DataFrames, or more likely the compiler, is still efficient, in the sense that further conditions seem to be tested only on the rows that passed the previous ones).
In server-like databases you can specify which columns are to be “indexed”, which, as far as I understand, means that an ordered map to the individual records is created so that it is faster to look up their values. The downside is that on each insert the map has to be updated somehow, making insert statements slower, i.e. there is a trade-off between select and insert performance.
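
As a rough illustration of the indexing idea described above, here is a minimal sketch that uses only Base Julia. The column names, the data, and the helper `rows_for` are hypothetical and not part of any package. It builds a sorted index once with `sortperm`, then finds matching rows by binary search with `searchsorted` instead of scanning every value.

```julia
# Hypothetical columns, for illustration only.
region = ["north", "south", "east", "north", "west", "south"]
value  = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

# Build the index once: a permutation that sorts the key column,
# plus the keys in sorted order.
perm = sortperm(region)
sorted_keys = region[perm]

# Hypothetical helper: return the original row numbers whose key equals `key`.
# `searchsorted` does a binary search and returns a (possibly empty) range.
function rows_for(perm, sorted_keys, key)
    r = searchsorted(sorted_keys, key)
    return perm[r]
end

rows = rows_for(perm, sorted_keys, "north")
matched = value[rows]   # values of the rows where region == "north"
```

In this sketch, adding a row means recomputing `perm` and `sorted_keys` (or inserting into both in sorted position), which is the select/insert trade-off mentioned above.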
When I write equations, I typically write one assignment for every 3-4 look-ups, so I would like the data structure to be optimised for look-up queries (memory is less of an issue for my use case).
I did try to pool() my categorical columns, assuming that this would create this kind of “index”, but while the memory needed to store the DataFrame did decrease, the time to look up the data actually increased.
I also tried converting the DataFrame into a dictionary where the key is a tuple of the categorical columns (e.g. Dict((dim1, dim2, ..) => value)), but the look-up speed increased by just 10-15%.

Are there other approaches that you would suggest? Subtyping AbstractArray and building my own index algorithm? Using dictionaries of the form Dict(dim1 => Dict(dim2 => Dict(.. value)))? Trying to exploit the indexes created for PooledArrays to speed up look-ups?
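
To make the two dictionary layouts mentioned in this post concrete, here is a small sketch in Base Julia. The column names and data are made up for illustration. It builds a flat dictionary keyed by a tuple of the dimensions, and a nested dictionary with one level per dimension.

```julia
# Hypothetical long-format data: two categorical dimensions and one value column.
dim1  = ["a", "a", "b", "b"]
dim2  = ["x", "y", "x", "y"]
value = [1.0, 2.0, 3.0, 4.0]

# Flat layout: the key is a tuple of the dimensions.
flat = Dict(zip(zip(dim1, dim2), value))

# Nested layout: one dictionary level per dimension.
nested = Dict{String,Dict{String,Float64}}()
for (d1, d2, v) in zip(dim1, dim2, value)
    # Create the inner dictionary the first time a dim1 key is seen.
    inner = get!(nested, d1) do
        Dict{String,Float64}()
    end
    inner[d2] = v
end

flat[("b", "x")]    # 3.0
nested["b"]["x"]    # 3.0
```

Both layouts give hash-based look-ups. The flat form hashes the whole tuple once, while the nested form does one look-up per level but makes it easy to grab everything under a single `dim1` key. Which one is faster for a given workload is something to measure rather than assume.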

mkborregaard · October 6, 2017, 10:20am

Maybe check out IndexedTables.jl or JuliaDB.jl?

sylvaticus · October 14, 2017, 12:08pm

Thank you… indeed IndexedTables is ~60 times faster than my implementation based on querying DataFrames…
