Hello,
while I guess a standard Array{T,N} is the most efficient data structure for positional access, I often need to access data by some set of keys.
I am currently using DataFrames for this, selecting rows either with boolean indexing or with DataFramesMeta's @where(:x .== a, :y .== b, ...).
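To make the question concrete, here is roughly what both look like on toy data (column names and values are just placeholders for my real table):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(x = [1, 1, 2, 2], y = ["a", "b", "a", "b"],
               value = [10.0, 20.0, 30.0, 40.0])

# Boolean selection: builds a mask by testing every row
df[(df.x .== 1) .& (df.y .== "b"), :value]

# DataFramesMeta equivalent
sel = @where(df, :x .== 1, :y .== "b")
sel.value
```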
I noticed, however, that with the DataFrame approach all rows are tested, at least against the first condition (in this respect DataFrames, or more likely the compiler, is still efficient, in the sense that further conditions seem to be evaluated only on the rows that passed the previous ones).
In server-like databases you can specify which columns are to be “indexed”, which, as far as I understand, means that an ordered map to the individual records is built so that their values can be looked up faster. The downside is that the map has to be somehow rebuilt on each insert, making insert statements slower, i.e. there is a trade-off between select and insert performance.
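If I understand the idea correctly, a toy Julia version of such an index could keep the keys sorted and binary-search them; a minimal sketch (all names here are mine, just to illustrate the trade-off):

```julia
# Keys kept sorted: look-ups cost O(log n), inserts pay O(n) to keep the order
struct IndexedColumn{K,V}
    keys::Vector{K}
    vals::Vector{V}
end

IndexedColumn{K,V}() where {K,V} = IndexedColumn(K[], V[])

function Base.insert!(c::IndexedColumn, k, v)
    i = searchsortedfirst(c.keys, k)  # slot that keeps `keys` sorted
    insert!(c.keys, i, k)             # O(n) shift: the cost indexing adds to inserts
    insert!(c.vals, i, v)
    return c
end

function lookup(c::IndexedColumn, k)
    i = searchsortedfirst(c.keys, k)  # O(log n) binary search: the select speed-up
    i <= length(c.keys) && c.keys[i] == k || error("key not found")
    return c.vals[i]
end

col = IndexedColumn{String,Float64}()
insert!(col, "b", 2.0)
insert!(col, "a", 1.0)
lookup(col, "a")   # 1.0
```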
When I write equations, I typically write one assignment for every 3-4 look-ups, so I would like the data structure to be optimised for look-up queries (memory is less of an issue for my use case).
I tried to pool() my categorical columns, assuming this would create that kind of “index”, but while the memory needed to store the DataFrame did decrease, the time to look up the data actually increased.
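For reference, what I mean by pooling, sketched here with the PooledArrays package (the pool() call I actually used may come from an older API, but the idea is the same): each distinct level is stored once and rows hold small integer refs, which saves memory but still leaves selection as a full scan.

```julia
using DataFrames, PooledArrays

df = DataFrame(x = PooledArray(repeat(["a", "b", "c"], 1000)),
               value = rand(3000))

# Memory shrinks, but this selection still tests all 3000 rows
df[df.x .== "a", :value]
```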
I also tried converting the DataFrame into a dictionary whose keys are tuples of the categorical columns, e.g. Dict((dim1, dim2, ...) => value), but the look-up speed improved by just 10-15%.
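Roughly like this, with placeholder names:

```julia
using DataFrames

df = DataFrame(dim1 = [1, 1, 2], dim2 = ["a", "b", "a"],
               value = [10.0, 20.0, 30.0])

# One-off conversion: tuple of the key columns => value
table = Dict((r.dim1, r.dim2) => r.value for r in eachrow(df))

table[(1, "b")]   # 20.0, a single (tuple-)hash look-up
```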
Are there other approaches you would suggest? Subtyping AbstractArray and building my own indexing algorithm? Using nested dictionaries of the form Dict(dim1 => Dict(dim2 => Dict(... => value))) (sketched below)? Trying to exploit the indices created by PooledArrays to speed up look-ups?
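For the nested-dictionary option, this is the shape I have in mind (again a minimal sketch with placeholder names):

```julia
# One dictionary level per dimension, values at the innermost level
data = Dict{Int,Dict{String,Float64}}()

# get! creates the inner Dict the first time a dim1 key is seen
function setvalue!(d, dim1, dim2, v)
    inner = get!(() -> Dict{String,Float64}(), d, dim1)
    inner[dim2] = v
end

setvalue!(data, 1, "a", 10.0)
setvalue!(data, 1, "b", 20.0)

data[1]["b"]   # 20.0: one hash look-up per dimension instead of one on a tuple
```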