I am trying to represent a table in Julia. Each column has a name, and we should be able to add new columns on the fly.
Getting a column or a row should be fast, since these are frequent operations in my work.
Is there any package or tool matching these requirements?
Did you check whether the standard DataFrames.jl is good enough for your purposes? I think it is quite okay performance-wise. I never did any benchmarks per se, but for data analysis of O(100k) rows there was not much delay.
There is quite a large ecosystem around it, so if it works for you, chances are you will find packages for all sorts of operations on data frames.
Yeah, I checked, and it’s a bit too slow for me, since I am working on a real-time project.
In Julia, you don’t generally need fancy specialized data structures to have performant tables. Some basic tables are:
- vector of namedtuples, [(a=1, b="x"), ...] (row-based)
- namedtuple of vectors, (a=[1,2,3,...], b=["x", "y", ...]) (column-based)
- StructArrays.jl (column-based storage with row-based interface)
Some of these will probably work for you.
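To make the first two layouts concrete, here is a minimal sketch using only Base Julia. The variable names and sample values are made up for illustration; it only shows how row access, column access, appending a row, and adding a column look in each layout, not how fast they are.

```julia
# Row-based table: a vector of NamedTuples.
rows = [(a = 1, b = "x"), (a = 2, b = "y"), (a = 3, b = "z")]

# Column-based table: a NamedTuple of vectors.
cols = (a = [1, 2, 3], b = ["x", "y", "z"])

# Fetching a row or a column from each layout.
row2_from_rows  = rows[2]                 # direct indexing
col_a_from_rows = [r.a for r in rows]     # has to visit every row
col_a_from_cols = cols.a                  # direct field access, no copy
row2_from_cols  = map(c -> c[2], cols)    # one lookup per column

# Appending a row: push! onto `rows`, or onto every column vector.
push!(rows, (a = 4, b = "w"))
for (col, val) in zip(cols, (4, "w"))
    push!(col, val)
end

# "Adding" a column builds a new NamedTuple that reuses the existing
# vectors; the column data itself is not copied.
cols2 = merge(cols, (c = [0.1, 0.2, 0.3, 0.4],))
```

The trade-off is the usual one: the row-based layout makes row access trivial and column access a full pass, while the column-based layout is the other way round.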
Can you provide an example (MWE)? It is not clear to me:
- how large will your table be? Which data types do you need?
- what do you want to use it for? Mainly reading, or writing, or querying?
- what are your performance requirements? Which operation must be how fast? Are no memory allocations allowed, or some?
We cannot really help you without more details and preferably a test case and test data.
Actually, the table size may reach 5B elements in the extreme case.
I need to resize the table frequently (adding rows or columns) and read data from a row or a column.
Since it’s a real-time project, taking more than 100 ns for these operations may not suit me, because they are quite frequent.
Also, the table grows but doesn’t shrink.
I will frequently read and write, but there is no querying.
I am already exploring StructArrays.jl. The problem is that in my case I need to be able to grow the array, and we can’t change the size of a tuple or a named tuple, so it’s quite complicated.
Do you mean 5 billion elements as in 5e9, or as in 5e12? (You might know that "billion" is defined differently in the US and in Europe.)
And what would be the size of one element?
Again, without a concrete test case, please don’t expect many useful answers.
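To show why the distinction and the element size matter, here is a rough back-of-the-envelope sketch. The element type is an assumption (the thread does not state it): `Float64`, 8 bytes each. `footprint_gib` is a hypothetical helper, not a library function.

```julia
# Hypothetical helper: raw storage needed for n elements of type T, in GiB.
# Ignores any per-column or per-container overhead.
footprint_gib(n, T) = n * sizeof(T) / 2^30

# Assumption: elements are Float64 (8 bytes); swap in the real element type.
for n in (5 * 10^9, 5 * 10^12)
    println(n, " elements of Float64: ",
            round(footprint_gib(n, Float64); digits = 1), " GiB")
end
```

With 8-byte elements, 5e9 elements already need about 40 GB of memory, and 5e12 would need a thousand times that, so the two readings lead to very different designs.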
Maybe a dictionary mapping column names to vectors? Adding “columns” would be fast. Adding rows would require pushing to each vector, but that’s quite fast, as memory movements are minimized automatically by the resizing mechanism.
(That said, I’m not sure whether what DataFrames does is already at the limit of what can be done performance-wise.)
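A minimal sketch of the dictionary idea, using only Base Julia. The helpers `addrow!` and `getrow` are hypothetical names introduced here, and the choice of a single concrete column type (`Vector{Float64}`) is an assumption; with mixed column types the value type of the dictionary becomes abstract and access gets slower.

```julia
# Hypothetical layout: column name => column vector.
# Assumption: every column holds Float64 values.
table = Dict{Symbol, Vector{Float64}}(
    :x => [1.0, 2.0, 3.0],
    :y => [10.0, 20.0, 30.0],
)

# Add a column: one dictionary insertion, existing columns are untouched.
table[:z] = zeros(3)

# Add a row: push one value onto every column.
function addrow!(table, row)
    for (name, col) in table
        push!(col, row[name])
    end
    return table
end
addrow!(table, (x = 4.0, y = 40.0, z = 0.5))

# Read a column (no copy) and reconstruct a row (one lookup per column).
col_y = table[:y]
getrow(table, i) = Dict(name => col[i] for (name, col) in table)
row4 = getrow(table, 4)
```

Reconstructing a row walks the whole dictionary, so its cost grows with the number of columns; whether that fits a tight time budget would have to be measured on the real data.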
With (named)tuples, it’s “free” to just create a new one! Adding/removing a StructArray column doesn’t copy any data, even though it creates a new NamedTuple and StructArray.
julia> using StructArrays, Accessors
julia> tbl = StructArray(a=[1,2,3], b=[4,5,6])
3-element StructArray(::Vector{Int64}, ::Vector{Int64}) with eltype @NamedTuple{a::Int64, b::Int64}:
(a = 1, b = 4)
(a = 2, b = 5)
(a = 3, b = 6)
julia> @insert tbl.c = rand(3)
3-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Float64}) with eltype @NamedTuple{a::Int64, b::Int64, c::Float64}:
(a = 1, b = 4, c = 0.6328852033561965)
(a = 2, b = 5, c = 0.6543847896731916)
(a = 3, b = 6, c = 0.49184492764291854)
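For comparison, here is a sketch of the same column addition without Accessors, by merging into the NamedTuple of columns directly, followed by appending a row. It assumes the StructArrays.jl package is installed; the sample values are made up.

```julia
using StructArrays

tbl = StructArray(a = [1, 2, 3], b = [4, 5, 6])

# The columns are stored as a NamedTuple of vectors.
cols = StructArrays.components(tbl)

# Build a wider table by merging in a new column; the old vectors are reused.
tbl2 = StructArray(merge(cols, (c = [0.1, 0.2, 0.3],)))

# Rows can be appended in place; each column vector grows by one.
# Because `a` and `b` are shared, `tbl` sees the new row as well.
push!(tbl2, (a = 4, b = 7, c = 0.4))

row = tbl2[4]   # a NamedTuple assembled from the columns
col = tbl2.c    # the underlying Vector{Float64}, not a copy
```

The shared storage is what makes adding a column cheap, but it also means the old and new tables are not independent: growing one grows the shared columns of the other.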
I think I will use this. I wasn’t sure whether having to go through the whole dictionary to reconstruct a row would be good for performance.