Left join algorithm for columnar tables

Hello
I’m making custom tables (basically Dict{Symbol, Vector}) since there seems to be no library matching my needs (columnar tables optimized for data insertions). I’m just a beginner (a few days into Julia). I have a working project in q/kdb+ that calculates financial asset portfolios and positions in real time (it’s complicated). The kdb+ license is too costly, so I have to resort to other options. I’ve been making progress with my Table library,
however I can’t figure out how to implement fast left joins, or any left joins at all, and I haven’t implemented keyed tables yet. I have never worked with this at such a low level. Do I have to use dictionaries? It also seems like I can’t initialize them fast enough (in kdb+ you can just do “dict = l1 ! l2”). I’ve been able to match or exceed kdb+'s speed with everything so far (initialization, where clauses, row/column indexing with different iterables, adding tables, insertions, updates of a single column).
Please help

I think you’re looking for the package DataFrames. It has a function leftjoin, and more generally is designed to work with tabular data.

I think OP’s point was quite explicitly that existing libraries don’t match their needs.

OP, well done for matching kdb+ speeds on various use cases, that’s impressive.

Maybe look at FlexiJoins.jl and SplitApplyCombine.jl for packages that implement left joins for general data structures.


Thanks, I figured out how to do what I want. It is as simple as calling findfirst on the joined table’s key column(s) for each key of the receiving table (with 2+ key columns, the keys are a predefined Vector{Tuple} built with collect(zip())). It seems like there’s no quicker way to do that, although keeping the joined table ordered might improve performance through binary search.
SplitApplyCombine.jl seems useful for kdb’s “select by” statements, which I’ll have to replicate as well.
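
For readers following along, the lookup described above could be sketched roughly like this. It is only an illustration, not the poster's actual code: the function names `match_rows_scan` and `match_rows_sorted` and the sample data are made up, and the sorted variant assumes the joined table's key column is already in ascending order.

```julia
# Hypothetical helper names, for illustration only.

# For each left key, find the row of its first match in the right key column,
# or `nothing` when there is no match. Linear scan per key: O(n*m) overall.
function match_rows_scan(left_keys::AbstractVector, right_keys::AbstractVector)
    return [findfirst(isequal(k), right_keys) for k in left_keys]
end

# Same contract, but ASSUMES `sorted_right_keys` is sorted ascending, so each
# lookup is a binary search: O(n log m) overall.
function match_rows_sorted(left_keys::AbstractVector, sorted_right_keys::AbstractVector)
    return map(left_keys) do k
        r = searchsorted(sorted_right_keys, k)   # range of rows equal to k
        isempty(r) ? nothing : first(r)
    end
end

# Multi-column keys: zip the key columns into a vector of tuples.
left_keys  = collect(zip([:a, :b, :c, :a], [1, 2, 3, 9]))
right_keys = collect(zip([:a, :b, :c], [1, 2, 3]))   # already sorted
price      = [10.0, 20.0, 30.0]                      # a column of the joined table

rows = match_rows_scan(left_keys, right_keys)

# Unmatched rows of the receiving table get `missing`, as in a left join.
joined_price = [r === nothing ? missing : price[r] for r in rows]
```

Both helpers return the same row indices on sorted input; the difference is only in how much work each lookup does.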

FlexiJoins.jl supports all kinds of joins: left/inner/right/outer, both equi- and non-equi-joins, by distance, by closest, etc.
The performance should be asymptotically optimal (e.g., O(n + m) instead of O(n*m)), but it can easily be a factor of a few slower than fully optimized implementations. There are definitely a lot of opportunities to improve this constant factor…
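
To illustrate where an O(n + m) bound for an equi-join can come from, here is a minimal hash-join sketch over plain Dict{Symbol, Vector} tables. It is not taken from FlexiJoins.jl; the function name `leftjoin_hash`, the single-key restriction, and the sample tables are assumptions made for the example.

```julia
# Hypothetical function; tables are assumed to be Dict{Symbol,Vector}.
# Joins on a single key column and keeps the first matching right row.
function leftjoin_hash(left::AbstractDict, right::AbstractDict, key::Symbol)
    # Build phase, O(m): map each right key to the row of its first occurrence.
    index = Dict{eltype(right[key]),Int}()
    for (row, k) in enumerate(right[key])
        get!(index, k, row)            # no-op if k is already indexed
    end

    # Probe phase, O(n): one hash lookup per left row; 0 marks "no match".
    rows = [get(index, k, 0) for k in left[key]]

    # Copy the left columns, then gather the right columns,
    # filling unmatched rows with `missing`.
    # Note: a right column with the same name as a left column overwrites it.
    out = Dict{Symbol,Vector}(name => copy(col) for (name, col) in left)
    for (name, col) in right
        name == key && continue
        out[name] = [r == 0 ? missing : col[r] for r in rows]
    end
    return out
end

trades = Dict{Symbol,Vector}(:sym => [:AAPL, :MSFT, :AAPL, :IBM],
                             :qty => [10, 5, 7, 3])
prices = Dict{Symbol,Vector}(:sym => [:AAPL, :MSFT],
                             :px  => [190.0, 410.0])

joined = leftjoin_hash(trades, prices, :sym)
```

When the right-hand keys are unique, the build phase can be replaced by a single call such as `Dict(zip(keys, rows))`, which is roughly the Julia analogue of q's `l1 ! l2`; with duplicate keys that constructor keeps the last occurrence rather than the first.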

FlexiJoins.jl natively supports any Julia collection. Among collections/tables with columnar storage, I can heartily recommend StructArrays.jl. I wonder, what exactly is missing from StructArrays? Is there a specific feature that prompted you to design another columnar type?
