Lazy database join

e3c6 · April 9, 2020, 11:55pm

I have a very large database A and a much smaller one, B. A is too large to load into memory at once, so I keep it on disk as a gzip-compressed CSV file and want to read it lazily. B, on the other hand, is small and fits comfortably in memory.

The goal is to perform a join between them, on a certain column shared by both A and B. See this example: Joins · DataFrames.jl.

How can I do this, without ever loading A fully into memory?

tbeason · April 10, 2020, 12:48am

It’s possible that you could do this with JuliaDB although I admit I have no idea.

A different way could be to sort of “roll your own”. You can stream the larger file using CSV.jl (and CodecZlib.jl, I suppose). See CSV.Rows. If you are only joining on one column, writing the join logic shouldn’t be too bad. For each row of A that you read in, look at its key value and, where that key exists in B, stick the columns from A that you want onto the matching row of the B table.
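The "roll your own" idea above is a streaming hash join: index the small table by key, then pass over the large file one row at a time. Below is a minimal sketch of that logic, written in Python only because its standard library covers both gzip and CSV; the same structure should carry over to CSV.Rows plus a Dict in Julia. The names `lazy_join`, `b_rows`, and `out` are hypothetical. It assumes B is a list of dicts with identical columns, that all values are compared as strings, and that an inner join is wanted.

```python
import csv
import gzip


def lazy_join(a_path, b_rows, key, out):
    """Inner-join the gzipped CSV at a_path with the in-memory rows b_rows.

    Only one row of A is held in memory at a time. Joined rows are written
    to the text stream `out` as CSV instead of being collected in a list.
    """
    # Index the small table by join key; a key may occur on several rows of B.
    b_index = {}
    for b_row in b_rows:
        b_index.setdefault(b_row[key], []).append(b_row)

    writer = None
    # Mode "rt" decompresses incrementally and yields decoded text lines.
    with gzip.open(a_path, "rt", newline="") as a_file:
        for a_row in csv.DictReader(a_file):
            # Rows of A whose key is absent from B are skipped (inner join).
            for b_row in b_index.get(a_row[key], ()):
                # B's columns come first; on a name clash A's value wins.
                joined = {**b_row, **a_row}
                if writer is None:
                    # Create the writer lazily, once the columns are known.
                    writer = csv.DictWriter(out, fieldnames=list(joined))
                    writer.writeheader()
                writer.writerow(joined)
```

Writing each joined row out immediately, instead of attaching it to B in memory, is a deliberate choice in this sketch: if many rows of A match, the result can be nearly as large as A itself.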

davidanthoff · April 10, 2020, 12:55am

For a second I thought that you might be able to use Query.jl, but then I looked at the code again and joins are not streaming in that sense, unfortunately… Thinking about it, there is actually no reason why they couldn’t be; I would just have to change the implementation. But that is not on the roadmap right now…

Jakob · April 10, 2020, 6:45am

This is possible with JuliaDB, see the docs. I’ve had some problems with these joins in the past, and I’m not sure whether JuliaDB is being actively maintained at the moment, but often it also just worked fine, so I’d give it a try.

lungben · April 10, 2020, 6:54am

An alternative would be to use a database like PostgreSQL.
It can be set up easily using Docker and works with Julia and DataFrames.jl through LibPQ.jl.
However, this may be overkill for your use case…
