I have a very large database A and a much smaller one B. Since A is so large, I cannot load it into memory all at once. I keep A as a gzip-compressed CSV file on disk, and I want to load it lazily. B, on the other hand, is small and fits comfortably in memory.
The goal is to perform a join between them, on a certain column shared by both A and B. See this example: Joins · DataFrames.jl.
How can I do this, without ever loading A fully into memory?
It’s possible that you could do this with JuliaDB, although I admit I have no idea.
A different way could be to sort of “roll your own”. You can stream the larger file using CSV.jl (and CodecZlib.jl for the gzip decompression, I suppose). See CSV.Rows. If you are only joining on one column, writing the join logic shouldn’t be too bad: for each row of A that you read in, look at its key value, and wherever that key exists in the B table, stick on the columns from A that you want.
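A minimal sketch of that “roll your own” streaming join, assuming an inner join on a column `:id`. The file name `A.csv.gz`, the column names `:id`, `:value`, `:info`, and B’s contents are all placeholders for your actual data:

```julia
using CSV, CodecZlib, DataFrames

# The small table B fits in memory; build a Dict lookup keyed on :id.
B = DataFrame(id = [1, 2, 3], info = ["a", "b", "c"])
lookup = Dict(row.id => row.info for row in eachrow(B))

# Stream A row by row; CSV.Rows materializes only one row at a time,
# and GzipDecompressorStream decompresses on the fly.
io = GzipDecompressorStream(open("A.csv.gz"))

result = DataFrame(id = Int[], value = Float64[], info = String[])
for row in CSV.Rows(io; types = Dict(:id => Int, :value => Float64))
    info = get(lookup, row.id, missing)
    info === missing && continue   # inner join: skip keys not present in B
    push!(result, (id = row.id, value = row.value, info = info))
end
close(io)
```

Only `result` and the Dict built from B ever live in memory, so peak usage is bounded by B plus the join output, not by A.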
For a second I thought that you might be able to use Query.jl, but then I looked at the code again and joins are not streaming in that sense, unfortunately… Thinking about it, I think there is actually no reason why they couldn’t, I would just have to change the implementation. But not on the roadmap right now…
This is possible with JuliaDB, see the docs. I’ve had some problems with these joins in the past, and I’m not sure whether JuliaDB is being actively maintained at the moment, but it has often just worked fine, so I’d give it a try.
An alternative would be to use a database like PostgreSQL.
It can easily be set up using Docker, and it works with Julia and DataFrames.jl through LibPQ.jl.
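If you go that route, a rough sketch of what the Julia side could look like, assuming you have already loaded A and B into PostgreSQL as tables `a` and `b` (e.g. via `COPY`) with a shared column `id`; the connection string is hypothetical:

```julia
using LibPQ, DataFrames

# Connect to a running PostgreSQL instance (adjust credentials to yours).
conn = LibPQ.Connection("host=localhost dbname=mydb user=me")

# The database performs the join out-of-core; Julia only ever sees
# the result set, not the full table a.
result = execute(conn, """
    SELECT a.id, a.value, b.info
    FROM a
    JOIN b USING (id)
""")
joined = DataFrame(result)   # LibPQ results satisfy the Tables.jl interface

close(conn)
```

The point of the database approach is exactly that the join never touches Julia’s memory until the (presumably much smaller) result comes back.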
However, this may be overkill for your use case…