Hi!
I’m trying to come up with a way of handling a relatively large volume of data for my laptop (~2GB worth of CSVs), and I’m not 100% sure whether I’m bumping into RAM limitations, Pluto limitations, limitations of the available libraries, or simply my own lack of knowledge about best practices.
I have a bunch (20+) of CSV files that I’d ideally like to load into 3 tables in a DB-like object that I can later query in all sorts of ways.
I started with DataFrames, but I needed to append several of them to one another, and doing so would have eaten all my RAM, since I had to load all the data from the CSVs first.
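To make it concrete, this is roughly what I was doing (the path is just a placeholder):

```julia
using CSV, DataFrames

# Placeholder path; in reality there are 20+ files totalling ~2GB.
files = filter(endswith(".csv"), readdir("data/"; join=true))

# Read every file into its own DataFrame, then concatenate:
dfs = [CSV.read(f, DataFrame) for f in files]   # all of the data in RAM at once
big = vcat(dfs...)                              # plus another copy for the result
```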
An alternative would have been to do everything in a step-wise fashion (see the sketch below), but at some point I’d still be holding more data in memory than I can accommodate. And even if I managed that, it would prevent any sort of analysis, as those results would have to be sunk into a DataFrame as well (?).
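The step-wise version would look something like this (same placeholder path); it lowers the peak, but the accumulated table still has to fit in memory:

```julia
using CSV, DataFrames

files = filter(endswith(".csv"), readdir("data/"; join=true))

# Step-wise: start from the first file and append the rest one at a time,
# so only one extra file's worth of data is in flight at any moment...
big = CSV.read(files[1], DataFrame)
for f in files[2:end]
    append!(big, CSV.read(f, DataFrame))
end
# ...but the full table still ends up in RAM.
```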
I’ve looked into JuliaDB, but trying to install it results in several other packages in my environment being downgraded, and since I’m trying to become proficient with a set of tools/libraries I can count on, JuliaDB doesn’t seem like a solid building block right now (it’s not in active development).
SQLite worked when it came to loading the data, but queries result in long-running cells (at least in Pluto), even for the same data that is handled without issues when read directly from the CSVs.
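For reference, my SQLite attempt was roughly along these lines (the database file, table, and column names are all made up for the example):

```julia
using SQLite, DataFrames, CSV

files = filter(endswith(".csv"), readdir("data/"; join=true))
db = SQLite.DB("mydata.sqlite")   # on-disk database file

# Loading worked fine:
for f in files
    # ifnotexists=true: create the table for the first file, append rows for the rest
    SQLite.load!(CSV.File(f), db, "measurements"; ifnotexists = true)
end

# ...but a query like this keeps the Pluto cell running for a long time:
res = SQLite.DBInterface.execute(db,
        "SELECT station, AVG(value) FROM measurements GROUP BY station") |> DataFrame
```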
What else is out there, when it comes to libraries and/or resources?
I saw the poll about the future development of DataFrames, and I think what I’m looking for is out-of-memory processing: I’d expect everything to stay on disk, except the query results.