JuliaDB: tutorials for large datasets, and other questions

Dear fellow Julians,

I’ve been learning to use JuliaDB to wrangle large datasets (a couple of million entries), and I don’t think I quite understand the difference between in-memory and out-of-core processing.

My first question: let’s say I load a table from a .csv with loadtable(), providing a file path for output. This should load the table even if it’s larger than my laptop’s memory. But what if I then merge/select/transform/join? Is the output of those operations still on disk, or in memory?
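
For concreteness, here’s roughly the workflow I mean (the file paths, column names, and chunk count are just placeholders; I’m not sure `chunks` is even needed here):

```julia
using JuliaDB

# Load a CSV that may not fit in RAM; as I understand it, `output`
# writes binary chunks to disk instead of keeping the whole table in memory.
t = loadtable("data/big.csv"; output = "data/big_bin", chunks = 8)

# Are the results of operations like these still disk-backed, or in memory?
s = select(t, (:id, :value))
j = join(t, s; lkey = :id, rkey = :id)
```
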
And what if I save the table in binary format and load it in another session: does it go into memory, or does it stay out of core?
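
That is, a round trip like this (paths made up):

```julia
using JuliaDB

t = loadtable("data/big.csv"; output = "data/big_bin")
save(t, "data/big_saved")        # write in JuliaDB's binary format

# --- later, in a fresh Julia session ---
using JuliaDB
t2 = load("data/big_saved")      # does this pull everything into memory?
```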

Second question: do you have to convert an IndexedTable to a DIndexedTable to be able to distribute the wrangling? Is there any way to load a .csv directly into a DIndexedTable? I’m trying to join two IndexedTables and it either takes forever or my PC simply runs out of memory, even though my original loading of the data specified an output file path.
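
Here’s a sketch of what I think the distributed setup should look like; the worker count, paths, and key column are placeholders, and part of my question is whether this is even the right approach:

```julia
using Distributed
addprocs(4)                       # worker count chosen arbitrarily
@everywhere using JuliaDB

# Does loading with workers available (and `chunks`) give a DIndexedTable directly?
dt = loadtable("data/big.csv"; output = "data/big_bin", chunks = 4)

# Or is an explicit conversion from an in-memory table the intended route?
t   = loadtable("data/small.csv")
dt2 = distribute(t, 4)            # split into 4 chunks across the workers

# This is the kind of join that takes forever / runs out of memory for me:
j = join(dt, dt2; lkey = :id, rkey = :id)
```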

Third question: is there any tutorial that uses very large datasets? The tutorial with the flights dataset doesn’t help me a lot, because I’m using the same functions it explains and the performance on my large dataset is somewhat disappointing.

I hope my questions aren’t too ambiguous. Thanks in advance for your help!
