Struggling with Julia and large datasets

Well, the FAQ says:

Parquet is a storage format designed for maximum space efficiency, using advanced compression and encoding techniques. It is ideal when wanting to minimize disk usage while storing gigabytes of data, or perhaps more. This efficiency comes at the cost of relatively expensive reading into memory, as Parquet data cannot be directly operated on but must be decoded in large chunks.

Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed (or only lightly so, when using dictionary encoding) but laid out in a natural format for the CPU, so that data can be accessed at arbitrary places at full speed.

So if you want more compact storage on disk, you should use Parquet. If you want fast loading and direct analytical computation, then you should go with Arrow.
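For reference, here's a minimal sketch of what that looks like in Julia, assuming Arrow.jl, Parquet.jl, and DataFrames.jl (the file names are arbitrary, and the API calls are the documented high-level ones, so double-check against the package versions you have installed):

```julia
using DataFrames, Arrow, Parquet

df = DataFrame(id = 1:1_000_000, value = rand(1_000_000))

# Parquet: compact on disk, but reading means decoding compressed chunks
write_parquet("data.parquet", df)
df_pq = DataFrame(read_parquet("data.parquet"))

# Arrow: larger on disk, but Arrow.Table memory-maps the file, so the
# columns are usable almost immediately without a decode step
Arrow.write("data.arrow", df)
df_ar = DataFrame(Arrow.Table("data.arrow"))
```

Note that wrapping the result in a DataFrame copies the columns into Julia-native vectors; you can also run Tables.jl-compatible operations on the Arrow.Table directly to stay zero-copy.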

Both of them are better than CSV anyway. And yes, with 10 TB of data I would choose Parquet too; efficient storage matters more in that case than ease of loading and manipulation.
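At that scale the data won't fit in memory anyway, so you'd process it in chunks rather than materialize it all at once. A sketch of the pattern using Arrow.jl's record-batch streaming (Arrow.Stream yields one batch at a time; the file and column name here are made up, and the file must have been written in multiple record batches for this to help):

```julia
using Arrow

total = 0.0
# Arrow.Stream iterates the file lazily, producing one table per
# record batch, so only a single batch is in memory at a time
for batch in Arrow.Stream("huge.arrow")
    total += sum(batch.value)   # hypothetical column name
end
```

Parquet readers generally support the same idea per row group, since a Parquet file is already split into independently decodable chunks.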
