Struggling with Julia and large datasets

I would definitely give ClickHouse a try. It was designed to handle log files, and its performance always amazes me (e.g. Summary of the 1.1 Billion Taxi Rides Benchmarks). I would guess it can fit the 4 TB of CSV into 1 TB of memory with some margin, especially if some of the strings are "low cardinality".

3 Likes

Some people are suggesting Arrow, but, if I understand correctly, Arrow is an interchange format, allowing different tools to access the same data in memory.

Maybe I am missing something. How would Arrow help here?

Before data can be processed it has to be loaded, and that alone can be a huge overhead. Arrow is better suited for data storage: it has metadata (no need to guess the types of the variables), it is columnar (data can be loaded without further rearranging it in memory), and it has many other nice things.

Here is an interesting read comparing the performance of Arrow and CSV: Swift as an Arrow.jl | Blog by Bogumił Kamiński
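
For a concrete picture, here is a minimal sketch of round-tripping a table through Arrow.jl (the file name and columns are made up for illustration):

```julia
using Arrow, DataFrames

# Write a table once; the file carries the schema, so column types
# never have to be re-guessed the way they are when parsing CSV.
df = DataFrame(id = 1:3, value = [0.1, 0.2, 0.3], label = ["a", "b", "c"])
Arrow.write("example.arrow", df)

# Reading is cheap: Arrow.Table memory-maps the file and exposes the
# columns directly, without a parsing pass.
tbl = Arrow.Table("example.arrow")
df2 = DataFrame(tbl)   # copy into a mutable DataFrame only if you need to
```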

3 Likes

Yes, it makes sense that any binary format is better than CSV, but from what I have read, Arrow "files" are not meant to store data, and instead should be used for inter-process communication.

The “Relation to other projects” section in the FAQ stresses that Arrow is an in-memory format, and so not really comparable to CSV, and not really suitable for on-disk storage. They instead suggest using Parquet for storage.

Parquet, and other binary storage formats like HDF5, have other advantages, in particular compression.

If you are doing the analysis in Julia and have no need for other tools to access the data in RAM, then I don’t think Arrow helps.

2 Likes

Well, FAQ says

Parquet is a storage format designed for maximum space efficiency, using advanced compression and encoding techniques. It is ideal when wanting to minimize disk usage while storing gigabytes of data, or perhaps more. This efficiency comes at the cost of relatively expensive reading into memory, as Parquet data cannot be directly operated on but must be decoded in large chunks.

Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed (or only lightly so, when using dictionary encoding) but laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed.

So if you want better storage space efficiency, you should use Parquet. If you want fast loading and direct analytical computation, you should go with Arrow.

Both of them are better than CSV anyway. And yes, with 10 TB of data I would choose Parquet too; efficient storage is more important in this case than ease of loading and manipulation.
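
As a rough sketch of what that choice looks like from Julia (assuming Parquet.jl's `write_parquet` and Arrow.jl's `Arrow.write`; the file names and toy table are placeholders):

```julia
using DataFrames, Parquet, Arrow

df = DataFrame(ts = 1:10^6, value = rand(10^6))

# Parquet: heavier encoding and compression, smallest on disk, but the
# data must be decoded in chunks before it can be operated on.
write_parquet("measurements.parquet", df)

# Arrow: laid out for direct in-memory use, so loading is essentially a
# memory-map, at the cost of a larger file.
Arrow.write("measurements.arrow", df)
```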

5 Likes

I am going to be real dumb here. Please feel free to hit me with the clue stick.
We are predicating the discussion on having large line-by-line log files (in some ASCII format) on disk.

Should we be asking about directly streaming the data instead?

Sorry if I am confusing things here.

1 Like

The last time I wanted to use Parquet from Julia, not all data types were supported for writing in Parquet.jl; reading seems a bit more mature. As Arrow.jl is somewhat simpler, it appears to have better reading and writing support in Julia overall, so I chose Arrow. Plus, Arrow also has some of the features of Parquet, e.g. compression and dictionary encoding, which might not give the same space savings as Parquet, but it is still a nice added value. If you only need to read Parquet from Julia, or don't need to write some of the more exotic data types not supported in Parquet.jl, then all of the above might not be a problem, of course.
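
For what it's worth, those Arrow features are just keyword arguments on `Arrow.write`; a small sketch (the codec choice, column names and file name are arbitrary):

```julia
using Arrow, DataFrames

df = DataFrame(sensor = rand(["temp", "pressure", "flow"], 10^6),
               value  = rand(10^6))

# Dictionary-encode the low-cardinality string column and compress the
# buffers; not as compact as Parquet, but much smaller than plain CSV.
Arrow.write("readings.arrow", df; compress = :zstd, dictencode = true)
```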

I also found the whole Parquet versus Arrow distinction quite messy, specifically because both are developed under the Apache flag, have overlapping goals and features and seem to be competing on some level.

4 Likes

Thank you for all of the great feedback and links.

The current state is:
We have 2 physical servers, each with 64 cores (128 logical with hyperthreading), 1 TB of RAM, and 20 TB of NVMe drives in a RAID 6 setup.

We’ve set up a MongoDB instance on one of them and are currently importing data.

Just use ClickHouse. PS: the title of this thread seems misleading because your issue has little to do with Julia.

1 Like

I’ve been playing around with parsing the raw files and saving the results in Arrow and HDF5 formats. I’m still wrapping my head around both.

I’m really starting to rethink my approach to use more functions, and then to wrap my head around how Julia handles threads/processes. Right now, everything is running on 1 CPU, with the other 63 sitting idle.
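
If the per-file work is independent, here is a hedged sketch of spreading it over threads (start Julia with `julia -t 64`; `parse_raw_file` and the input directory are placeholders for your own code):

```julia
using Arrow

files = readdir("/data/raw"; join = true)       # hypothetical input directory

# Each file is parsed and written independently, so a simple threaded
# loop can keep all cores busy.
Threads.@threads for path in files
    tbl = parse_raw_file(path)                  # your existing parsing logic
    Arrow.write(replace(path, r"\.log$" => ".arrow"), tbl)
end
```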

Mongo averages tens of seconds to minutes per data manipulation; part of that is just due to Mongo’s methods and the network transit time between servers. We chose it because a co-worker has experience with it. Basically, it’s a very slow data store, which is why I was trying to figure out a way to analyze the data in pure Julia.

I think that between Arrow, DataFrames, and DataStructures, Julia should be able to do it all on a single server using all of the CPUs, and fit in available memory.

Thank you for all of the great feedback and pointers. I’m still overwhelmed by the Julia ecosystem, but slowly wrapping my head around it.

1 Like

The main question is what type of data you actually have and what you want to do with it: if it’s sensor data, then it’s very likely a time-stamped series of potentially multi-dimensional numeric data with a physical meaning, on which you want to perform computations. If this is the case, then
a) please do not use any fancy DB and
b) never store data at that scale as ASCII (unless you’re forced to).

Both options are performance killers.
Simple example: storing a 64-bit (8-byte) floating point number as ASCII will roughly increase its size (and thus file size and IO load) by a factor of 3, will introduce round-off errors twice (when writing and when reading the number), and will cost you extra computing time for the ASCII <-> binary conversion.
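
A quick REPL check of that size difference (the value is arbitrary):

```julia
x = -12345.678901234567   # a typical Float64 reading

sizeof(x)                 # 8 bytes in binary
length(string(x))         # ~19 characters as ASCII text, before any delimiter
```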

Having spent many years of my life in HPC, where we used to read/write petabytes of data on O(10^5) cores, I would recommend going with HDF5 or defining some custom binary file format. No other IO library offers the level of performance and maturity for numeric data that HDF5 does.

If the data really is ASCII and you have access to the code that outputs it, then make it write the data in a binary format (e.g. using HDF5) which you can then process further. In some settings this alone may give you an order-of-magnitude speedup. If you cannot change how the dataset is created, then think about converting it prior to your computation.
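
A minimal sketch of such a conversion target with HDF5.jl (the dataset names and fake signal are placeholders):

```julia
using HDF5

t = collect(0.0:0.001:10.0)                  # pretend this came from a sensor
v = sin.(t) .+ 0.01 .* randn(length(t))

# Stored as raw Float64: no ASCII round-off, no parsing on read.
h5write("measurements.h5", "signal/time", t)
h5write("measurements.h5", "signal/value", v)

# Later, read back only the dataset you need.
v2 = h5read("measurements.h5", "signal/value")
```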

Other things in HDF5 that may give you additional speedup (see the sketch after this list):

  • optional transparent compression on write / decompression on read will further reduce size, letting you trade computing time for IO
  • chunked IO that lets you read only those portions of the dataset that you actually need
  • extensive tool support lets you conveniently inspect or visualize the data (HDFView, ParaView, …)
  • support for parallel IO using mpio for parallel processing (if you’re willing to dig into MPI; in Julia that’s the packages HDF5.jl and MPI.jl)
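
A sketch of the first two points with HDF5.jl (the array size, chunk shape and compression level are arbitrary, and the keyword names assume HDF5.jl's dataset-creation options):

```julia
using HDF5

h5open("big.h5", "w") do file
    # Chunked, deflate-compressed dataset: compression trades CPU for IO,
    # and chunking lets later readers pull only the slabs they need.
    dset = create_dataset(file, "samples", Float64, (10_000, 1_000);
                          chunk = (1_000, 100), deflate = 3)
    dset[:, :] = rand(10_000, 1_000)
end

# Read just one chunk-sized slab instead of the whole dataset.
part = h5open("big.h5", "r") do file
    file["samples"][1:1_000, 1:100]
end
```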
14 Likes

I suggest the task/channel approach described here: Asynchronous Programming · The Julia Language
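
For this thread's use case, a hedged sketch of that pattern: one task feeds file paths into a channel and several spawned tasks parse them, so reading and parsing overlap (`parse_raw_file` and the directory are placeholders):

```julia
# Producer: pushes paths into a bounded channel and closes it when done.
paths = Channel{String}(64) do ch
    for p in readdir("/data/raw"; join = true)   # hypothetical directory
        put!(ch, p)
    end
end

# Consumers: one task per thread, each pulling paths until the channel empties.
workers = map(1:Threads.nthreads()) do _
    Threads.@spawn for p in paths
        tbl = parse_raw_file(p)                  # your parsing logic
        # ... aggregate or store `tbl` here
    end
end

foreach(wait, workers)
```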

In retrospect, you are correct. I should have re-titled the post to something like “struggling with julia and large datasets”

1 Like

You can still edit the title.

Makes me suspicious… are you training a neural network, or are you computing summary statistics? And do you have a ballpark idea of how many lines you need, and what estimates and accuracy you need? What happens if you use a million lines instead?

This might be the easiest way, and it should be really easy to try out as well: since you have rather beefy hardware, you can use it in single-machine mode, which makes the setup trivial. So I would:

  1. install Spark
  2. use Spark to convert the CSVs into a Parquet file (partitioning based on your intended access pattern)
  3. try a representative query against the Parquet file and if the performance is good enough then do the heavy lifting with Spark (or, if appropriate, even develop the whole analysis using Spark, leveraging user-defined-functions as needed).

A quick motivational read: Benchmarking Apache Spark on a Single Node Machine

1 Like

Apparently after some event (a timer expiring, or replies being posted), the ability to edit goes away.

Thanks, I’ll check it out.

Do you know how Spark and Dask compare in this single-machine mode? I did a much smaller data-handling task on my laptop, where the data didn’t fit in memory. Pandas/Dask was done in an amazingly short time. I couldn’t figure out which Julia tools I could have used.

It depends on what you are doing, of course, but you can get some sense from the H2O benchmark.

1 Like