Struggling with Julia and large datasets

TL;DR: Skip to “the questions” below.

backstory

I’m working with a decent-sized dataset: around 100 billion log lines of sensor data, growing at 258 million lines per day at nanosecond resolution. Originally, I quickly wrote some prototype code in Ruby to analyze the data from the logfiles in place. After the disk I/O became too great and started burning out disks, we tried a series of databases: SQLite, MariaDB, PostgreSQL, Lucene (Solr and Elasticsearch), Cassandra, and now MongoDB. What I learned is that database administration is a full-time job that distracts from getting results from our data, and it also requires hardware costing the equivalent of 3-5 grad students. And none of the databases perform as expected/advertised without massive hardware clusters.

We just received a grant for another six months of progress on this research. Last month, I started to wholesale rethink what we’re doing, which led me to Julia on the recommendation of some friends in various other organizations. Compared to Ruby, Julia is vastly faster at everything. I’ve been able to write new code at far higher productivity and performance levels than in the first prototype period.

the questions

Working with a statistically valid subset of the data, around 33 billion log lines, what is the “Julia way” to work with it? DataFrames.jl looks nice, but our subset is around 10TB and that doesn’t fit in RAM (we have 1TB). JuliaDB seems abandoned? Even with this smaller dataset, SQL databases struggle. I wrote a parser in Julia to convert the raw logs into CSV, which greatly reduces the total data size to around 4TB. We did upgrade the servers to pure NVMe disks, so filesystem I/O is vastly faster now, but still ~1000x slower than memory.

Could I treat the CSVs on the filesystem as a “database” and write code as queries against them? Is there some other way I’m missing? Do we really have to suffer the slings and arrows of outrageous databases? Could we hire Julia Computing or some Julia consultant/company to help figure this out, versus making a grad student suffer through databases and being a code monkey?

Happy to read whatever I’m missing, while we fight with importing BSON into MongoDB this week.

I’m truly impressed with all things Julia at this point. Julia is a joy compared to writing C/asm for performance.

Thanks!

21 Likes
  • How many columns are there?
  • What data types?
  • What kind of queries do you need to make on the data?
2 Likes
  1. between 3 and 6 columns
  2. data types are timestamps (ISO 8601 format at nanosecond resolution), Int64, or String.
  3. The questions mostly key on timing and graph analysis of sensor events.

I’m very much not an expert, but I’ll make some suggestions, and hopefully somebody else will make better ones.

One big option is to look into memory mapped arrays. This lets you create an array larger than memory, where only part of the array is loaded into memory at a time. See Memory-mapped I/O · The Julia Language
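A minimal sketch of that pattern (toy sizes and a hypothetical single Int64 column, but the same approach works at TB scale):

```julia
using Mmap

# Write a small binary file of Int64 values, then map it back without
# loading it all into memory at once.
path = tempname()
open(path, "w") do io
    write(io, collect(Int64, 1:1_000))
end

vals = open(path) do io
    Mmap.mmap(io, Vector{Int64}, 1_000)
end

# Pages are faulted in on demand, so reductions stream through the OS cache.
println(sum(vals))  # 500500
```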

Another option is to look into using distributed Julia. In Julia, a “cluster” is just a bunch of machines with passwordless ssh access to each other, so it’s actually relatively easy to set up. I use KissCluster on AWS. Here are some relevant links:
https://docs.julialang.org/en/v1/stdlib/Distributed/
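A hedged sketch of the map-reduce shape this enables; `addprocs(4)` stands in for the ssh workers, and `partial_count` is a hypothetical per-chunk job (in practice each chunk would be one log file):

```julia
using Distributed
addprocs(4)  # for a real cluster: addprocs([("node1", 8), ("node2", 8)]) over passwordless ssh

# Hypothetical per-chunk work: count entries matching some condition.
@everywhere partial_count(chunk) = count(x -> x % 7 == 0, chunk)

# Split the "data" into chunks, farm them out, and reduce the results.
chunks = [i:i+999 for i in 1:1000:10_000]
total = sum(pmap(partial_count, chunks))
println(total)  # 1428
```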

4 Likes

Might consider using Arrow.jl instead of CSV.

9 Likes

Or maybe something like BigQuery is a better bet.

1 Like

If the data is 4TB as a CSV, you could probably store it more efficiently in a binary format rather than text.

Try converting the dataset to an HDF5 file. If you can’t create the data in one go because loading the whole CSV is impossible, you can incrementally write to the file:

https://juliaio.github.io/HDF5.jl/stable/#Chunking-and-compression
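A sketch of that incremental write (toy batch sizes; the dataset name `timestamp_ns` and the chunk size are made up): an extendible chunked dataset can be grown batch by batch without ever holding the whole table in memory.

```julia
using HDF5

path = tempname() * ".h5"
h5open(path, "w") do file
    # ((0,), (-1,)) = start empty, unlimited max extent; chunking is required
    # for extendible datasets (real chunk sizes would be much larger).
    dset = create_dataset(file, "timestamp_ns", Int64, ((0,), (-1,)); chunk=(1024,))
    for batch in ([1, 2, 3], [4, 5])          # stand-in for parsed CSV batches
        n = length(dset)
        HDF5.set_extent_dims(dset, (n + length(batch),))
        dset[n+1:n+length(batch)] = batch
    end
end

h5open(path, "r") do file
    println(read(file["timestamp_ns"]))       # [1, 2, 3, 4, 5]
end
```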

6 Likes

Have you tried dedicated time-series/analytical databases? While not as maintenance free as pure Julia, they can process larger than memory datasets efficiently.

QuestDB and ClickHouse are popular options in this arena and have Julia connectors.

Other than that, forget CSVs and stick to binary formats: Arrow or Parquet for column-oriented storage; Avro is popular for row-oriented data.

6 Likes

Thank you for all of the great feedback!

We’re forbidden from using “the cloud” due to various legal issues with the datasets (healthcare/PII), otherwise we could outsource all this db overhead to some cloud provider.

@jzr I’ve not heard about Apache Arrow, so I’ll check it out.

@jebej Will investigate HDF5!

What I am learning is that I should assume we need a db admin as part of our next grant application.

Thanks!

1 Like

@Cloves_Almeida we researched those, but it seems like more stuff to manage; to be fair, we haven’t actually tried them yet.

> What I am learning is that I should assume we need a db admin as part of our next grant application.

Depending on what you mean by administration, some of the work can be avoided by using a managed database from a cloud vendor.

I think that was explicitly ruled out earlier because of legal requirements.

1 Like

Definitely +1 for Arrow.jl. You can even create the arrow file in batches from your csvs, and then load it into a (read-only) DataFrame even if it’s larger than memory.
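A sketch of that batched conversion, using `Tables.partitioner` so each CSV becomes its own record batch in one Arrow file (the CSVs here are toy temp files standing in for real daily exports):

```julia
using Arrow, CSV, Tables

# Toy stand-ins for daily CSV exports.
csvs = map(1:2) do day
    p = tempname() * ".csv"
    write(p, "t,v\n$(day),10\n$(day),20\n")
    p
end

# Tables.partitioner streams each CSV in as its own record batch, so the
# whole table never has to be materialized in memory at once.
out = tempname() * ".arrow"
Arrow.write(out, Tables.partitioner(CSV.File(p) for p in csvs))

# Arrow files are memory-mapped on read, so this also works for
# larger-than-RAM data.
tbl = Arrow.Table(out)
println(length(tbl.t))  # 4
```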

7 Likes

My recommendation would be to do as much as possible within the database. For example, I tend to develop most of the cleaning and pre-processing in a PostgreSQL server and only then bring what I need into Julia. Arrow might be useful as an intermediate step as needed.

2 Likes

Ah you’re right.

I might check out self-hosted TimescaleDB.

Using a pre-existing dedicated DB might be better, as others have suggested, but if you still want to do it in Julia: FYI, Transducers.jl provides a set of composable and data-structure/input-agnostic tools for data manipulation on sequential, threaded, GPU, and distributed execution contexts. For example, there is a family of parallelizable transducers such as GroupBy, ReducePartitionBy, SplitBy (WIP), and Consecutive. They are designed to be very generic; e.g., many of these “advanced” transducers are likely to be supported on GPU. Also, the transducer approach fits very well with out-of-core parallel and streaming execution.
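To illustrate the style (a toy pipeline, not tied to the dataset in this thread): the same transducer composition can be folded sequentially or across threads without changing the pipeline definition.

```julia
using Transducers

# One pipeline definition: keep even numbers, square them.
xf = Filter(iseven) |> Map(x -> x^2)

println(foldxl(+, xf, 1:100))  # sequential fold  -> 171700
println(foldxt(+, xf, 1:100))  # threaded fold, same result
```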

7 Likes

I’d definitely recommend using a distributed, event-oriented database. The usual relational DBs (PostgreSQL, MySQL/MariaDB), document-oriented stores (Mongo), and NoSQL solutions (Cassandra) are not well suited for this kind of task. You should instead consider something like ClickHouse, Druid, Pinot, or InfluxDB. Or even just plain old Spark, which was designed for this sort of task and is rather fast. But yes, these are all “Big Data” solutions and they require some time to set up properly.

You can do something in plain Julia of course, but then at least some ideas from these solutions should be borrowed. Data should be partitioned and indexed, i.e. it should be split into relatively small files spread over nodes, and there should be a mechanism to identify where each necessary piece of information is located.

Then you need some sort of central mechanism which can take an incoming request, determine which tasks should be allocated to each node, run the necessary jobs, and afterwards combine the results.

Of course, there are other things to worry about: what to do if one of the workers fails, and how to restart its job? How to balance new incoming data between nodes?

Despite this complexity, if you have a very fixed set of tasks, then all of it can be solved in Julia with a rather short script, I think. Libraries and databases are big and complex because they aim to be universal. But still, you should think ahead: how might your set of tasks evolve? Could the questions you are going to ask this data-processing system turn out to be independent of each other and require additional effort?
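As a sketch of the partitioning idea (the `data/yyyy-mm-dd.arrow` layout is hypothetical): with files partitioned by date, a range “query” starts as nothing more than listing the right paths, which can then be farmed out to workers.

```julia
using Dates

# Hypothetical layout: one file per day under data/.
partition_path(day::Date) = joinpath("data", Dates.format(day, "yyyy-mm-dd") * ".arrow")

# A time-range query only ever touches the partitions in that range.
partitions_between(from::Date, to::Date) = [partition_path(d) for d in from:Day(1):to]

paths = partitions_between(Date(2021, 1, 1), Date(2021, 1, 3))
println(paths)  # three paths, 2021-01-01 through 2021-01-03
```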

And it’s hardly possible to avoid distributed computation in your case: these amounts of data are big, and multiple nodes will always work faster than a single server.

Oh, and if you want to do it manually (or even with something like Spark), you should definitely use Arrow or Parquet as the storage format. They are column-based and have internal compression, so they are a much better choice than CSV.

8 Likes

@sampope Out of interest, what instruments are you logging data from?

1 Like

(I didn’t read the whole post here.) I think OnlineStats.jl (joshday/OnlineStats.jl on GitHub: single-pass algorithms for statistics) is capable of working with endless data, so you could stream it.
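For instance (toy batches standing in for a real stream), a single-pass statistic can be fit incrementally in constant memory:

```julia
using OnlineStats

# Fit a running mean batch by batch; memory use stays constant no matter
# how much data streams through.
m = Mean()
for batch in ([1.0, 2.0, 3.0], [4.0, 5.0])   # stand-in for streamed chunks
    fit!(m, batch)
end
println(value(m))  # 3.0
```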

5 Likes

DuckDB might also be an interesting option as it promises to be a fast analytical database that is easy to set up. There is no dedicated Julia interface yet, though.

1 Like