Struggling with Julia and large datasets

TL;DR: Skip to “the questions” below.

backstory

I’m working with a decent-sized dataset: around 100 billion log lines of sensor data, growing at 258 million lines per day at nanosecond resolution. Originally, I quickly wrote some prototype code in Ruby to analyze the data from the logfiles in place. After the disk I/O became too great and started burning out disks, we tried a series of databases: SQLite, MariaDB, PostgreSQL, Lucene (Solr and Elasticsearch), Cassandra, and now MongoDB. What I learned is that database administration is a full-time job that distracts from getting results from our data, and it also requires hardware costing the equivalent of 3-5 grad students. And none of the databases perform as expected/advertised without massive hardware clusters.

We just received a grant for another six months of progress on this research. Last month, I started to wholesale rethink what we’re doing, which led me to Julia on the recommendation of some friends in various other organizations. Compared to Ruby, Julia is vastly faster at everything. I’ve been able to write new code at far higher productivity and performance levels than in the first prototype period.

the questions

Working with a statistically valid subset of the data, around 33 billion log lines, what is the “Julia way” to work with it? DataFrames.jl looks nice, but our subset is around 10TB and that doesn’t fit in RAM (we have 1TB). JuliaDB seems abandoned? Even with this smaller dataset, SQL databases struggle. I wrote a parser in Julia to convert the raw logs into CSV, which greatly reduces the total data size to around 4TB. We did upgrade the servers to pure NVMe disks, so filesystem I/O is vastly faster now, but still ~1000x slower than memory.

Could I treat the CSVs on the filesystem as a “database” and write code as queries against them? Is there some other way I’m missing? Do we really have to suffer the slings and arrows of outrageous databases? Could we hire Julia Computing or some Julia consultant/company to help figure this out, versus making a grad student suffer through databases and being a code monkey?

Happy to read whatever I’m missing, while we fight with importing BSON into MongoDB this week.

I’m truly impressed with all things Julia at this point. Julia is a joy compared to writing C/asm for performance.

Thanks!

21 Likes
  • How many columns are there?
  • What data types?
  • What kind of queries do you need to make on the data?
2 Likes
  1. between 3 and 6 columns
  2. data types are timestamps (ISO 8601 format at nanosecond resolution), Int64, or String.
  3. The questions mostly key on timing and graph analysis of sensor events.

I’m very much not an expert, but I’ll make some suggestions, and hopefully somebody else will make better ones.

One big option is to look into memory mapped arrays. This lets you create an array larger than memory, where only part of the array is loaded into memory at a time. See Memory-mapped I/O · The Julia Language
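A minimal sketch of that pattern (toy sizes and a hypothetical single Int64 column, but the same approach works at TB scale):

```julia
using Mmap

# Write a small binary file of Int64 values, then map it back without
# loading it all into memory at once.
path = tempname()
open(path, "w") do io
    write(io, collect(Int64, 1:1_000))
end

vals = open(path) do io
    Mmap.mmap(io, Vector{Int64}, 1_000)
end

# Pages are faulted in on demand, so reductions stream through the OS cache.
println(sum(vals))  # 500500
```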

Another option is to look into using distributed Julia. In Julia, a “cluster” is just a bunch of machines with passwordless ssh access to each other, so it’s actually relatively easy to set up. I use KissCluster on AWS. Here are some relevant links:
https://docs.julialang.org/en/v1/stdlib/Distributed/
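A hedged sketch of the map-reduce shape this enables; `addprocs(4)` stands in for the ssh workers, and `partial_count` is a hypothetical per-chunk job (in practice each chunk would be one log file):

```julia
using Distributed
addprocs(4)  # for a real cluster: addprocs([("node1", 8), ("node2", 8)]) over passwordless ssh

# Hypothetical per-chunk work: count entries matching some condition.
@everywhere partial_count(chunk) = count(x -> x % 7 == 0, chunk)

# Split the "data" into chunks, farm them out, and reduce the results.
chunks = [i:i+999 for i in 1:1000:10_000]
total = sum(pmap(partial_count, chunks))
println(total)  # 1428
```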

4 Likes

Might consider using Arrow.jl instead of CSV.

9 Likes

Or maybe something like BigQuery is a better bet.

1 Like

If the data is 4TB as a CSV, you could probably store it more efficiently in a binary format rather than text.

Try converting the dataset to an HDF5 file. If you can’t create the data in one go because loading the whole CSV is impossible, you can incrementally write to the file:

https://juliaio.github.io/HDF5.jl/stable/#Chunking-and-compression
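A sketch of that incremental write (toy batch sizes; the dataset name `timestamp_ns` and the chunk size are made up): an extendible chunked dataset can be grown batch by batch without ever holding the whole table in memory.

```julia
using HDF5

path = tempname() * ".h5"
h5open(path, "w") do file
    # ((0,), (-1,)) = start empty, unlimited max extent; chunking is required
    # for extendible datasets (real chunk sizes would be much larger).
    dset = create_dataset(file, "timestamp_ns", Int64, ((0,), (-1,)); chunk=(1024,))
    for batch in ([1, 2, 3], [4, 5])          # stand-in for parsed CSV batches
        n = length(dset)
        HDF5.set_extent_dims(dset, (n + length(batch),))
        dset[n+1:n+length(batch)] = batch
    end
end

h5open(path, "r") do file
    println(read(file["timestamp_ns"]))       # [1, 2, 3, 4, 5]
end
```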

6 Likes

Have you tried dedicated time-series/analytical databases? While not as maintenance free as pure Julia, they can process larger than memory datasets efficiently.

QuestDB and ClickHouse are popular options in this arena and have Julia connectors.

Other than that, forget CSVs and stick to binary formats: Arrow or Parquet for column-oriented storage; Avro is popular for row-oriented data.

6 Likes

Thank you for all of the great feedback!

We’re forbidden from using “the cloud” due to various legal issues with the datasets (healthcare/PII), otherwise we could outsource all this db overhead to some cloud provider.

@jzr I’ve not heard about Apache Arrow, so I’ll check it out.

@jebej Will investigate HDF5!

What I am learning is that I should assume we need a db admin as part of our next grant application.

Thanks!

1 Like

@Cloves_Almeida we researched those, but it seems like more stuff to manage; to be fair, we haven’t actually tried them yet.

> What I am learning is that I should assume we need a db admin as part of our next grant application.

Depending on what you mean by administration, some of the work can be avoided by using a managed database from a cloud vendor.

I think that was explicitly ruled out earlier because of legal requirements.

1 Like

Definitely +1 for Arrow.jl. You can even create the arrow file in batches from your csvs, and then load it into a (read-only) DataFrame even if it’s larger than memory.
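A sketch of that batched conversion, using `Tables.partitioner` so each CSV becomes its own record batch in one Arrow file (the CSVs here are toy temp files standing in for real daily exports):

```julia
using Arrow, CSV, Tables

# Toy stand-ins for daily CSV exports.
csvs = map(1:2) do day
    p = tempname() * ".csv"
    write(p, "t,v\n$(day),10\n$(day),20\n")
    p
end

# Tables.partitioner streams each CSV in as its own record batch, so the
# whole table never has to be materialized in memory at once.
out = tempname() * ".arrow"
Arrow.write(out, Tables.partitioner(CSV.File(p) for p in csvs))

# Arrow files are memory-mapped on read, so this also works for
# larger-than-RAM data.
tbl = Arrow.Table(out)
println(length(tbl.t))  # 4
```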

7 Likes

My recommendation would be to do as much as possible within the database. For example, I tend to develop most of the cleaning and pre-processing in a PostgreSQL server and only then bring what I need into Julia. Arrow might be useful as an intermediate step as needed.

2 Likes

Ah you’re right.

I might check out self-hosted TimescaleDB.

Using a pre-existing dedicated DB might be better, as others have suggested, but if you still want to do it in Julia: FYI, Transducers.jl provides a set of composable and data-structure/input-agnostic tools for data manipulation on sequential, threaded, GPU, and distributed execution contexts. For example, there is a family of parallelizable transducers such as GroupBy, ReducePartitionBy, SplitBy (WIP), and Consecutive. They are designed to be very generic; e.g., many of these “advanced” transducers are likely to be supported on GPU. Also, the transducer approach fits very well with out-of-core parallel and streaming execution.
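To illustrate the style (a toy pipeline, not tied to the dataset in this thread): the same transducer composition can be folded sequentially or across threads without changing the pipeline definition.

```julia
using Transducers

# One pipeline definition: keep even numbers, square them.
xf = Filter(iseven) |> Map(x -> x^2)

println(foldxl(+, xf, 1:100))  # sequential fold  -> 171700
println(foldxt(+, xf, 1:100))  # threaded fold, same result
```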

7 Likes

I’d definitely recommend using a distributed, event-oriented database. The usual relational DBs (PostgreSQL, MySQL/MariaDB), document-oriented stores (Mongo), and NoSQL solutions (Cassandra) are not well suited for this kind of task. You should instead consider something like ClickHouse, Druid, Pinot, or InfluxDB. Or even just plain old Spark, which was designed for this sort of task and is rather fast. But yes, these are all “Big Data” solutions and they require some time to set up properly.

You can do something in plain Julia of course, but then at least some ideas from these solutions should be borrowed. Data should be partitioned and indexed, i.e. it should be split into relatively small files spread over nodes, and there should be a mechanism to identify where each necessary piece of information is located.

Then you need some sort of central mechanism which can take an incoming request, determine which tasks should be allocated to each node, run the necessary jobs, and afterwards combine the results.

Of course, there are other things to worry about: what to do if one of the workers fails, and how to restart its job? How to balance new incoming data between nodes?

Despite this complexity, if you have a very fixed set of tasks, then all of it can be solved in Julia with a rather short script, I think. Libraries and databases are big and complex because they aim to be universal. But still, you should think ahead: how might your set of tasks evolve? Could the questions you are going to ask this data-processing system turn out to be independent of each other and require additional effort?
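As a sketch of the partitioning idea (the `data/yyyy-mm-dd.arrow` layout is hypothetical): with files partitioned by date, a range “query” starts as nothing more than listing the right paths, which can then be farmed out to workers.

```julia
using Dates

# Hypothetical layout: one file per day under data/.
partition_path(day::Date) = joinpath("data", Dates.format(day, "yyyy-mm-dd") * ".arrow")

# A time-range query only ever touches the partitions in that range.
partitions_between(from::Date, to::Date) = [partition_path(d) for d in from:Day(1):to]

paths = partitions_between(Date(2021, 1, 1), Date(2021, 1, 3))
println(paths)  # three paths, 2021-01-01 through 2021-01-03
```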

And it’s hardly possible to avoid distributed computation in your case: these amounts of data are big, and multiple nodes will always work faster than a single server.

Oh, and if you want to do it manually (or even with something like Spark), you should definitely use Arrow or Parquet as the storage format. They are column-based and have internal compression, so they are a much better choice than CSV.

8 Likes

@sampope Out of interest, what instruments are you logging data from?

1 Like

(I didn’t read the whole post here.) I think OnlineStats.jl (joshday/OnlineStats.jl on GitHub: single-pass algorithms for statistics) is capable of working with endless data, so you could stream it.
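For instance (toy batches standing in for a real stream), a single-pass statistic can be fit incrementally in constant memory:

```julia
using OnlineStats

# Fit a running mean batch by batch; memory use stays constant no matter
# how much data streams through.
m = Mean()
for batch in ([1.0, 2.0, 3.0], [4.0, 5.0])   # stand-in for streamed chunks
    fit!(m, batch)
end
println(value(m))  # 3.0
```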

5 Likes

DuckDB might also be an interesting option as it promises to be a fast analytical database that is easy to set up. There is no dedicated Julia interface yet, though.

1 Like