Struggling with Julia and large datasets

This is the biggest miss in your case, and a source of all the other problems. Who on earth would save such an amount of data in a text format? I would not be surprised if it is just some sort of "debug log" and was never intended as a data format for export. You should definitely use a packed binary format from the very beginning.

Also, you can simply drop the timestamp column if it has a constant sampling rate with no gaps. Replace it with a lazy one, computed from a start time and the sampling rate.
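Something along these lines (the start time, rate, and row count below are made up for illustration):

```julia
using Dates

# Assuming a constant sampling rate with no gaps, the timestamp "column"
# becomes a lazy range: any element is computed on demand, nothing is stored.
tstart = DateTime(2021, 2, 9)          # first sample (hypothetical)
fs     = 250                           # samples per second (hypothetical)
n      = 258_000_000                   # rows in one day (from the thread)

dt = Millisecond(1000 ÷ fs)
timestamps = tstart:dt:(tstart + dt * (n - 1))   # lazy StepRange, O(1) memory

timestamps[1_000_001]                  # i-th timestamp, computed when asked for
```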

6 Likes

As asked above, some information on the instrument or scanner which is producing this data would be interesting.

1 Like

@sampope, at the risk of sounding like an apostate, consider using J, the successor to APL. I do not know if you have the time to do this. I suggest it because J advertises itself as working well with large data (billions of rows). The learning curve will be steep, but if you can gain the expertise and code using J's strengths, your execution time will be a fraction of what it currently is. It also has its own database. Good luck

That is an interesting comparison, I haven’t seen any benchmarks comparing J and Julia. Do you know where I could find some?

2 Likes

Dear @jzr:

Between J and Julia specifically, not that I know of. I see J as being such a different animal from Julia (and any other "conventional" languages) that I do not know if any exist. However, there is a comparison between processing an array via a for-loop and via J. See Vocabulary/ArrayProcessing - J Wiki. The speed-up is mind-boggling. However, see also https://www.slant.co/versus/123/133/~apl_vs_julia for a general comparison.

The J software site is, in general, a rich repository of information on all things J.

1 Like

Well, that speed-up is a comparison between "J as a language where you can write loops" and "J as a vectorised language". A nice feature of Julia is that it does not force either vectorisation or writing loops by hand on you.
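A tiny illustration (toy function, not a benchmark): both versions below compute the same thing and compile to fast native code, so you can pick whichever reads better.

```julia
# Explicit loop: no temporary arrays allocated.
function rms_loop(xs)
    s = 0.0
    @inbounds for x in xs
        s += x * x
    end
    return sqrt(s / length(xs))
end

# "Vectorised" style using a reduction.
rms_vectorised(xs) = sqrt(sum(abs2, xs) / length(xs))

xs = [sin(i) for i in 1:10_000]
rms_loop(xs) ≈ rms_vectorised(xs)   # true
```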

1 Like

Hi,

Is it the case that you start with 100B lines, then need to filter/reduce down to the statistically valid 33B set at some point, having ingested 258M lines over a given day? How do these numbers translate into what you need to be able to query? For how long do you have to keep hold of the 100B? How is Mongo currently helping - do you just push/insert all new datapoints into it?

From (admittedly scanning) the above, it seems you would potentially be better off acting on the data as a stream, transforming and filtering it on one server (Kafka + Airflow would be ideal here to rate-limit your batches) and then passing the reduced set to the other.

If you have a real need for a single data file for execution purposes (does it get incrementally updated?), then, as a few have mentioned above, mmap'ing via Arrow.jl is very powerful. If you know the expected commonality of your data, i.e. anything you can group on, you can create index ranges describing values, store them in an associated metadata file, and use that for your queries. Mmap'ing only loads the data that is actually accessed into memory (which could still be a lot if the metadata is large), so it mitigates a lot of potential issues, and it also lets you write a for-loop over the rows with hardly any memory allocated (apart from any collections you populate in the loop, of course). Note that you can also apply table operations to filter the table and write the result to another Arrow file for more processing downstream (you could model this as a "functional coding" paradigm); a rough sketch follows below.
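A minimal sketch of the mmap-and-filter idea, with made-up file and column names:

```julia
using Arrow, Tables

# Arrow.Table memory-maps the file, so only the parts you actually touch
# get paged into RAM.
tbl = Arrow.Table("readings.arrow")

# Row-wise loop over the mmapped data: hardly anything is allocated.
function total_above(tbl, threshold)
    total = 0.0
    for row in Tables.rows(tbl)
        row.value > threshold && (total += row.value)
    end
    return total
end
total_above(tbl, 100.0)

# Filter and write the reduced set to another Arrow file for downstream steps.
filtered = [(ts = r.ts, value = r.value) for r in Tables.rows(tbl) if r.value > 100.0]
Arrow.write("readings_filtered.arrow", filtered)
```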

You can also easily extend this to distributed processing over the Arrow file, and/or partition it. If Arrow.jl supported Arrow Flight (which would be a game changer, and it is on the roadmap), things would become even easier to keep within a resource limit.

As you can probably tell, I'm a huge fan of Julia + Arrow, as it has allowed me to do some very heavy lifting in the middleware for analytical web dashboards using "pathetic" machine specs compared to the equivalent SQL DB requirement (which aren't even as quick!). One thing to note is that a good SSD will make a world of difference here.

However, as someone else mentioned above, if you can partition (this annoyingly takes a bit of thought, and don't let Spark do it by number of threads…!), then you can create a nicely structured Parquet file that Spark will do a good job of querying.

If you actually need constant updates to the base data, then it can get a bit tricky; something like Spark Delta Lake can be an option, but then you are moving away from a Julia-only stack.

Regards,

@mschauer, as a self-described tyro I have to ask: isn't it preferable, wherever and whenever possible, to write "vectorized" code as opposed to non-"vectorized" code? When would writing non-"vectorized" code be preferable?

Not really. See Functions · The Julia Language

If you have more questions or want more discussion on this, please start a new thread as this point is somewhat orthogonal.

3 Likes

Another question - please describe the storage that this data is being written to and read from.
I have been involved with HPC for donkey's years, and storage is critical. Performant storage makes a huge difference to what Panasas dubbed 'time to solution'.
There are many choices of storage - please try to describe what you use, and please be open to alternatives.

1 Like

Above they said

We have 2 physical servers, each with 64 cores (128 logical with hyperthreading), 1TB of ram, 20TB of NVMe drives in RAID6 setup.

1 Like

I missed that. At first glance, local NVMe drives should be fast.

Putting my foot further in it…
HPC codes usually perform better with hyperthreading switched off.
I am not so sure about streaming data analysis like this - you could be doing compute while the other thread waits for IO.

Please install the 'hwloc' utility and look at the physical layout with the 'lstopo' tool.
It may be worth experimenting with CPU pinning; a sketch of one way to try that from Julia follows below.
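For example, assuming the ThreadPinning.jl package and its pinthreads/threadinfo API:

```julia
using ThreadPinning

pinthreads(:cores)   # pin Julia threads to separate physical cores
threadinfo()         # print the thread-to-core mapping, to cross-check with lstopo
```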

First off, thanks to whomever changed the title.

I’ll try to decode the mystery here without violating NDAs, Privacy policies, clearance, and whatever else the lawyers dream up. I work in a research org focusing on non-standard problem solving, lots of lateral thinking.

The data is a mess. It's from building systems (like keycard access, light management, facilities management, etc.), medical instruments, unknown devices, etc. There are 18 different formats in the raw log files, although they are semi-structured (keyword- or otherwise delimited). Some timestamps are missing years; some are written in a sort of HTTP-like format, "Feb 09@13:34:43.493Z". I volunteered to explore the data as a side project.

Most of the progress so far has been in writing regexes and trying to "normalize" the data. It's then put into CSV for lack of a better format. I also wrote a parser that writes to JSON files, thinking that might be better. I think Arrow, HDF5, or Parquet make a lot of sense as destination formats.
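For example, a year-less timestamp like the one above can be normalized with something along these lines (the regex, helper name, and sample log line are made up for illustration):

```julia
using Dates

const TS_RE  = r"(\w{3}) (\d{2})@(\d{2}):(\d{2}):(\d{2})\.(\d{3})Z"
const MONTHS = Dict("Jan"=>1, "Feb"=>2, "Mar"=>3, "Apr"=>4,  "May"=>5,  "Jun"=>6,
                    "Jul"=>7, "Aug"=>8, "Sep"=>9, "Oct"=>10, "Nov"=>11, "Dec"=>12)

# Pull a "Feb 09@13:34:43.493Z" style timestamp out of a log line and attach
# the year, which has to come from the file's context.
function normalize_timestamp(line::AbstractString, year::Integer)
    m = match(TS_RE, line)
    m === nothing && return nothing
    mon, d, h, mi, s, ms = m.captures
    DateTime(year, MONTHS[mon], parse(Int, d), parse(Int, h),
             parse(Int, mi), parse(Int, s), parse(Int, ms))
end

normalize_timestamp("badge 4711 Feb 09@13:34:43.493Z door=A12", 2021)
# => 2021-02-09T13:34:43.493
```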

Originally, I wrote everything in plain old Ruby because it was fast to prototype the code and get results. Again, it's a side project to my 40+ hour a week normal workload. I started down the path of Crystal, which is basically compiled Ruby, but found it a bit immature at this point. While looking into R, Python, etc., I ran into Julia. I find that after two weeks, I can prototype code almost as fast as I can with Ruby.

The raw data set is around 1 trillion log lines. After throwing out what I don't think is needed, we're down to 100 billion lines. When parsing that down to CSV/JSON formats, it's 33 billion rows or JSON documents. When ordering everything into a giant timeline, we find there are about 258 million log lines per 24-hour period.

Given the sheer performance of Julia, I was trying to find a way to use just Julia to do all of the analysis without having to resort to "enterprise" databases and all their overhead. From what I've learned in this helpful thread, the "Julia way" is to use Arrow/Parquet/HDF5 files instead of CSV/JSON or other text-based storage formats, and to spend my time on more data cleaning and analysis rather than writing my own in-memory database in Julia.
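For example, converting a cleaned CSV into Arrow is essentially a one-liner (file names made up):

```julia
using CSV, Arrow

# Stream a cleaned CSV straight into an Arrow file so later passes can
# memory-map it instead of re-parsing text.
Arrow.write("day_2021-02-09.arrow", CSV.File("day_2021-02-09.csv"))
```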

I hope this thread helps others too. I’ve certainly learned a plethora from it.

13 Likes

Thank you for a very interesting response!

1 Like

I’ve found the same. Using raw cores seems to make unixes faster.

That’s a fairly typical situation in social science :wink:

I would agree; rolling your own database is almost surely a waste of time.

Having worked with messy big data in Julia, I think it is best to accept that you will go back and forth between data cleaning, normalization, and storage alternatives. IMO the best strategy is to just write flexible code, not worry about all the details in the first pass, and put the data processing/cleaning routines in their own (local, private) package, which you unit test aggressively, possibly with mock data. This makes optimizations easier (you worry less about breaking something), and can save a lot of surprises in the long run.

7 Likes

Ah OK, makes more sense now, thanks!

Yes, it's much more about workflow/processing design than about picking the right tool, so to speak. What you describe is more or less the same situation an astroparticle/high-energy physicist deals with, so I guess I am qualified to add yet another two cents to this already broad discussion ;).

We deal with similar amounts of data coming from many different sensors and written to multiple places (disks, tapes, databases…). The cleaning and pre-/post-processing of the data is done in many different stages, and I think this is what's missing in your planning. It feels like you are trying to realise a whole-meal analysis (like putting everything into RAM or into a big DB) instead of thinking in processing pipelines which build upon each other.

Usually, data can be sampled down into chunks (often called "runs", each of which corresponds to the data in a given time interval, including the corresponding fractions from each data source), and these tiny pieces can be used to explore the data and to develop analysis workflows etc., which can then be scaled up to the full dataset. Quite often (at least in my field), you are not even allowed to touch a large portion of the data so as not to introduce bias, so you need to request an "unblinding of the data" after some preliminary results on a tiny subset, but that's another topic.

Long story short: we use a multitude of tools, languages and techniques to do the processing, and combine them with a workflow management system (we picked Nextflow) which lets it scale up easily, no matter whether it runs on a single machine, multiple nodes, a grid of computing centres or the cloud.

I definitely think that Julia is a good choice for the processing part, but as outlined above, I also think you need tailored workflows and processing chains to do the job effectively, and that is hard to answer without knowing what you are actually looking for and what kind of analyses you plan to do (of course, I totally understand the NDA, privacy etc. part).

That being said, I also want to second that Julia and HDF5 are extremely good ingredients to cook this meal ;).

6 Likes

UPDATE: After lots of fits and starts, I settled on Apache Drill against Parquet files on a clustered filesystem. It works really well. I've worked around Parquet.jl issue 108 by converting the DateTime to String in a DataFrame before dumping to the Parquet files; a sketch is below. Julia is doing all of the raw log wrangling into Parquet files. In fact, I'm now writing Julia everywhere instead of Ruby/shell. (I have a hammer, therefore everything is going to be a nail…) Apparently, based on sheer quantity of visits, my new favorite site is regexr, second only to the Julia docs.
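For reference, the workaround looks roughly like this (column names are made up, and I'm assuming Parquet.jl's write_parquet):

```julia
using DataFrames, Dates, Parquet

# Stringify the DateTime column before writing, to sidestep Parquet.jl issue 108.
df = DataFrame(ts = [DateTime(2021, 2, 9, 13, 34, 43, 493)], value = [1.0])
df.ts = string.(df.ts)                     # DateTime -> String
write_parquet("events.parquet", df)
```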

The “voluntold” proof of concept is now moving into incubator status where I’ll get real help from better coders and data scientists to turn it into a fully supported project. And, it turns out, you can hire Julia-specialized consultants/contractors to help think through solutions and write code.

Overall, Julia is great to work with. We signed up for a Julia Computing support subscription and are generally happy all around with the Julia ecosystem.

I’ve seen things only you people would believe. julia running on 512 cores across 20 servers. I watched GPUs glitter in the dark near the PCIe Gate…

with apologies to Rutger Hauer

12 Likes

Hi Sam,

Apache Spark might be more performant for your workload size than Apache Drill. With the SparkSQL.jl Julia package you can use Julia and Spark together. Aside from performance, the SparkSQL.jl package can convert CSV files to parquet without the work-around you describe for date & timestamp values.

The SparkSQL.jl Tutorial:

1 Like