Can Julia efficiently make use of 20+ cores for transforming hundreds of millions of rows for machine learning?

I am not a user of DataFrames, I just read the manual for the first time. Perhaps Google led me to a deprecated version of the manual.

Concerning the view, of course. But those mean different things: one copies, the other doesn't. Right?

Right. But for very tall data frames (and this was the original question), copying is not only expensive but also uses up a lot of RAM (the main point is that in DataFrames.jl creating views is an option - in many alternative ecosystems it is not possible).

The current version of the manual is Introduction · DataFrames.jl. What we discussed about getting a column from a data frame is described here: Getting Started · DataFrames.jl.
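For anyone skimming the thread, here is a minimal sketch of the copy/view distinction being discussed (the column names are made up):

```julia
using DataFrames

df = DataFrame(a = 1:5, b = rand(5))

col_copy = df[:, :a]        # indexing with `:` copies the column
col_ref  = df[!, :a]        # `!` returns the stored vector, no copy
col_view = @view df[:, :a]  # a SubArray view, also no copy

col_ref === df.a            # true: `df.a` is the non-copying accessor too
```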


The best advice might be to write functions that act only on vectors and then use the transform! infrastructure to apply those functions to data frames. This way you can focus all your attention on optimizing those functions and not have to think too much about the data-wrangling aspects related to data frames.
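For example (the column names and the standardize function are purely illustrative), something along these lines:

```julia
using DataFrames, Statistics

# A plain vector function: easy to test and optimize in isolation
standardize(x) = (x .- mean(x)) ./ std(x)

df = DataFrame(price = 100 .* rand(10), qty = rand(1:5, 10))

# Let transform! handle the data-frame plumbing
transform!(df, :price => standardize => :price_z)
```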


Hi @jstrube, I haven’t tried Julia’s distributed processing on Cori. MPI is heavily tuned to the machine, network, and batch system - so the advice is to always use it. I have tried multi-threading. That worked well for the computational steps. Reading HDF5 files with multiple threads works, but you get no concurrency (to be thread-safe, the library only allows one thread to read at a time - the HDF5 folks have just started to think seriously about multi-threaded concurrent I/O). To get concurrent reads on the same node, you must use a multi-process approach. MPI makes that easy and is built into the batch system. I think Julia’s distributed system would work, but it is not as convenient and is likely slower since it’s not tuned to the peculiar high-speed network on that machine. My impression is that the speed of processing the data in memory with multi-threads or with multi-processes via MPI should be equivalent except for overhead. I plan to test this.
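To make the multi-process reading pattern concrete, here is a rough sketch (the file name, dataset name, and the 1-D layout are assumptions) of each MPI rank independently opening the file and reading only its own slice:

```julia
using MPI, HDF5

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

# Each rank opens the file read-only on its own and reads only its slice,
# sidestepping the one-thread-at-a-time lock inside the HDF5 library.
h5open("data.h5", "r") do f
    dset  = f["rows"]             # assumed 1-D dataset
    n     = length(dset)
    chunk = cld(n, nproc)
    lo    = rank * chunk + 1
    hi    = min(n, (rank + 1) * chunk)
    local_rows = dset[lo:hi]      # only this range touches disk
    # ... compute on local_rows ...
end

MPI.Finalize()
```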

@lyonsquark Please keep us all in touch with what you find on Cori.
Your remarks on MPI using the interconnect effectively are very interesting. Cori uses a Cray Aries network. Fantastic to hear that Julia can be run over this.

@Kevin_Shen Welcome to Julia! I am going to say something - please do not take it as rude. I have a big smile!
Maybe you should just suck it and see (*): there has been a good discussion here, so perhaps just try working on those 20-core machines? If you have problems, there are experts here.

(*) My father used to say this. It is not rude at all!

Hi @johnh, Thanks - but to be clear, MPI.jl calls the C functions in the system MPI library. It’s not “MPI in pure Julia”. That being said, for certain MPI operations, like reduce, broadcast, gather, etc., you can use Julia structures. This is tricky, though: some functions require isbitstype and others are more lenient. But when it works, it works well!
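As a rough illustration of what I mean (the Stats type and its fields are made up, and my understanding is that this relies on the struct being isbits), reducing a custom Julia struct can look something like:

```julia
using MPI

# Made-up isbits struct with a Julia `+` so Allreduce can combine it
struct Stats
    n::Int
    total::Float64
end
Base.:+(a::Stats, b::Stats) = Stats(a.n + b.n, a.total + b.total)

MPI.Init()
comm = MPI.COMM_WORLD

v = rand(1_000)
local_stats  = Stats(length(v), sum(v))
global_stats = MPI.Allreduce(local_stats, +, comm)  # per-rank partials combined

MPI.Comm_rank(comm) == 0 && println(global_stats)

MPI.Finalize()
```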

Very interesting. Maybe I’ll try replacing the distributed part with MPI and leave the threading…
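If I do, I imagine the hybrid pattern would look roughly like this (purely a sketch; the per-thread work is a placeholder):

```julia
using MPI

MPI.Init()   # a real hybrid run may want to request an explicit thread level
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Each MPI rank threads over its own chunk of work (placeholder workload)
partials = zeros(Threads.nthreads())
Threads.@threads for t in 1:Threads.nthreads()
    partials[t] = sum(rand(100_000))
end

local_total = sum(partials)
grand_total = MPI.Allreduce(local_total, +, comm)
rank == 0 && println(grand_total)

MPI.Finalize()
```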