Can Julia efficiently make use of 20+ cores for transforming hundreds of millions of rows for machine learning?

I am not a user of DataFrames, I just read the manual for the first time. Perhaps Google led me to a deprecated version of the manual.

Concerning the view, of course. But those mean different things: one copies, the other doesn't. Right?

Right. But for very tall data frames (and this was the original question), copying is not only expensive but also uses up a lot of RAM (the main point is that in DataFrames.jl creating views is an option - in many alternative ecosystems it is not possible).

The current version of the manual is Introduction · DataFrames.jl. What we discussed about getting a column from a data frame is described here: Getting Started · DataFrames.jl.
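For anyone skimming the thread, here is a minimal sketch of the copy/view distinction being discussed (the column names are made up):

```julia
using DataFrames

df = DataFrame(a = 1:5, b = rand(5))

col_copy = df[:, :a]        # indexing with `:` copies the column
col_ref  = df[!, :a]        # `!` returns the stored vector, no copy
col_view = @view df[:, :a]  # a SubArray view, also no copy

col_ref === df.a            # true: `df.a` is the non-copying accessor too
```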


The best advice might be to write functions that act only on vectors and then use the transform! infrastructure to apply those functions to data frames. This way you can focus all your attention on optimizing those functions and not have to think too much about the data-wrangling aspects related to data frames.
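For example (the column names and the standardize function are purely illustrative), something along these lines:

```julia
using DataFrames, Statistics

# A plain vector function: easy to test and optimize in isolation
standardize(x) = (x .- mean(x)) ./ std(x)

df = DataFrame(price = 100 .* rand(10), qty = rand(1:5, 10))

# Let transform! handle the data-frame plumbing
transform!(df, :price => standardize => :price_z)
```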


Hi @jstrube, I haven’t tried Julia’s distributed processing on Cori. MPI is heavily tuned to the machine, network, and batch system - so the advice is to always use it. I have tried multi-threading. That worked well for the computational steps. Reading HDF5 files with multiple threads works, but you get no concurrency (to be thread-safe, the library only allows one thread to read at a time - the HDF5 folks have just started to think seriously about multi-threaded concurrent I/O). To get concurrent reads on the same node, you must use a multi-process approach. MPI makes that easy and is built into the batch system. I think Julia’s distributed system would work, but it is not as convenient and is likely slower since it’s not tuned to the peculiar high-speed network on that machine. My impression is that the speed of processing the data in memory with multi-threads or with multi-processes via MPI should be equivalent except for overhead. I plan to test this.
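To make the multi-process reading pattern concrete, here is a rough sketch (the file name, dataset name, and the 1-D layout are assumptions) of each MPI rank independently opening the file and reading only its own slice:

```julia
using MPI, HDF5

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

# Each rank opens the file read-only on its own and reads only its slice,
# sidestepping the one-thread-at-a-time lock inside the HDF5 library.
h5open("data.h5", "r") do f
    dset  = f["rows"]             # assumed 1-D dataset
    n     = length(dset)
    chunk = cld(n, nproc)
    lo    = rank * chunk + 1
    hi    = min(n, (rank + 1) * chunk)
    local_rows = dset[lo:hi]      # only this range touches disk
    # ... compute on local_rows ...
end

MPI.Finalize()
```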

@lyonsquark Please keep us all in touch with what you find on Cori.
Your remarks on MPI using the interconnect effectively are very interesting. Cori uses a Cray Aries network. Fantastic to hear that Julia can be run over this.

@Kevin_Shen Welcome to Julia! I am going to say something - please do not take it as rude. I have a big smile!
Maybe you should just suck it and see (*): there has been a good discussion here, so perhaps just try working on those 20-core machines? If you have problems, there are experts here.

(*) My father used to say this. It is not rude at all!

Hi @johnh, Thanks - but to be clear, MPI.jl calls the C functions in the system MPI library. It’s not “MPI in pure Julia”. That being said, for certain MPI operations, like reduce, broadcast, gather, etc., you can use Julia structures. This is tricky, though: some functions require isbitstype and others are more lenient. But when it works, it works well!
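As a rough illustration of what I mean (the Stats type and its fields are made up, and my understanding is that this relies on the struct being isbits), reducing a custom Julia struct can look something like:

```julia
using MPI

# Made-up isbits struct with a Julia `+` so Allreduce can combine it
struct Stats
    n::Int
    total::Float64
end
Base.:+(a::Stats, b::Stats) = Stats(a.n + b.n, a.total + b.total)

MPI.Init()
comm = MPI.COMM_WORLD

v = rand(1_000)
local_stats  = Stats(length(v), sum(v))
global_stats = MPI.Allreduce(local_stats, +, comm)  # per-rank partials combined

MPI.Comm_rank(comm) == 0 && println(global_stats)

MPI.Finalize()
```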

Very interesting. Maybe I’ll try replacing the distributed part with MPI and leave the threading…
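If I do, I imagine the hybrid pattern would look roughly like this (purely a sketch; the per-thread work is a placeholder):

```julia
using MPI

MPI.Init()   # a real hybrid run may want to request an explicit thread level
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Each MPI rank threads over its own chunk of work (placeholder workload)
partials = zeros(Threads.nthreads())
Threads.@threads for t in 1:Threads.nthreads()
    partials[t] = sum(rand(100_000))
end

local_total = sum(partials)
grand_total = MPI.Allreduce(local_total, +, comm)
rank == 0 && println(grand_total)

MPI.Finalize()
```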