I’m still learning Julia and am converting some SAS code. I have a large DataFrame (3.3M rows, 6 columns) that I am trying to sort. The performance compared to SAS is abysmal.
simstates2 = simstates |> Query.@orderby(_.simulation) |> @thenby(_.date) |> DataFrame
memory estimate: 26.65 GiB
allocs estimate: 572587197
minimum time: 73.236 s (11.40% GC)
median time: 73.236 s (11.40% GC)
mean time: 73.236 s (11.40% GC)
maximum time: 73.236 s (11.40% GC)
The simulation column is an Int64.
the date column is a string. If I convert that to a Date the performance is even worse
memory estimate: 46.40 GiB
allocs estimate: 927642850
minimum time: 95.504 s (16.44% GC)
median time: 95.504 s (16.44% GC)
mean time: 95.504 s (16.44% GC)
maximum time: 95.504 s (16.44% GC)
In SAS the corresponding sort takes about a second:
996 proc sort data=simstates out=simstates2;
997 by simulation date;
NOTE: There were 3292000 observations read from the data set WORK.SIMSTATES.
NOTE: The data set WORK.SIMSTATES2 has 3292000 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.99 seconds
cpu time 2.26 seconds
The SAS sort is multi-threaded, as you can see from the difference in Real vs CPU times. The SAS sort is unaffected by the date column being a string vs date (in SAS dates are stored as a Double).
It is worth noting that SAS does not store data sets in memory, they are picked up and written to disk as needed. The OS probably has cached the input, but the output (about 151MB) is written back to the disk at the end of the step.
I’m working to make the case that we start using more Julia vs our current SAS and R usage. How can I get these sort times down?
memory estimate: 626.88 MiB
allocs estimate: 1031565
minimum time: 2.857 s (3.83% GC)
median time: 2.933 s (3.68% GC)
mean time: 2.933 s (3.68% GC)
maximum time: 3.008 s (3.53% GC)
Still slower than SAS by a factor of 3x, but workable.
That makes me think that I should never use the @orderby from Query.jl during joins and sort later. We have a lot of SAS code which uses the SAS SQL parser. Often we use that to order our joins and aggregations in a single step – reducing the overhead of writing and reading. The PROC SQL sort is just as fast as PROC SORT.
Is there an out of the box way to multi-thread this sort? I expect on a DataFrame this size it would help a lot.
Probably billions in R&D over the years. I worked there for nearly 10 years as a consultant and then as a product manager.
The thing SAS got right very early on was the ease at which you can very efficiently manipulate data. The Data Step and then SQL allow them to go through large data sets in times that are unmatched. As a consultant, 80+% of the work I do is data manipulation. 15% is client hand holding, and the remaining 5% are actual analytics.
Thanks again for the pointer. I appreciate the help.
Hi, another ex-SAS here, btw
I guess you’ll see performance gains once you’ll be comparing longer programs with more and more data manipulation steps. The overhead of SAS always writing data to disk and re-reading it will show up.
I’ve used Julia for let’s say a “mid-sized” reporting task/solution, lot’s of aggregations, filtering, pivoting data etc, and I’m happy with Julia/DataFrames perfomance.
I have no SAS version of the same to compare, but used it for decades and would guess SAS version would be at least few times slower overall.
I have a former client that has ripped out a multi-million dollar SAS install and replaced it with Julia. Performance timings are phenomenal. Their lead on the IT side actually published a book on Julia based on their development efforts.
That book is, unfortunately, stuck in my closed office from the quarantine.
This is the perfect opportunity to get some camo face paint and rappeling gear, infiltrate the site via a HALO jump landing on the roof, break in and Get The Book (which will also be the title of the movie — I have checked on IMDB and it appears to be unused; yet).
(If @tk3369 writes a sequel, there will be one for the movie, too, titled Get The Book 2: The Multiple Dispatch.)