I'm still learning Julia and am converting some SAS code. I have a large DataFrame (3.3M rows, 6 columns) that I am trying to sort. The performance compared to SAS is abysmal.
using Query, DataFrames

function sort1(simstates)
    # sort by simulation, then by date, materializing a new DataFrame
    simstates |> @orderby(_.simulation) |> @thenby(_.date) |> DataFrame
end
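For reference, the timing below was gathered with BenchmarkTools, roughly like this (the exact invocation is assumed):

using BenchmarkTools
@benchmark sort1($simstates)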
BenchmarkTools.Trial:
memory estimate: 26.65 GiB
allocs estimate: 572587197
--------------
minimum time: 73.236 s (11.40% GC)
median time: 73.236 s (11.40% GC)
mean time: 73.236 s (11.40% GC)
maximum time: 73.236 s (11.40% GC)
--------------
samples: 1
evals/sample: 1
The simulation column is an Int64 and the date column is a string. If I convert the date column to a Date, the performance is even worse.
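The conversion was along these lines (a sketch; the dateformat is an assumption, not necessarily what the real data uses):

using Dates
simstates.date = Date.(simstates.date, dateformat"yyyy-mm-dd")

With the Date column, the same benchmark gives: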
BenchmarkTools.Trial:
memory estimate: 46.40 GiB
allocs estimate: 927642850
--------------
minimum time: 95.504 s (16.44% GC)
median time: 95.504 s (16.44% GC)
mean time: 95.504 s (16.44% GC)
maximum time: 95.504 s (16.44% GC)
--------------
samples: 1
evals/sample: 1
In SAS the corresponding sort takes about a second:
996 proc sort data=simstates out=simstates2;
997 by simulation date;
998 run;
NOTE: There were 3292000 observations read from the data set WORK.SIMSTATES.
NOTE: The data set WORK.SIMSTATES2 has 3292000 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.99 seconds
cpu time 2.26 seconds
The SAS sort is multi-threaded, as you can see from the difference between real and CPU time. The SAS sort is unaffected by the date column being a string vs. a date (in SAS, dates are stored as doubles).
It is worth noting that SAS does not hold data sets in memory; they are read from and written to disk as needed. The OS has probably cached the input, but the output (about 151 MB) is written back to disk at the end of the step.
I'm working to make the case that we start using more Julia versus our current SAS and R usage. How can I get these sort times down?
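Update: switching from the Query.jl pipeline to the plain DataFrames sort, along these lines (the exact call is assumed, using the column names from the code above):

simstates2 = sort(simstates, [:simulation, :date])

That gives: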
BenchmarkTools.Trial:
memory estimate: 626.88 MiB
allocs estimate: 1031565
--------------
minimum time: 2.857 s (3.83% GC)
median time: 2.933 s (3.68% GC)
mean time: 2.933 s (3.68% GC)
maximum time: 3.008 s (3.53% GC)
--------------
samples: 2
evals/sample: 1
Still slower than SAS by a factor of about 3, but workable.
That makes me think I should never use @orderby from Query.jl during joins, and should instead sort afterwards. We have a lot of SAS code that uses the SAS SQL parser. We often use it to order our joins and aggregations in a single step, reducing the overhead of writing and reading intermediate data sets. The PROC SQL sort is just as fast as PROC SORT.
Is there an out-of-the-box way to multi-thread this sort? I expect that on a DataFrame of this size it would help a lot.
If you do not need the original DataFrame anymore,
sort!(df, cols)
should be a bit faster because it does not need to allocate new memory for the output. But I do not know if the difference is significant in your case.
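Applied to the example above, that would be (column names assumed from the earlier code):

sort!(simstates, [:simulation, :date])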
Probably billions in R&D over the years. I worked there for nearly 10 years, as a consultant and then as a product manager.
The thing SAS got right very early on was the ease with which you can efficiently manipulate data. The DATA step, and later SQL, let you go through large data sets in times that are unmatched. As a consultant, 80+% of the work I do is data manipulation, 15% is client hand-holding, and the remaining 5% is actual analytics.
Thanks again for the pointer. I appreciate the help.
Hi, another ex-SAS here, btw
I guess you'll see performance gains once you're comparing longer programs with more and more data manipulation steps. The overhead of SAS always writing data to disk and re-reading it will show up.
I've used Julia for, let's say, a "mid-sized" reporting task/solution: lots of aggregations, filtering, pivoting data, etc., and I'm happy with Julia/DataFrames performance.
I have no SAS version of the same to compare against, but I used SAS for decades and would guess the SAS version would be at least a few times slower overall.
I have a former client that has ripped out a multi-million dollar SAS install and replaced it with Julia. Performance timings are phenomenal. Their lead on the IT side actually published a book on Julia based on their development efforts.
That book is, unfortunately, stuck in my closed office because of the quarantine.
@Dominic_Pazzula Another idea is to roll your own parallel merge sort. I'm not sure if or when it will land in Base, but there's a sample implementation:
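The core idea looks roughly like this (a minimal sketch, not the implementation linked above; the cutoff value and recursion strategy are arbitrary choices):

using Base.Threads

# Sort v in place with a threaded merge sort. Halves shorter than
# cutoff fall back to the serial built-in sort!.
function psort!(v::AbstractVector; cutoff::Int = 100_000)
    if length(v) <= cutoff
        sort!(v)
        return v
    end
    mid = length(v) ÷ 2
    left = view(v, 1:mid)
    right = view(v, mid+1:length(v))
    task = Threads.@spawn psort!(left; cutoff = cutoff)  # sort the left half on another thread
    psort!(right; cutoff = cutoff)                       # sort the right half on this thread
    wait(task)
    # Merge the two sorted halves through a temporary buffer.
    buf = similar(v)
    i, j, k = 1, 1, 1
    while i <= length(left) && j <= length(right)
        if left[i] <= right[j]
            buf[k] = left[i]; i += 1
        else
            buf[k] = right[j]; j += 1
        end
        k += 1
    end
    while i <= length(left);  buf[k] = left[i];  i += 1; k += 1; end
    while j <= length(right); buf[k] = right[j]; j += 1; k += 1; end
    copyto!(v, buf)
    return v
end

A DataFrame sort would still need a permutation rather than an in-place column sort, but the same divide, spawn, and merge pattern applies to sortperm as well.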
This is the perfect opportunity to get some camo face paint and rappelling gear, infiltrate the site via a HALO jump landing on the roof, break in, and Get The Book (which will also be the title of the movie; I have checked on IMDb and it appears to be unused, yet).
(If @tk3369 writes a sequel, there will be one for the movie, too, titled Get The Book 2: The Multiple Dispatch.)
sortperm is known to be slower than it could be, and sort for DataFrames needs to use it, so there's room for improvement. Using a radix sort would probably help too.
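For context, sorting a DataFrame by several columns boils down to computing a row permutation with sortperm and then indexing, roughly like this (a simplified sketch, not the actual DataFrames internals):

using DataFrames

p = sortperm(collect(zip(simstates.simulation, simstates.date)))  # permutation ordering rows by (simulation, date)
simstates2 = simstates[p, :]                                      # materialize the sorted copy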
Another ex-SAS user here. I only today noticed the surprisingly slow sorting of DataFrames in Julia (compared to SAS PROC SORT). I will try the tips above to see if they help in my example, with only 300k rows and sorting times of >10 seconds.
EDIT: It turns out that my DataFrame sort was so slow because of unintentional use of the Decimal type instead of the usual Float64. With the latter, the sort time went from 10 seconds to something negligible, 300 microseconds or so.
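In case it helps anyone else, a quick way to spot this kind of type problem (df and the :price column are made-up names, and I am assuming the decimal type in question converts to Float64):

using DataFrames

eltype.(eachcol(df))           # inspect the element type of every column
df.price = Float64.(df.price)  # convert an offending column to Float64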