DataFrame sort Performance using Query.jl vs SAS PROC SORT

I’m still learning Julia and am converting some SAS code. I have a large DataFrame (3.3M rows, 6 columns) that I am trying to sort. The performance compared to SAS is abysmal.

function sort1(simstates)
     simstates2 = simstates |> Query.@orderby(_.simulation)  |> @thenby(_.date) |> DataFrame
end
BenchmarkTools.Trial:
  memory estimate:  26.65 GiB
  allocs estimate:  572587197
  --------------
  minimum time:     73.236 s (11.40% GC)
  median time:      73.236 s (11.40% GC)
  mean time:        73.236 s (11.40% GC)
  maximum time:     73.236 s (11.40% GC)
  --------------
  samples:          1
  evals/sample:     1

The simulation column is an Int64.

the date column is a string. If I convert that to a Date the performance is even worse

BenchmarkTools.Trial:
  memory estimate:  46.40 GiB
  allocs estimate:  927642850
  --------------
  minimum time:     95.504 s (16.44% GC)
  median time:      95.504 s (16.44% GC)
  mean time:        95.504 s (16.44% GC)
  maximum time:     95.504 s (16.44% GC)
  --------------
  samples:          1
  evals/sample:     1

In SAS the corresponding sort takes about a second:

996  proc sort data=simstates out=simstates2;
997      by simulation date;
998  run;

NOTE: There were 3292000 observations read from the data set WORK.SIMSTATES.
NOTE: The data set WORK.SIMSTATES2 has 3292000 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.99 seconds
      cpu time            2.26 seconds

The SAS sort is multi-threaded, as you can see from the difference in Real vs CPU times. The SAS sort is unaffected by the date column being a string vs date (in SAS dates are stored as a Double).

It is worth noting that SAS does not store data sets in memory, they are picked up and written to disk as needed. The OS probably has cached the input, but the output (about 151MB) is written back to the disk at the end of the step.

I’m working to make the case that we start using more Julia vs our current SAS and R usage. How can I get these sort times down?

Can you try using DataFrames? rather than Query.

sort(df, ["simulation", "date"])
1 Like

Thanks!

Much better:

BenchmarkTools.Trial:
  memory estimate:  626.88 MiB
  allocs estimate:  1031565
  --------------
  minimum time:     2.857 s (3.83% GC)
  median time:      2.933 s (3.68% GC)
  mean time:        2.933 s (3.68% GC)
  maximum time:     3.008 s (3.53% GC)
  --------------
  samples:          2
  evals/sample:     1

Still slower than SAS by a factor of 3x, but workable.

That makes me think that I should never use the @orderby from Query.jl during joins and sort later. We have a lot of SAS code which uses the SAS SQL parser. Often we use that to order our joins and aggregations in a single step – reducing the overhead of writing and reading. The PROC SQL sort is just as fast as PROC SORT.

Is there an out of the box way to multi-thread this sort? I expect on a DataFrame this size it would help a lot.

If you do not need the original DataFrame anymore,

sort!(df, cols)

should be a bit faster because it does not need to allocate new memory for the output. But I do not know if the difference is significant in your case.

1 Like

Tricky doing benchmarks since once the DataFrame is sorted, all subsequent sorts will be much much faster. Using @time

julia> @time sort!(simstates, [:simulation, :date])
  5.096369 seconds (1.03 M allocations: 501.298 MiB, 3.68% gc time)

Definitely less memory use.

I’m glad the DataFrames times are reasonable.

SAS has literally had millions of dollars in development over the past 20+ years, so it will take a while to get feature parity for sure.

There are efforts underway to make some of these operations multi-threaded, but nothing has been merged yet so they are still a little ways away.

Probably billions in R&D over the years. I worked there for nearly 10 years as a consultant and then as a product manager.

The thing SAS got right very early on was the ease at which you can very efficiently manipulate data. The Data Step and then SQL allow them to go through large data sets in times that are unmatched. As a consultant, 80+% of the work I do is data manipulation. 15% is client hand holding, and the remaining 5% are actual analytics.

Thanks again for the pointer. I appreciate the help.

2 Likes

Hi, another ex-SAS here, btw :slight_smile:
I guess you’ll see performance gains once you’ll be comparing longer programs with more and more data manipulation steps. The overhead of SAS always writing data to disk and re-reading it will show up.
I’ve used Julia for let’s say a β€œmid-sized” reporting task/solution, lot’s of aggregations, filtering, pivoting data etc, and I’m happy with Julia/DataFrames perfomance.
I have no SAS version of the same to compare, but used it for decades and would guess SAS version would be at least few times slower overall.

I have a former client that has ripped out a multi-million dollar SAS install and replaced it with Julia. Performance timings are phenomenal. Their lead on the IT side actually published a book on Julia based on their development efforts.

That book is, unfortunately, stuck in my closed office from the quarantine.

3 Likes

Oh, I’m reading that book right now and using his SASLib.jl package :smiley:
Thanks @tk3369 !

2 Likes

Hi @Dominic_Pazzula :wave:

You can try different sorting algorithms. Both QuickSort and MergeSort look fairly promising.

julia> daterange = Date(1980,1,1):Day(1):Date(2020,12,31)
1980-01-01:1 day:2020-12-31

julia> makedf(N) = DataFrame(simulation=rand(Int,N), date=Dates.format.(rand(daterange,N), "yyyy-mm-dd"))
makedf (generic function with 1 method)

julia> bigdf = makedf(3_300_000);

julia> size(bigdf)
(3300000, 2)

julia> describe(bigdf)
2Γ—8 DataFrame
β”‚ Row β”‚ variable   β”‚ mean       β”‚ min                  β”‚ median     β”‚ max                 β”‚ nunique β”‚ nmissing β”‚ eltype   β”‚
β”‚     β”‚ Symbol     β”‚ Union…     β”‚ Any                  β”‚ Union…     β”‚ Any                 β”‚ Union…  β”‚ Nothing  β”‚ DataType β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ simulation β”‚ 5.26081e11 β”‚ -9223370617627936065 β”‚ 4.23282e15 β”‚ 9223365035282778695 β”‚         β”‚          β”‚ Int64    β”‚
β”‚ 2   β”‚ date       β”‚            β”‚ 1980-01-01           β”‚            β”‚ 2020-12-31          β”‚ 14976   β”‚          β”‚ String   β”‚

julia> @btime sort($bigdf, ["simulation", "date"]);
  1.820 s (260180 allocations: 307.08 MiB)

julia> @btime sort($bigdf, ["simulation", "date"]; alg=QuickSort);
  1.391 s (52 allocations: 75.53 MiB)

julia> @btime sort($bigdf, ["simulation", "date"]; alg=MergeSort);
  1.484 s (54 allocations: 88.12 MiB)

The mutating sort! does seem to perform worse…

julia> @time sort(bigdf, ["simulation", "date"]; alg = QuickSort);
  1.505489 seconds (52 allocations: 75.534 MiB, 3.87% gc time)

julia> @time sort!(bigdf, ["simulation", "date"]; alg = QuickSort);
  1.731357 seconds (93.02 k allocations: 55.088 MiB)
1 Like

Hey Tom! Was just talking about you (see above). I’ll give them a shot. Thanks!

Hey I bought that book a couple weeks ago… seems pretty good. I have the Kindle version so no covid :wink:

@Dominic_Pazzula Another idea is to roll your own parallel merge sort. I’m not sure if and when it will land on Base but there’s a sample implementation:

Also, I have found much better performance with Date rather than String objects. Perhaps try again with sorting algorithm specified.

Also check out this amazing package by @tkf

1 Like

I actually did this in Go as a POC a while back. Passed a table from SAS to Go, sorted it, passed it back and wrote the result faster than PROC SORT.

I will definitely check out the ThreadX package.

This is the perfect opportunity to get some camo face paint and rappeling gear, infiltrate the site via a HALO jump landing on the roof, break in and Get The Book (which will also be the title of the movie β€” I have checked on IMDB and it appears to be unused; yet). :wink:

(If @tk3369 writes a sequel, there will be one for the movie, too, titled Get The Book 2: The Multiple Dispatch.)

3 Likes

sortperm is known to be slower than it could, and sort for DataFrames needs to use that, so there’s room for improvement. Using radix sort would probably help too.

2 Likes