DataFrame sort Performance using Query.jl vs SAS PROC SORT

Dominic_Pazzula · June 11, 2020, 7:45pm

I’m still learning Julia and am converting some SAS code. I have a large DataFrame (3.3M rows, 6 columns) that I am trying to sort. The performance compared to SAS is abysmal.

function sort1(simstates)
     simstates2 = simstates |> Query.@orderby(_.simulation)  |> @thenby(_.date) |> DataFrame
end

BenchmarkTools.Trial:
  memory estimate:  26.65 GiB
  allocs estimate:  572587197
  --------------
  minimum time:     73.236 s (11.40% GC)
  median time:      73.236 s (11.40% GC)
  mean time:        73.236 s (11.40% GC)
  maximum time:     73.236 s (11.40% GC)
  --------------
  samples:          1
  evals/sample:     1

The simulation column is an Int64.

the date column is a string. If I convert that to a Date the performance is even worse

BenchmarkTools.Trial:
  memory estimate:  46.40 GiB
  allocs estimate:  927642850
  --------------
  minimum time:     95.504 s (16.44% GC)
  median time:      95.504 s (16.44% GC)
  mean time:        95.504 s (16.44% GC)
  maximum time:     95.504 s (16.44% GC)
  --------------
  samples:          1
  evals/sample:     1

In SAS the corresponding sort takes about a second:

996  proc sort data=simstates out=simstates2;
997      by simulation date;
998  run;

NOTE: There were 3292000 observations read from the data set WORK.SIMSTATES.
NOTE: The data set WORK.SIMSTATES2 has 3292000 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.99 seconds
      cpu time            2.26 seconds

The SAS sort is multi-threaded, as you can see from the difference in Real vs CPU times. The SAS sort is unaffected by the date column being a string vs date (in SAS dates are stored as a Double).

It is worth noting that SAS does not store data sets in memory, they are picked up and written to disk as needed. The OS probably has cached the input, but the output (about 151MB) is written back to the disk at the end of the step.

I’m working to make the case that we start using more Julia vs our current SAS and R usage. How can I get these sort times down?

pdeffebach · June 11, 2020, 7:46pm

Can you try using DataFrames? rather than Query.

sort(df, ["simulation", "date"])

Dominic_Pazzula · June 11, 2020, 8:00pm

Thanks!

Much better:

BenchmarkTools.Trial:
  memory estimate:  626.88 MiB
  allocs estimate:  1031565
  --------------
  minimum time:     2.857 s (3.83% GC)
  median time:      2.933 s (3.68% GC)
  mean time:        2.933 s (3.68% GC)
  maximum time:     3.008 s (3.53% GC)
  --------------
  samples:          2
  evals/sample:     1

Still slower than SAS by a factor of 3x, but workable.

That makes me think that I should never use the @orderby from Query.jl during joins and sort later. We have a lot of SAS code which uses the SAS SQL parser. Often we use that to order our joins and aggregations in a single step – reducing the overhead of writing and reading. The PROC SQL sort is just as fast as PROC SORT.

Is there an out of the box way to multi-thread this sort? I expect on a DataFrame this size it would help a lot.

lungben · June 11, 2020, 8:07pm

If you do not need the original DataFrame anymore,

sort!(df, cols)

should be a bit faster because it does not need to allocate new memory for the output. But I do not know if the difference is significant in your case.

Dominic_Pazzula · June 11, 2020, 8:13pm

Tricky doing benchmarks since once the DataFrame is sorted, all subsequent sorts will be much much faster. Using @time

julia> @time sort!(simstates, [:simulation, :date])
  5.096369 seconds (1.03 M allocations: 501.298 MiB, 3.68% gc time)

Definitely less memory use.

pdeffebach · June 11, 2020, 8:23pm

I’m glad the DataFrames times are reasonable.

SAS has literally had millions of dollars in development over the past 20+ years, so it will take a while to get feature parity for sure.

There are efforts underway to make some of these operations multi-threaded, but nothing has been merged yet so they are still a little ways away.

Dominic_Pazzula · June 11, 2020, 8:29pm

Probably billions in R&D over the years. I worked there for nearly 10 years as a consultant and then as a product manager.

The thing SAS got right very early on was the ease at which you can very efficiently manipulate data. The Data Step and then SQL allow them to go through large data sets in times that are unmatched. As a consultant, 80+% of the work I do is data manipulation. 15% is client hand holding, and the remaining 5% are actual analytics.

Thanks again for the pointer. I appreciate the help.

rvasil · June 11, 2020, 8:51pm

Hi, another ex-SAS here, btw
I guess you’ll see performance gains once you’ll be comparing longer programs with more and more data manipulation steps. The overhead of SAS always writing data to disk and re-reading it will show up.
I’ve used Julia for let’s say a “mid-sized” reporting task/solution, lot’s of aggregations, filtering, pivoting data etc, and I’m happy with Julia/DataFrames perfomance.
I have no SAS version of the same to compare, but used it for decades and would guess SAS version would be at least few times slower overall.

Dominic_Pazzula · June 11, 2020, 8:56pm

I have a former client that has ripped out a multi-million dollar SAS install and replaced it with Julia. Performance timings are phenomenal. Their lead on the IT side actually published a book on Julia based on their development efforts.

That book is, unfortunately, stuck in my closed office from the quarantine.

rvasil · June 11, 2020, 9:32pm

Oh, I’m reading that book right now and using his SASLib.jl package
Thanks @tk3369 !

tk3369 · June 12, 2020, 12:22am

Hi @Dominic_Pazzula

You can try different sorting algorithms. Both QuickSort and MergeSort look fairly promising.

julia> daterange = Date(1980,1,1):Day(1):Date(2020,12,31)
1980-01-01:1 day:2020-12-31

julia> makedf(N) = DataFrame(simulation=rand(Int,N), date=Dates.format.(rand(daterange,N), "yyyy-mm-dd"))
makedf (generic function with 1 method)

julia> bigdf = makedf(3_300_000);

julia> size(bigdf)
(3300000, 2)

julia> describe(bigdf)
2×8 DataFrame
│ Row │ variable   │ mean       │ min                  │ median     │ max                 │ nunique │ nmissing │ eltype   │
│     │ Symbol     │ Union…     │ Any                  │ Union…     │ Any                 │ Union…  │ Nothing  │ DataType │
├─────┼────────────┼────────────┼──────────────────────┼────────────┼─────────────────────┼─────────┼──────────┼──────────┤
│ 1   │ simulation │ 5.26081e11 │ -9223370617627936065 │ 4.23282e15 │ 9223365035282778695 │         │          │ Int64    │
│ 2   │ date       │            │ 1980-01-01           │            │ 2020-12-31          │ 14976   │          │ String   │

julia> @btime sort($bigdf, ["simulation", "date"]);
  1.820 s (260180 allocations: 307.08 MiB)

julia> @btime sort($bigdf, ["simulation", "date"]; alg=QuickSort);
  1.391 s (52 allocations: 75.53 MiB)

julia> @btime sort($bigdf, ["simulation", "date"]; alg=MergeSort);
  1.484 s (54 allocations: 88.12 MiB)

The mutating sort! does seem to perform worse…

julia> @time sort(bigdf, ["simulation", "date"]; alg = QuickSort);
  1.505489 seconds (52 allocations: 75.534 MiB, 3.87% gc time)

julia> @time sort!(bigdf, ["simulation", "date"]; alg = QuickSort);
  1.731357 seconds (93.02 k allocations: 55.088 MiB)

Dominic_Pazzula · June 12, 2020, 12:29am

Hey Tom! Was just talking about you (see above). I’ll give them a shot. Thanks!

dlakelan · June 12, 2020, 1:14am

Hey I bought that book a couple weeks ago… seems pretty good. I have the Kindle version so no covid

tk3369 · June 12, 2020, 2:30am

@Dominic_Pazzula Another idea is to roll your own parallel merge sort. I’m not sure if and when it will land on Base but there’s a sample implementation:

tk3369 · June 12, 2020, 2:51am

Also, I have found much better performance with Date rather than String objects. Perhaps try again with sorting algorithm specified.

tk3369 · June 12, 2020, 6:26am

Also check out this amazing package by @tkf

Dominic_Pazzula · June 12, 2020, 11:57am

I actually did this in Go as a POC a while back. Passed a table from SAS to Go, sorted it, passed it back and wrote the result faster than PROC SORT.

I will definitely check out the ThreadX package.

Tamas_Papp · June 12, 2020, 2:24pm

This is the perfect opportunity to get some camo face paint and rappeling gear, infiltrate the site via a HALO jump landing on the roof, break in and Get The Book (which will also be the title of the movie — I have checked on IMDB and it appears to be unused; yet).

(If @tk3369 writes a sequel, there will be one for the movie, too, titled Get The Book 2: The Multiple Dispatch.)

nalimilan · June 14, 2020, 6:28pm

sortperm is known to be slower than it could, and sort for DataFrames needs to use that, so there’s room for improvement. Using radix sort would probably help too.

klwlevy · August 11, 2023, 9:14pm

Another ex-SAS user here. I only today noticed the surprisingly slow sorting of Dataframes in Julia (compared to SAS proc sort). Will try the tips above to see if they help in my example with only 300k rows and sorting times of >10 seconds.

EDIT: It turns out that my dataframe sort was so slow because of unintentional use of the Decimal type instead of the usual Float64. Using the latter the sort time went from 10 seconds to negligible, 300 microseconds or so.

Topic		Replies	Views
How is the data ecosystem right now for large datasets? Data	35	6779	July 13, 2017
ANN: DataConvience.jl v0.1.4 - faster sorting for DataFrames Package Announcements sort , dataframes	6	904	June 26, 2020
Sorting seems to have some low hanging speed fruit for sorting by single column Data performance , sort , dataframes	7	954	December 6, 2017
Any way to speed up sorting a dataframe? A more efficient sortperm would be great General Usage sort , sortperm , dataframes	2	959	June 25, 2020
Help with first time working on performance New to Julia performance	6	1028	April 14, 2017

DataFrame sort Performance using Query.jl vs SAS PROC SORT

Related topics