Many of my coworkers use SAS (gross!) and aren’t familiar with the other languages listed in the benchmark comparison. It would be nice to add SAS to the comparison to persuade corporate users of the benefits of Julia. Any thoughts on whether this is possible?
Might be useful. I’d much rather see a new set of tabular data benchmarks comparing R, SAS, Stata, Julia etc. on a set of common dataframe operations (grouped summarize, spread, etc.).
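For concreteness, a minimal sketch of the kind of operations meant here (grouped summarize and spread/pivot), using pandas as a stand-in; the table and column names are invented for illustration:

```python
import pandas as pd

# Toy table: one row per (region, year) observation; names are illustrative only.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year":   [2016, 2017, 2016, 2017],
    "sales":  [10.0, 12.0, 7.0, 9.0],
})

# Grouped summarize: total sales per region.
summary = df.groupby("region", as_index=False)["sales"].sum()

# Spread (pivot wider): one column per year.
wide = df.pivot(index="region", columns="year", values="sales")

print(summary)
print(wide)
```

A tabular benchmark suite would time exactly these kinds of operations in each ecosystem's idiomatic form (dplyr, data.table, PROC SUMMARY, DataFrames.jl, …).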
Agree that tabular data comparison would be nice as well. I’m not sure where that benchmark would go though. Do you think that the DataFrame operations and Julia are at a point where Julia would compare favorably to these other languages?
A necessary condition for benchmarking comparisons is having all of the different software installed and running on the same system. Given that SAS is proprietary and quite expensive, it seems unlikely that we're ever going to be able to include it in the benchmark results, so I'm not sure there's much point to having benchmarks for SAS since we won't be able to run them. Is there much reason to believe that SAS will differ from other slow, interpreted, high-level dynamic languages (Python, Matlab, R, etc.)?
Yes, SAS is built to be fast for common data operations, and, though less flexible, is light-years faster than R.
Do you think that the DataFrame operations and Julia are at a point where Julia would compare favorably to these other languages?
I think that’s up in the air. R + dplyr and R + data.table are very fast, and so is SAS (when doing the kind of things they are good at, tabular data operations).
My understanding is that SAS's performance is not due to an exceptional language implementation but rather to high-quality, out-of-core runtime libraries. Since our benchmarks are explicitly designed to test the language implementation itself, it doesn't seem likely that SAS would be exceptional here. But if someone wants to implement SAS benchmarks and run them together with C and Julia somewhere, the results could turn out to be of interest.
I guess what I’d like to see is a table package performance benchmark: R, R + dplyr, R + data.table, SAS plus its runtime libraries, Julia + Query.jl, Julia + DataFramesMeta.jl, etc.
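The harness for such a benchmark could be quite small: run the same grouped summarize in each ecosystem and report the best of several timings. A hedged Python sketch of the idea (synthetic data; the `bench` helper and all sizes are invented for illustration):

```python
import time
import numpy as np
import pandas as pd

def bench(fn, repeats=3):
    """Return the best wall-clock time of `fn` over `repeats` runs."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

# Synthetic grouped data; a real suite would use a shared on-disk dataset
# so every ecosystem reads identical input.
rng = np.random.default_rng(42)
n = 100_000
df = pd.DataFrame({
    "key": rng.integers(0, 1_000, size=n),
    "val": rng.random(n),
})

t = bench(lambda: df.groupby("key")["val"].mean())
print(f"grouped mean over {n:,} rows: {t:.4f}s")
```

Taking the minimum over repeats (rather than the mean) is the usual choice for this kind of micro-benchmark, since it best approximates the cost with warm caches.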
That is a much more interesting comparison.
I don’t think SAS is really an appropriate comparison language for the benchmark on the homepage. Now, there is a growing consensus that other benchmarks are needed, one of them being data science related. There is some good work going on in SASLib.jl to get .sas7bdat files imported quickly, and there is also some good work going on related to sorting algorithms, for example. In both cases, comparisons to Python and/or R are being made when possible, but there still remains a need for a well thought out and executed suite of benchmarks that puts the different ecosystems to the test along several dimensions.
I just recently demonstrated how I can perform a group-by 60x faster in R, using fst and my disk.frame package, than in SAS.
SAS is actually slow because everything is disk-based. Yes, you can load data into memory, but it’s clunky and sometimes doesn’t yield performance gains. Also, its primary data format, SAS7BDAT, is row-oriented, so every operation requires some row-by-row logic and cannot benefit from columnar operations.
How did I get a 60x speedup? I use the fst format to load only the columns I need (not every column, unlike in SAS), and I process in parallel using all cores. Julia’s JuliaDB.jl, Python’s Dask, and R’s disk.frame can take on SAS for large data processing. Then we just need out-of-core algorithms for ML: OnlineStats.jl will provide many of them, and JuML.jl has a promising algorithm implemented.
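The column-pruning trick that fst and disk.frame exploit can be illustrated in any language; here is a pandas sketch (the parallel part is omitted, and the data and column names are invented):

```python
import io
import pandas as pd

# A toy CSV standing in for a wide on-disk table. In a row-oriented format
# like SAS7BDAT you pay to scan every column on every read.
csv = io.StringIO(
    "id,group,value,notes\n"
    "1,a,1.5,long text we never need\n"
    "2,a,2.5,more text we never need\n"
    "3,b,4.0,even more unused text\n"
)

# Read only the two columns the group-by needs. A columnar format
# (fst, Parquet) makes this cheap because each column is stored contiguously.
df = pd.read_csv(csv, usecols=["group", "value"])
result = df.groupby("group", as_index=False)["value"].sum()
print(result)
```

With a columnar file format the `usecols`-style pruning happens at the storage layer, so the unused columns are never even read from disk.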
SAS is going the way of COBOL. Many people are not complaining, because who wouldn’t want $300k/pa for being good at programming with it. Here in Australia top SAS freelancers command AUD$1500-AUD$2000 a day.