Firstly, I have experienced a few discrepancies between my own results and Mr McKinley’s benchmarks. Check my Twitter feed and my replies to Mr McKinley’s previous tweets. So I don’t know sometimes; it could be that my setup is different and the files I read are different.
Secondly, to use the pyarrow CSV reader for subsequent processing you would need to convert the read object into a pandas DataFrame. So the benchmark should measure reading and then conversion into a usable form, and I don’t think pyarrow does that well, from my own testing (refer to point 1).
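To be concrete, here is roughly the extra step I mean, as a sketch via PyCall (the file name is just a placeholder, and it assumes pyarrow is installed in the Python environment PyCall uses):
using PyCall
pacsv = pyimport("pyarrow.csv")           # Arrow's CSV reader
tbl = pacsv.read_csv("some_file.csv")     # returns an Arrow Table, not a DataFrame
pdf = tbl.to_pandas()                     # extra conversion needed before pandas-style processing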
I believe that is what my benchmark does, see here.
You mean Wes McKinney? What is your twitter account? Also, I’m not writing about any benchmark by Wes, I wrote about the benchmarks that I created and ran.
I had a quick look at them. I think the tested files are too small. To me, 7 s vs 11 s isn’t that big a deal, but 7 min vs 11 min is. And many of these things only show up for large datasets, I think because of RAM usage differences.
evalparse. Even I have trouble finding those tweets. But I distinctly recall Mr McKinley tweeted about some benchmarks; I then checked them with the Fannie Mae data that I posted above, and on my machine the benchmarks would have come out different to what’s in the blog post.
Memory is important. If 2 threads already max out your available RAM, then 16 threads ain’t gonna help.
I’m not saying that my benchmark is great, but I am saying that the benchmark mentioned at the top of this post does not seem to use the state-of-the-art competition on the non-Julia side. Julia might well have the fastest CSV parsing today, but to figure out whether that is actually the case, a benchmark needs to use the state-of-the-art competition (pyarrow), not the previous, many-years-old CSV reader (pandas).
I think it would just be really helpful if the benchmark that was quoted was redone with pyarrow in parallel mode included.
The real user experience is the performance of first read.
Sometimes I feel these benchmark results do not reflect reality – we can say how fast it loads from the second time on, but in practice nobody would load the same file again a second time.
Actually that’s not always true (though it is true in most cases).
I once had data that was scattered across tens of thousands of CSV files with the same metadata. They were “indexed” by the folder structure. There was no point in combining them all into one huge file, because then I’d get a file close to 1 TB, and usually only a small subset of these CSVs was needed.
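Roughly what that looked like, as a sketch (the folder layout here is made up): pick the folders you need and only read those files.
using CSV, DataFrames
# hypothetical layout: data/<year>/<month>/*.csv — read only the needed slice
files = filter(f -> endswith(f, ".csv"), readdir("data/2019/12"; join = true))
subset = reduce(vcat, [CSV.File(f) |> DataFrame for f in files])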
But of course, it’s an exceptional case; usually you really do not need to load a CSV more than once.
This takes 7 s on my computer versus 811 ms for pandas and 275 ms for fread in R.
I agree that the first time is important. Then again, if it is just a few seconds (maybe 7 is too much though), I don’t think it matters too much to the user. At the end of the day 7 seconds won’t matter much.
ENV["R_HOME"] = "C:\\Program Files\\R\\R-4.0.2"
Pkg.build("RCall")
@time using CSV;@time using DataFrames;@time using PyCall;@time using RCall;@time using BenchmarkTools
R"require('data.table')"
pd = pyimport("pandas")
bdir=mktempdir()
cd(bdir)
download("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2019-12.csv", "test_data.csv")
@time df = CSV.File("test_data.csv") |> DataFrame; # including compilation #7s
@btime df = CSV.File("test_data.csv") |> DataFrame; #200ms
@btime pydf = pd.read_csv("test_data.csv"); #811ms
@btime rdf=R"fread('test_data.csv')"; #275ms
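For what it’s worth, adding pyarrow (which reads with multiple threads by default) to the same script should look roughly like this; I haven’t re-run the timings, so no numbers here:
pacsv = pyimport("pyarrow.csv")
@btime patbl = pacsv.read_csv("test_data.csv");              # Arrow Table, multi-threaded read
@btime pydf2 = pacsv.read_csv("test_data.csv").to_pandas();  # including conversion to a pandas DataFrame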
It is my impression from the graphs that you get 60–80% of the peak performance with 6–8 threads, which is pretty standard for machines used in numerical work these days. The marginal benefit of extra cores diminishes heavily after this. But, of course, having a beefier CPU never hurts.
If I were a potential new user considering a switch to Julia, after reading a blog post like this the questions I would be most interested in are:
The ingredients for this great result. What is special about Julia that makes this feasible? Or was it just a ton of hard work that other languages could replicate? Or is it something that is relatively easy in Julia, but would take a lot of effort in other languages?
If I have a complex parallel calculation I need to make fast, is there a general lesson here about Julia that applies to my problem?
How much of the improvement in CSV reading is coming from the language and the library? Is there anything that made problems like CSV reading slow initially, was recognized by the language designers, and was remedied to improve it? This tells a great story about how responsive the language community is to performance problems.
I have to disagree. My workflow for data analysis is having a Jupyter session that stays open for the whole day, and in the same notebook I sometimes read multiple distinct CSV files, and many times I read the same file dozens of times in a day, because either: (1) it is the most practical way for me to get the unchanged data and be sure I did not tamper with it in any way; or (2) I have re-executed the script (outside of the Jupyter notebook) that generates said CSV and now it has new fields.
Completely agree - maybe it’s just evidence of a careless way of working, but I’ve never done any data wrangling task where I didn’t have to load the data many times over before figuring out the correct sequence of steps to produce the desired output from the unaltered raw data.
I agree (from the perspective of a data analyst). One possible scenario is that data loading, data manipulation and data analysis are developed with a small data set and then go into production with really large data sets. Maybe the “performance of first read” is then no longer so painful.
I work a lot with R, and in this environment it is, in my experience, not the performance of the individual processing steps described above that matters most, but the overall performance of the ecosystem.
Of course I think it’s great when Julia reads CSV files super fast, but is that enough to convince a data analyst of Julia’s performance? What counts here is the whole Julia ecosystem and that has to stand up to R and Python.
But of course it should be celebrated that Julia is so fast!
I’ve been wanting to move away from R to Julia for years. The thing that kept me in R was my familiarity with it, and ggplot producing really high quality graphs.
Speed is an issue, but usually things bog down inside Stan. A major reason for wanting Julia is to get a single, unified language that’s fast and well thought out. I hate that R is so hackish, with several different object-orientation systems, nonstandard evaluation, and all the performance-sensitive code living inside C/C++ libraries.
At heart I’m a Lisp hacker. Julia has that feel of a language made for people who wear big boy and girl pants… you need to put in some work, but then you have POWER. R is like a toolbox with plenty of wrenches to let you adjust someone else’s knobs… but Julia feels like a workshop where you have welding machines, lathes, mills and plasma cutters.
This, oddly, captures exactly how I feel about Julia. I started at 0.5, and I still consider myself a beginner/intermediate. I am slowly learning computer science fundamentals (like how threads work, what allocations are, heap vs stack, and so on).
I’ve converted a few friends to Julia, and when they wrote their first scripts, they were horrendous: globals, no type stability, no preallocation of vectors (or allocating inside a loop), and so on. This is because they were coming from R/Matlab and were not trained in CS (I had to explain to them what “types” are and the difference between column- and row-major memory storage).
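To give a made-up but typical example of what those first scripts looked like versus the fix (just a sketch):
# slow pattern: untyped global updated inside a top-level loop
xs = rand(10^6)
total = 0.0
for x in xs
    global total += x          # `total` is a non-const global, so each iteration is dynamically dispatched
end

# faster: put the work in a function and preallocate the output once
function running_sum(xs)
    out = Vector{Float64}(undef, length(xs))   # preallocate instead of push!-ing in the loop
    s = 0.0
    for (i, x) in enumerate(xs)
        s += x
        out[i] = s
    end
    return out
end
running_sum(xs);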
Over time, I have started to doubt the notion that the two-language problem can really be considered a problem. R and Python have both gradually evolved into interfaces to C/C++. When a new idea comes out, it can be quickly written in R/Python if the author just wants to publish a paper. Meanwhile, she can write it in C/C++ and interface it with R/Python if it is considered a computation-intensive, long-term project.
This mechanism guarantees that the most commonly used packages in R and Python do not have severe performance issues. There is no doubt that it is much easier to write Julia code than C/C++ code. However, when the package is a long-term one with many users and contributors, the extra effort is spread out over many contributors and a long time period, so for each contributor the extra effort may even become insignificant.
There are several problems with this model. The first is that many types of problems require fast user-level functions, which Python/R do not handle well. Another is that it makes it much harder to improve important code: in Julia, anyone who can code in Julia can submit a PR to improve a library they are using, whereas with a two-language solution they need much more experience and skill, and the fix will likely be harder anyway. Thirdly, by having performance-critical code in C, you inherit all of the inflexibility of C, which makes designing good APIs harder.
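As a toy illustration of the “fast user-level functions” point (a sketch, not from the post above): an ordinary user-defined function passed to library code gets compiled and specialized, so there is no callback overhead in the hot loop.
using BenchmarkTools
cap(x) = x > 10.0 ? 10.0 : x     # plain user-level function, no C extension required
v = rand(10^7) .* 20
@btime sum(cap, $v);             # sum() is specialized on `cap`, so the loop stays tight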
When I write code, which is frequently (almost always) highly performance-sensitive, it will be used by a small group of people, none of whom can write C code. As far as I know this is the norm among engineers in my field. We know Matlab and Python. I know Julia. No one knows C. (I’ll amend that: off the top of my head, I have had two colleagues who knew C in about 20 years.)
I think you underestimate the pain that slow custom code written by a researcher can cause. Waiting for slow code sucks!
it can be quickly written in R/Python if the author just wants to publish a paper.
actually means hours of wasted time waiting for code to run. And it’s not obvious how to improve performance in, say, dplyr, whereas in Julia there are a bunch of ways to write more optimized functions going in and out of, say, DataFrames.
I think the extremely rapid growth of the Julia ecosystem shows exactly why it’s a problem. In Julia, we’re able to accomplish the same amount of work with orders of magnitude less effort. How many people have worked for how many years to make R and Python’s CSV parsers as fast as they are? Then @quinnj comes along and absolutely smokes them singlehandedly. Oh, and then every single Python project seems to need its own bespoke CSV parser, because who has ever heard of code reuse or composability? In Julia, you can compose CSV.jl with any tabular data structure implementation, and the compiler specializes the composition of the two, so we only need one really good, generic implementation.
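A concrete sketch of that composability (reusing the file name from the benchmark above): the same CSV.File feeds any Tables.jl-compatible sink, and each combination gets its own specialized code.
using CSV, DataFrames, Tables
f = CSV.File("test_data.csv")       # one generic, fast parser...
df = DataFrame(f)                   # ...materialized as a DataFrame
ct = Tables.columntable(f)          # ...or as a NamedTuple of column vectors
rt = Tables.rowtable(f)             # ...or as a Vector of row NamedTuples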
The clear trajectory I’m seeing is that in an ever-increasing number of areas, Julia is too good to be ignored. Why? Because the language offers so much leverage to library developers that the result is hands down better than what will ever be available in Python or R. Do you think either language will ever catch up with Julia’s DiffEq ecosystem in breadth or performance? Well, it’s not like solving differential equations wasn’t useful before Julia came along, and they had decades to do it and didn’t, so I would say we already know the answer. Instead of taking a small army of C and C++ experts (who are somehow also diffeq experts?) a decade to write something like DifferentialEquations.jl, it took just @ChrisRackauckas a year or so to eclipse what’s available elsewhere.
All it takes to become dominant in any area of numerical programming is one or two good Julia programmers with a vision to spend a few years making that vision a reality. After that there’s simply no catching up using a lesser language. It’s too late—game over. This work doesn’t happen overnight, but it is happening in one area after another.