Using DataFrames: ~ 10 seconds

I AM NOT COMPLAINING. I AM MERELY WONDERING.

I have seen a couple of messages about slow package adding/compiling/using times. Why is ‘using DataFrames’ so slow? (I mean on later loads, not the first use, so presumably DataFrames has already been compiled at least to an intermediate stage and saved to disk somewhere.)

I am on a real high-end computer here, and ‘using DataFrames’ still takes me 10 seconds under 0.6.2. I suspect that it can easily take 20 seconds on something slower.

I know that DataFrames relies on many packages, but this still seems like a painfully slow start.

Is this ‘using’ so slow because the first time it only compiles to JIT and there is a later final compilation (from JIT?) that takes so long? Could something closer to the final compiled code be saved? Will ‘using DataFrames’ speed up in 1.0, or will this be a long-term problem?

regards,

/iaw

While there are things done during precompilation, IIUC it doesn’t actually cache any native compiled code. There have been discussions in the Slack saying that yes, this is a big thing on the 1.x list.

What environment? On a 4-year-old MacBook Pro, I’m seeing about 2 seconds startup from a fresh REPL, and 6 seconds after switching to DataFrames master and going through the precompilation step (where the warnings take up a significant amount of the time).

3.2 sec, on a relatively fresh install of 0.6.2, but I don’t have many packages downloaded, just DataFrames and its dependencies. There might be an old dependency somewhere?
It requires these locally (some names shortened a little)

CatArrays
CodecZ
Compat
DataFrames
DataStreams
DataStructs
Missings
NamedTuples
Reexport
SortingAlgs
SpecialFuncs
StatsBase
TransStreams
WeakRefStrs

> @time using DataFrames
3.16 sec,  2.2M allocs, 124.7MiB, 10.8% gc time

3.2 seconds on 0.6.2? That is very speedy! I am on a macOS 3.2 GHz Xeon W with huge amounts of RAM and a super-fast SSD, running Julia version 0.6.2, and I get:

$ julia -q --startup-file=no
julia> s=now(); using DataFrames; now()-s
8869 milliseconds
^D
$ julia -q --startup-file=no
julia> s=now(); using DataFrames; now()-s
8697 milliseconds

How do I find (obsolete) dependencies, and more importantly, how do I clean them out? Pkg.update() tells me that there is nothing to install, update, or remove.

PS: Thx, Chris, for letting me know that this is on the 1.x agenda.

At one point, I had the same problem and removed everything in the ~/.julia/v0.6 directory and started over. That brought the time from ~30 seconds down to a few seconds.

IIRC, you had Gadfly in your collection. Gadfly is still waiting for some infrastructure changes to be ready for current DataFrames, so others here are likely reporting on a newer version of DF than yours.

For me, it is fast:

julia> s=now(); using DataFrames; now()-s
599 milliseconds

I am still on Julia 0.6, though, i7-7700K CPU.

Update: With Julia 0.6.2 it takes now 820 ms.

I did have a whole lot of packages, including Gadfly, on the slow computer. I just tried it on a computer half as fast, with no Gadfly installed, and got only a 2-second delay. So these packages did indeed slow down DataFrames.

  1. conceptually, what happens when ‘using DataFrames’? What would allow an existing Gadfly (optional load?) package to slow down the DataFrames loading? Is Gadfly loaded, and Gadfly checks something elsewhere on the net, which makes it so slow?

  2. is there a way to ask for a “quick local using load”, rather than an involved load with all options and net checks? so that optional packages do not slow it down?

  3. how would a new user diagnose such problems? I was completely clueless here, and would have considered it a basic julia flaw. this sort of terrible slowdown could become a user-experience negative for newbies that can easily lead to negative press for julia itself.

  4. is starting over completely necessary? or just a removal of Gadfly?

  5. are there packages that are better left uninstalled to avoid slowing down other, more essential packages, or is Gadfly the exception?

thanks everyone. /iaw

You have arrived in our neighborhood while many of the utilities are being renovated; one can live well here, but there are various inconveniences. The legacy package manager has well known problems (although this particular one seems rare), but we hope and expect that Stefan K’s well-designed replacement (Pkg3, now in beta?) will correct most of them. In particular, isolated environments with small dependency trees will be a great feature. This situation does mean that there isn’t much incentive to hunt down particular flaws in the old one.

Gadfly is one of the last packages to await transition to the new version of DataFrames, so it pins it in such a way that other interdependencies are quite tangled. This presumably leads to a long tour through the awkward file layout of metadata. You could try removing just Gadfly, then running Pkg.update.
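
A minimal sketch of that suggestion, assuming Julia 0.6 and the legacy package manager (these `Pkg` functions belong to the old manager, not to Pkg3). Whether removing Gadfly actually frees DataFrames to upgrade depends on what else in the installation constrains it, so it is worth checking the version before and after:

```julia
# Julia 0.6, legacy package manager (not the Pkg3 API).

# List installed packages with their versions; pinned packages are marked.
Pkg.status()

# Record the currently installed DataFrames version.
Pkg.installed("DataFrames")

# Remove only Gadfly, then let the resolver pick newer versions of the rest.
Pkg.rm("Gadfly")
Pkg.update()

# Check whether the DataFrames version changed.
Pkg.installed("DataFrames")
```

After the update, the first ‘using DataFrames’ goes through precompilation again, so the load time is best compared in a second fresh session (`julia -q --startup-file=no`, then `@time using DataFrames`).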

It’s ok to complain. Start-up times are going to be a serious issue for the adoption of the language because it is going to really piss off a lot of people, even though the majority of current Julia users (myself included) are perfectly fine with compile times and related issues (except for the current state of plotting).

Like most others, I don’t have this problem, however; it takes about 3 seconds for me. As @Ralph_Smith stated, the package manager has serious problems right now, and it’s extremely unfortunate that a lot of people are stuck on old versions of all sorts of things. I suggest you try really hard to make sure everything is as up-to-date as possible.

I am quite happy to wait for 1.x. I would be less happy if I did not see relief in sight. I do worry about the “first impressions” that julia leaves on newbies.

PS: I am wondering why load times for packages could not be sub-20ms, assuming a fast SSD. dependency lists and code should all be compilable and remain nearly the same between uses.

I can tell you from experience that people coming from Python will be apoplectic over this stuff. That really annoys me, but that’s how it is. That said, I don’t think Julia should focus on winning over people who are using Python who are happy with Python. Those people are probably just not interested in the benefits that Julia has to offer. I think we will have much better luck with the people who don’t like Python, who are looking for alternatives like Cython, Numba or Scala, and users of R and MATLAB.

My understanding is that there is usually some amount of actual compiling going on, though I am becoming increasingly hazy on how exactly the compiler works, particularly in 0.7 now that there are fancy things like constant propagation and (apparently) extremely aggressive inlining (which I’m loving!). You have to remember that Julia is not like C++: what needs to be compiled for a package is usually not uniquely determined by the package itself. Fortunately most of us here probably care way more about these fancy compiler features than we do about compile time.

they will look first for dataframe and plot functionality, and scratch their heads. but I have made this point elsewhere. no need to warm up again.

Agreed. R users are used to the powerful data wrangling and visualization tools while Python users are even less likely to transfer to Julia.

I use Julia as a Free Matlab and Modern Fortran, and I have no issue with the starting time and package loading time. However, I think if people use Julia in a more interactive way, they may get upset.

I don’t understand that. DataFrames.jl by now is really excellent (especially with DataFramesMeta.jl and Query.jl). I have used IndexedTables.jl far less, but it seems very nice. Admittedly plotting is the one area where the loading times are completely screwing everything up. Plotting is definitely a serious problem right now.

I think it’s a bit unfair to point to plotting though. The plotting packages are just the first place where it seems to matter that native compiled code isn’t cached, and the whole ecosystem needs this kind of tooling. When that gets added the plotting should be “fixed”.

My understanding was that this sort of thing wasn’t even really on the radar until well after 1.0, am I wrong about that?

It’s not on the radar until a 1.x since it’s non-breaking to add it, that’s correct. But from the discussions it doesn’t seem to me that it will be “well after”.

On the R plotting thing - do note that the gold standard plotting package in R, ggplot2, is a package/module - not in core. Likewise dplyr is an R package, not core. And they came into R quite late - well after it was a dominant datasci language. So, I’m not sold yet on Julia itself being insufficient. I will be very happy when Gadfly.jl, Query.jl and DataFramesMeta.jl are at an R level though. DataFrames.jl is awesome already, post 0.11.

Second, I’m getting rather tired of hearing about how upset the “hordes” will be, and I’m not even a committer. Not to ignore future users, but in my opinion this constant worrying and hypothesizing really feels unhealthy and could even be warping the focus of the essential basics. Kudos to the core team for keeping that balance right.
