How is the data ecosystem right now for large datasets?


#1

I am currently working with a dataset of several hundred million observations in Stata, which as I am aware is not particularly efficient at handling datasets this large. It can take close to an hour to process a few split-apply-combine statements for example, and this is on a fast computer.

Are the algorithms used in the Julia data ecosystem good for this? In particular, there is a Stata package, ftools, which claims that Stata’s built in algorithms are O(N log(N)), but that O(N) algorithms exist for many standard operations. I don’t know much about working with large datasets, so I can’t say why this is, but maybe the Julia packages use the more efficient algorithms.

One discussion I’ve seen on this topic is https://github.com/JuliaData/DataTables.jl/pull/17 , which suggests R’s data.table is considerably faster than the Julia version. I don’t know if this has improved, or if things are better in Query or another package.

I would rather use Julia for data work because I like the language, and if it is faster than Stata that is even better!


#2

I am in a similar situation (500 GB of CSV data, 3e8 observations, panel data matched by individual IDs). I tried to explore options in the Julia ecosystem, but similar questions here did not get answers, so I am under the impression that there is no general out of the box solution.

I ended up creating a 1% subsample of my data for exploratory analysis, then after I decided what to do I encoded things in a tighly packed binary format which I can serialize, it also fits in memory. I read the data and convert, then work with the gzipped serialized data dump.


#3

Unfortunately, we haven’t done a lot of benchmarking at the moment. On the positive side, data.table is known to be very fast, and there’s no guaranty that Stata is as fast as data.table. So we might be competitive with Stata. I would love to find out. If you could post an example of the operations you’re doing (with dummy data which looks more or less like what you have), that would be useful.

It’s also possible things improved a bit with recent pull requests (though probably not by much), and if you want to use aggregate there’s this open PR.

Finally, for large data sets like this, using database engines via one of the various JuliaDB packages can be a good idea.


#4

Also, I don’t think the O(N) vs. O(N log N) criticism applies to us, as we don’t use the naive grouping algorithm that (they say) Stata uses. Ours is inspired by Pandas, and we don’t sort on each variable unnecessarily. Looks like we use similar approaches based on hashing.


#5

You can use Pandas.jl to get the efficiency of Pandas C-based data frames while generally retaining the good parts about Julia.


#6

Is there a place to go to see current benchmarks of Pandas.jl vs. DataFrames/DataTables and data.table/tibble? It seems like these benchmarks are being used quite a lot in the development of these packages.


#7

As I said these benchmarks don’t really exist, at least not in a systematic way (which is unfortunate).


#8

Oh, you did say that above. Sorry.


#9

I am in a similar situation right now. As far as I can tell, DataTables will give pretty good performance, but I’m not really doing much grouping and joining, most of my work has to do with getting tables into machine-ingestable formats.

Here has been my approach, though I have no idea how similar our use cases are:

  • I implement the DataStreams interface everywhere that doesn’t already have it. This provides a bare minimal interface to tabular formats that can be plugged into larger interfaces. SQL is a huge pain in the ass and a significant obstacle to getting this kind of thing to work nicely. Part of the reason why SQL is so terrible is that there are about a billion different implementations of it, all of which are completely different but all of which claim to be SQL. I really wish it would just go away.
  • I dump data into feather (Feather.jl) files. I have a PR which allows me to access individual fields in feather files. The PR is in limbo right now because @quinnj is overhauling DataStreams, but I expect it’ll be merged after being changed to be compatible with whatever the new DataStreams interface looks like.
  • I have written a package Estuaries.jl which allows me to pull views from anything that implements the DataStreams interface. Estuaries itself has a DataTables-like interface. I haven’t yet put any thought into using it for doing grouping and joining.
  • I’m currently working on a systematic way of getting data into machine-ingestable formats in TheDataMustFlow.jl. It’s very much a work-in-progress, but at least in principle it allows me to apply machine-learning to any dataset which implements the DataStreams interface (even those too large to fit in memory) and I have designed it to be able run in parallel or on distributed systems (not something I’ve tried yet).

Again, that’s not saying much about grouping and joining, currently my thinking is that I’ll try to do stuff like that on whatever DB stuff I’m forced to integrate with. A lot of people around me insist on using Python so I suspect that I’ll eventually make more use of Pandas.jl for interoperability purposes, at which point I’ll write a DataStreams interface for it as well.


#10

Just for information, I very much doubt that Query.jl would do well on this kind of thing currently. I have worked on performance for a small number of operations, but for many others I just use relatively naive algorithms for now. The overall design should allow for very efficient algorithms, but that requires a lot of careful work on performance. I hope to do that at some point.

Having said that, I would actually be quite interested if someone did some performance comparisons on things like groups, joins etc for the various options that we have in julia right now and how they compare to the speed in R and Python etc. That would help with prioritizing future performance work.

Best,
David


#11

I just did a very rough benchmark. The test is

N = 100_000_000
A = rand(N)
B = rand(1:100, N)
dt = DataTable(A = A, B = B)
@time by(dt, :B, d -> mean(d[:A]))

I did this for DataTable, DataFrames, Query, Pandas.jl, and in Stata
DataTable: 7.6s
DataFrames: 9.9s
Query: 10.5s
Pandas: 1.8s
Stata: 148s (this is a by and a join)

Here’s the code I used for Query and Pandas
Query:

@time x = @from i in dt begin
@group i.A by i.B into g
@select {m = mean(g)}
@collect DataTable
 end

Pandas:

df = DataFrame(Dict(:A=>A, :B=>B))
mean(groupby(df, "B"))

#12

I haven’t done any true benchmarking but I have found the performance of Query.jl to be very poor on large datasets, and quite unpredictable in general. After a quick survey of your code it certainly wasn’t clear to me why it should perform so poorly, but I certainly didn’t do any kind of detailed analysis. @aaowens example of Query.jl performing comparable to DataFrames and DataTables has certainly not been my experience, but perhaps I should revisit it.

I would certainly be extremely interested in Query.jl working on larger datasets. Query.jl is the one thing in the ecosystem right now that, with some fairly minor extensions, would solve most data manipulation problems. Another important feature for large datasets that’s missing from Query.jl is some sort of batching scheme. I was considering looking into if I can use Query.jl as a front-end for TheDataMustFlow, but I don’t know if that’s realistic yet.

I’m also pretty confused as to why Pandas wins by that much. Aren’t the groupby algorithms used basically the same now? I kind of dislike pandas because I find it opaque and unnecessarily complicated, so I’d be pretty happy if we could beat it.


#13

Thanks so much, this is super helpful!!

I have to say I’m quite surprised Query.jl doesn’t fare worse than this. It is not great compared to pandas, but given that I have not done any performance work at all on the grouping operation, I’m surprised it is somewhat competitive with DataTable.

I just tried one small modification to your benchmark: I created a DataTable where the columns are just Arrays, not NullableArrays (I think in general that will be easier in practice with a PR that is pending). On my machine that improves performance of the Query version by about a third, and when I compare the DataTable performance with the Query performance it seems more or less a wash (i.e. much closer than in the case with NullableArray columns).

I think one reason for that is probably that I’m using the DataValue type from https://github.com/davidanthoff/DataValues.jl inside the query, and that does not yet hook into the whole null_safe_op machinery that Nullable uses to make elementary operations like addition etc. more performant. My hope would be that once those basic operators in DataValues.jl are optimized in the same way as the Nullable operations in NullableArrays.jl we would see better performance. PRs would be very welcome!


#14

One area that I think I know is problematic is Strings (yeah, I know, those are kind of important…). In particular, I think if any column holds Strings I will run into the issue that a struct that has a field that is of a type that is allocated on the heap (like String), that struct will also always be allocated on the heap. If that is true (I’m not 100% sure for Strings), it should completely kill performance with the current Query design: at the core it iterates over a stream of NamedTuples (one for each row), and if these NamedTuples end up being allocated on the heap instead of the stack all hope is lost. I’m not doing anything about that right now because I hope that https://github.com/JuliaLang/julia/pull/18632 will just solve that problem. So generally, I would not be surprised if performance is very unpredictable between e.g. a source with Strings and one without.

But any pointer about inconsistencies that you encounter would be great. Even though I haven’t done much perf work in general, that can really help me focus in on important areas.

Yes, I certainly want it to work with large datasets as well! We’ll have to see how far we can go with the iterator based design it has right now. In principle that is all based on a streaming model that should work pretty well with large datasets, but that is more a hunch of mine right now. Having said that, Query also supports completely custom implementations of all the query operators by a given source type. For example, it would be entirely feasible that when you run a Query query against JuliaDB, it would not use the iterator based implementation of the query operators at all, but instead optimized algorithms from JuliaDB. A fair bit of work to get that up and running, but the core design of Query.jl has that kind of scenario very much in mind.

Yeah, that would be good to understand. One explanation would be that it is using some sort of index, i.e. if the column by which things are grouped was already stored in a sorted way, it would be pretty clear why this could be so fast. But I don’t think that is the case here, actually…

Another interesting comparison would be dplyr and R data.table.


#15

I have been considering using Feather for sanitized data that does not fit into memory, since it can be mmaped. But I don’t know how to write a Feather file when your data won’t fit in memory — I can read it into SQL, but can I then just save it by column progressively?


#16

Thanks @aaowens. At least, even if we’re four times slower than Pandas, we’re much closer to it than Stata. :slight_smile:

That said, I’ve done some profiling, and it turns out we should be able to use a more efficient approach. I’ve commented on technical details on a DataTables PR.


#17

By the way, using @time in this way includes compile time, so that is another possible reason for the apparent slowness of the Julia packages.


#18

I think you can use append in DataStreams. I’ll have more info on this soon.

Unfortunately my initial experience with it was bad enough that my response was to immediately abandon it rather than try to figure out the real reasons for the slowness. That said, I should probably revisit it, it would be very gratifying to be able to use Query.

By the way, does Query support Pandas? Is there a DataStreams interface for Pandas?

Before I started doing data science and was doing HEP, we always used a serialization format called a ROOT tree and in retrospect I really wish I had taken more time to appreciate how wonderful the tools I was working with really were. I’ve actually toyed with the idea of going back to using ROOT trees. One of the big problems with doing data science as opposed to HEP is that you have data in all sorts of ridiculous formats like strings and the ungodly absurdity that is date-time. When I did HEP everything was a double (occassionally an integer) and if you had suggested to me that I store date-time data anywhere on my file system I would have kicked you out of my office. This has led me to toying with the idea that all data should be only ints and floats, but that it should be made easy for it to be displayed to humans in other ways (this is already the case of course, but what I mean is more along the lines of only allowing data to be stored in e.g. Vector{Int} or Vector{Float64}). I’m getting waaaay off topic now, but these are just some of the ideas that have been bouncing around in my head. I’m really sick of spending time manipulating data. I’m sure we all are.


#19

My Stata benchmark wasn’t quite right. Stata doesn’t support multiple in memory datasets, so the by operation I benchmarked was actually a by plus a join.

Regarding joins, for DataTables:

dt2 =  by(dt, :B, d -> mean(d[:A]))
join(dt, dt2, on = :B)

For pandas:

df2 = mean(groupby(df, "B"))
df3 = merge(df, df2, left_on = "B", right_index = true)

I suspect there is some way to do this all in one go, like with broadcasting and the dot notation in Julia.

DataTables: 459s
Pandas: 14s


#20

Ouch. Thanks for catching this. I really need to reconsider the new algorithms we use. Better use DataFrames until then, it gives timings much closer to Pandas.