DataFrames.jl - Choosing between the core functions and available libraries (Query.jl, DataFramesMeta.jl, etc)


#1

In addition to the core functions in DataFrames.jl, I have come across libraries such as Query.jl and DataFramesMeta.jl which provide general querying and manipulation of DataFrames. Are there compelling reasons to use one of these options over the others? Are there notable differences in performance between these libraries? I am currently using Query.jl as it was suggested in the DataFrames.jl documentation, though there seems to be quite a lot of overlap with DataFramesMeta.jl in terms of functionality.


#2

In short, Query is more general, but DataFramesMeta is faster.


#3

There’s also JuliaDBMeta and SplitApplyCombine.jl, so clearly there’s a lot of activity in this area in Julia at the moment.

My take is that this just shows how strong the language is in allowing so much experimentation with different approaches to data manipulation. In most languages you see one, maybe two “main” frameworks for doing this kind of stuff, but that’s also because the subset of “language X developers who can also develop in C/C++” is small. In Julia, you can write everything in Julia and also match C/C++ performance, and so the pool of developers to contribute to such projects is much larger.

Obviously it can be tricky as a new user to know what to choose, but I think it’s very healthy for Julia to be experimenting with different approaches that each might have different strengths and weaknesses. I do think it would be worthwhile to have some kind of document that compares the various approaches, so a new user could glance over it and at least have some idea which direction to try first.


#4

As a new Julia user I have been exploring all of these packages lately and trying to figure out the best way to mix and match. I can appreciate the generality of Query, and its future is very bright, especially the prospect of translating queries into SQL one day. Without transform/mutate it can be tough to go Query-only, but I love the interface options (LINQ style or pipe/_ style). My first reaction was that this doesn’t do all that dplyr/SQL does, but that’s not really fair given the small number of contributors doing a ton of work.

I’m excited for the future of all of these packages; I think it’s just a matter of time. I will try to get further up to speed so I can help as well, and hopefully other people trying these out for the first time will see it as an opportunity, not a shortcoming.


#5

I know that I am a minority opinion here, but I find it absolutely delightful how much is possible using only Base and the simple methods provided within DataFrames.jl itself. Dataframes are basically just Dicts of AbstractVectors and usually the only “special” operations you’d want to use them for are from the relational algebra such as join and by (groupby). For myself at least, I can’t think of a single case in which some combination of join, by, filter, sort, indexing, and some basic user created functions are not sufficient, or result in code which is more confusing than if I had used some sort of domain-specific language. Since DataFrames are light-weight and Julia has good performance, this stuff is usually totally painless and I often wind up with only a few lines of code even without resorting to any of the querying packages.

Query.jl and DataFramesMeta.jl are great packages, but I also recommend giving the minimalistic approach a shot. I find there to be great virtue in data manipulation code being simple, “normal” code (i.e. using generic manipulations from Base rather than a domain-specific language).
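To make that concrete, here is a rough sketch of the join / filter / group / sort / index combination using only DataFrames.jl. The tables and column names are invented for illustration, and the code assumes DataFrames.jl 1.x, where `combine(groupby(...))` has replaced the older `by` function mentioned above.

```julia
using DataFrames
using Statistics: mean

# Hypothetical toy tables, made up purely for illustration
orders = DataFrame(id = [1, 2, 3, 4],
                   customer = ["a", "b", "a", "c"],
                   amount = [10.0, 25.0, 5.0, 50.0])
customers = DataFrame(customer = ["a", "b", "c"],
                      region = ["north", "south", "north"])

# Relational join on the shared key column
joined = innerjoin(orders, customers, on = :customer)

# Plain row filter with an ordinary anonymous function
large = filter(row -> row.amount >= 10.0, joined)

# Split-apply-combine: mean amount per region
by_region = combine(groupby(large, :region), :amount => mean => :mean_amount)

# Sort descending by the aggregated column, then index like any table
sorted = sort(by_region, :mean_amount, rev = true)
top_region = sorted[1, :region]
```

Each step is an ordinary function call that returns a new `DataFrame`, so intermediate results can be inspected or reused without learning a separate query syntax.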

I wasn’t aware of that, looks nice!


#6

+1 for the minority opinion. Is it worth trying to add to the DataFrames docs with some examples of this, given that they currently push users into Query.jl quite quickly?


#7

Bump to this. As far as I can tell, Query is great for selecting data, but it’s difficult to create a new column while keeping the rest of the dataframe intact. My understanding is that Query is working on this, but it’s not there yet. So if you want to emulate dplyr, for now, DataFramesMeta is probably the best choice with DataFrames.
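For reference, a small sketch of the “add a column, keep everything else” operation. It assumes recent package versions (DataFrames.jl 1.x and DataFramesMeta.jl 0.10 or later); the macro syntax was different in older releases, so treat the exact spelling as an assumption. The data is made up.

```julia
using DataFrames
using DataFramesMeta

# Hypothetical toy data for illustration
df = DataFrame(item = ["x", "y"], price = [2.0, 1.5], qty = [4, 2])

# Core DataFrames.jl: transform keeps all existing columns and appends :total
with_total = transform(df, [:price, :qty] => ByRow(*) => :total)

# DataFramesMeta.jl: the dplyr-like macro form of the same operation
with_total_meta = @transform(df, :total = :price .* :qty)
```

Both calls return a new table and leave `df` itself unchanged.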


#8

Very wise words. Especially while I’m still learning the language, doing things using Base and some light wrapper functions will help me learn.


#9

@ExpandingMan I also believe there’s a lot to be said for taking a minimalist approach when possible. To @nilshg’s point, I initially assumed I would need to use other frameworks for even relatively simple data manipulations because the DataFrames documentation is quite sparse and suggests we should use Query.jl. Some more examples would be a great help to new users. I have since found this useful DataFrames tutorial, for those who are interested.

I’ll also echo @quinnj’s suggestion that a high-level document that lists currently available frameworks and compares them in terms of performance, functionality, and philosophy would be useful. A good starting point might be something like the JuliaML org on GitHub, but instead for the various data frameworks.


#10

Just to second what both @nalimilan and @quinnj wrote above.

I hope to add dplyr-like @select and @mutate operations to Query.jl in the near future. Neither of those could be done with the named tuple story we had on Julia 0.6, but they should be super simple now that we have native named tuples in Julia 1.0. In my mind that is probably the biggest usability problem in Query.jl currently.
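To illustrate why native named tuples help here, a minimal Base-only sketch of a mutate and a select over rows stored as named tuples. This is not Query.jl’s implementation, just the underlying idea; the row data and field names are hypothetical.

```julia
# Rows represented as plain named tuples (made-up example data)
rows = [(name = "a", price = 2.0, qty = 4),
        (name = "b", price = 1.5, qty = 2)]

# "mutate": merge a one-field named tuple into each row, keeping the existing fields
mutated = map(r -> merge(r, (total = r.price * r.qty,)), rows)

# "select": build a narrower named tuple from each row
selected = map(r -> (name = r.name, total = r.total), mutated)
```

Because `merge` on named tuples is part of Base from Julia 0.7/1.0 onward, adding or dropping fields needs no package-specific row type.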


#11

Could anyone make a simple ranking of their speeds and another ranking of their complexity or capabilities?
Which one would be more similar to R’s data.table?