Suggestion: move DataFrames, plotting into standard distribution

I’m working on it and getting closer to something usable :slight_smile:


Great work towards that end @davidanthoff. FWIW, the things that I notice most that keep me from the epic dplyr patterns I had in R are:

I know you’re working hard, for free, so I’m not complaining! Thanks!

Could you describe a bit more what you mean by that?

In general, I’m really not happy with the column handling right now in Query. I can fix all of that in a quite elegant way on Julia 0.7 with the native named tuples, so I’m kind of holding off on doing anything on that front until 0.7 is a real thing…
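For a flavor of what that enables, a tiny sketch with 0.7-style named tuples (just an illustration, not the actual planned design):

row = (name = "John", age = 34)               # a row as a plain named tuple
row.age                                       # 34, ordinary field access
map(r -> (r..., adult = r.age >= 18), [row])  # derive a new "column"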

The multiple @lets I also know how to fix, but it is a complicated fix… I essentially need to get rid of my reliance on type inference. I have a strategy and started the work, but don’t expect anything anytime soon, I’m afraid.

I’ve also started thinking about window functions a bit, but still need to find a design that fits nicely into the philosophy of Query.jl…

And thanks for the kind words :slight_smile:

My experience in bioscience is very similar: R’s built-in plotting and ubiquitous data frames hold everything together.

@ChrisRackauckas Well said, Sir.
Julia IS an evolving language. That’s what makes it so much fun to be involved with.
And just wait for the 1.0 release - I bet lots of organisations will surface and start using Julia.
So please let’s not have any bloat in Base. For instance, there might be some rapid development needed in DataFrames. As you say, this can be undertaken safe in the knowledge that other fields (terrible pun intended) will be unaffected.

Also, I would like to learn more about Pkg3.
The concept of having a set of 'blessed' packages is a good one.
I propose @TimHoly be elected Blesser in Chief. Nominative determinism at its best.

PS. I think the Julia packaging system is great. I know there are other similar systems out there. But as an old FORTRAN head who literally managed packages by putting cards into a hopper, and a long-term Perl user accustomed to endless runs of cpan to load in packages, the Julia system is a joy.

For instance, I just looked at Tim Holy’s GitHub page and I’m now playing with ProgressMeter after a Pkg.add("ProgressMeter"), and I didn’t even have to exit the REPL.


This specific feedback is actually very useful.

Concerning verbosity, I’ve translated the same hflights tutorial to JuliaDB here and I find the syntax reasonably concise. Can you pinpoint specific cases that could be improved, or make specific suggestions on how to do so? As a caveat, you need a very recent version of JuliaDB and IndexedTables to run the tutorial, as most syntax improvements are recent.

Concerning dplyr/ggplot2 integration via the pipeline (I’m not sure how that’d work exactly, as I’m not an R user), the closest I can think of is the @df macro from StatPlots, which is fully integrated with the Query/IterableTables framework (though I have some ideas to simplify the syntax even further when plotting from a @map statement, but I haven’t quite decided how; I’m curious what @davidanthoff has in mind). There is also GroupedErrors to make plots from data tables if you’re working with grouped data.
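For reference, a minimal sketch of the @df pattern (the data frame and column names here are made up):

using StatPlots, DataFrames

df = DataFrame(a = 1:10, b = rand(10), group = repeat(["x", "y"], 5))

# inside @df, :a, :b and :group refer to columns of df
@df df scatter(:a, :b, group = :group)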

See this announcement, though I haven’t focused on Query integration (as ShiftedArrays and Query have different missing data representations): will think about it once there is convergence.
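To illustrate lead/lag, a minimal sketch assuming ShiftedArrays.jl (shifted-out values are padded with missing):

using ShiftedArrays

v = [1, 3, 6, 10]
ShiftedArrays.lag(v)        # [missing, 1, 3, 6]
v .- ShiftedArrays.lag(v)   # deltas: [missing, 2, 3, 4]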

I’d be curious to see how to add rownumber: what does it do exactly? Do you use it inside a groupby, so that it gives you the row numbers of the group as computed inside the larger dataset?


Basically, yes. In dplyr, row_number() is the number of the row within a group, given your chosen sort order. In R, I can’t think of a place I’ve used it except for row_number() == 1, essentially to get the first member of a group. So an isfirst() would be even more useful. The lead and lag functions are useful for moving averages and deltas.

There is the opportunity to do better than R with Julia here, ultimately.

Edit: Correcting myself, I have also used row_number() to assign an ID to an item within a grouping. So in R, I’d arrange(main_id, foo, date); group_by(main_id, foo); mutate(foo_id = row_number()). That’d permanently capture the order of foos for each main_id even after I left that grouping.
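For comparison, a rough Julia analogue of that pattern (a sketch assuming a recent DataFrames; the data and column names are made up):

using DataFrames

df = DataFrame(main_id = [1, 1, 2, 2], foo = ["a", "a", "c", "c"], date = 1:4)
sort!(df, [:main_id, :foo, :date])

# number the rows within each (main_id, foo) group, like row_number()
transform!(groupby(df, [:main_id, :foo]), :date => (d -> 1:length(d)) => :foo_id)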

Actually, there are two things you can use instead in Julia (which I believe are just as convenient); you can find both in the Window functions section of this tutorial:

  1. select to choose the first few elements according to the sort order (here it’s reverse sorting by DepDelay). It’s actually a very smart function that can give the first n elements of the sorted vector without sorting the whole thing, and it accepts all of sort’s keyword arguments:
groupby(fc, :UniqueCarrier, select = (:Month, :DayofMonth, :DepDelay), flatten = true) do dd
    select(dd, 1:2, by = i -> i.DepDelay, rev = true) # select the first two elements of the sub-table according to a custom sort order
end
  2. ordinalrank from StatsBase can give you the ranking according to some custom sort order (other rankings are also available): see docs and the short sketch below.
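A minimal sketch of ordinalrank (assuming a recent StatsBase, where the ranking functions accept sort keywords such as rev):

using StatsBase

ordinalrank([30, 10, 20])              # [3, 1, 2]
ordinalrank([30, 10, 20], rev = true)  # [1, 3, 2], rank 1 is the largest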

More generally, I think we really need to start porting R data wrangling tutorials/vignettes to Julia, as a lot of the functionality is there but not everybody knows about it.

Crummy Excel spreadsheets also make referring to the first row useful, even if only for deleting it.

@piever I’ve been playing a lot with https://github.com/fredo-dedup/VegaLite.jl. The Julia syntax right now is way too verbose, but I have a couple of branches where I’m trying out alternative designs, and I’m getting closer to something that I think can compete with ggplot2. The plumbing works pretty well already; we have full iterable tables and FileIO integration (so something like load("mydata.csv") |> @filter(_.age>20) |> vlplot(:circle, x=:colA, y=:colB) |> save("figure.pdf") works on my branches).

The whole thing will (also just on a branch right now) fully integrate with https://github.com/davidanthoff/DataVoyager.jl, which should make for a really powerful data exploration tool.

I see, so far the VegaLite syntax/functionality seems very similar to StatPlots @df macro. What I was toying with (only in my mind, no code written) was whether it made sense to “merge” the @select (now @map) and the plot statement, and all the selected columns would be included in the plot with keywords corresponding to the column name (and somehow grouping columns would be grouping in the plot as well). So for example:

load("mydata.csv") |> @filter(_.age>20) |> @map({x = _.colA, y = _.colB}) |> vlplot(:circle) |> save("figure.pdf")

The main advantage would be that one can do extra processing in the @map step, for example:

load("mydata.csv") |> @filter(_.age>20) |> @map({x = _.colA, y = log(_.colB / ._colA)}) |> vlplot(:circle) |> save("figure.pdf")

(which in StatPlots we allow with the macro trick and dot broadcasting, but this new design has less duplication of work and lets Query take care of all the data-related things).

In GroupedErrors instead I have explicit @x and @y steps to choose the variables from the iterable tables and a @set_attr macro to set values of attributes according to the group (for example, if I want the line to be dashed or full according to some grouping variable, or any other attributes).

Glad to know you’re working on VegaLite syntax: it’d be really nice if that could be made more concise.


Thanks for your post about JuliaDB. I found it really helpful. :grinning:

First, I am not sure if I am asking too much, but I personally think that when there is no risk of messing things up, being able to refer to columns by variable name is easier. For example, I think x is easier than :x or df.x or _.x. Since we are doing all the data manipulation within a DataFrame (even when we need to merge multiple data frames, we can still do this if we do it in a proper order), the name of a column is sufficient to show what is going on. When there is confusion, just use :x.

Second, I am not a Computer Science guy; my personal understanding is that we use data frames so we do not need to work with arrays and all kinds of loops. My 2-cent opinion is that maybe it is easier to avoid for or while or do when working with data frames.

Third, the integration of time series functions (moving average, moving standard deviation, etc.) with data frames is also important.
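For illustration, a trailing moving average is already a small helper away (a sketch; moving_mean is a made-up name, not an existing API):

using Statistics, DataFrames

# hypothetical helper: trailing moving average with window w
moving_mean(v, w) = [mean(@view v[max(1, i - w + 1):i]) for i in eachindex(v)]

df = DataFrame(price = [1.0, 2.0, 4.0, 8.0])
df.ma2 = moving_mean(df.price, 2)   # [1.0, 1.5, 3.0, 6.0]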

Fourth, the integration of reading data, manipulating data, and visualizing data. In R, we can do something like this:


df %>% fread() %>%
    filter() %>%
    mutate() %>%
    group_by() %>%
    summarise() %>%
    ggplot()

Fifth, a small point. I just noticed that @transform in DataFramesMeta does not allow the use of computed variables immediately afterwards, which makes it less useful than dplyr::mutate.

using DataFrames, DataFramesMeta, Lazy

df = DataFrame(A = [1, 2, 3], B = [4, 5, 6])

df = @> begin
    df
    @transform(x = :A + :B, y = :x - 1)
end

ERROR: KeyError: key :x not found
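A workaround sketch (same DataFramesMeta/Lazy setup as above) is to chain two @transform calls, so that :x already exists when :y is computed:

df = @> begin
    df
    @transform(x = :A + :B)
    @transform(y = :x - 1)
end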

In R, this can be done like this:


df %<>% mutate(x = A + B, y = x - 1)

This result shows that @transform is an equivalent of dplyr::transform instead of dplyr::mutate. See the comparison table between DataFramesMeta, dplyr, and LINQ at this link:

https://github.com/JuliaStats/DataFramesMeta.jl


So this one is really, really tricky in Julia. R has the whole parent.frame() story that makes it pretty easy to create the kind of interface that you see in dplyr, but there is no equivalent to that in Julia. I’ve been racking my brain about this issue for almost two years now, and I think if we are looking for anything even vaguely generic, we’ll have to prefix column names with something in these kinds of packages. So in Query.jl that is _.foo right now in the standalone syntax. It would be nice if we could get rid of one more character, say something like ~foo instead, but of course ~ has some other meaning already, and I don’t know whether we still have a character that could be used for this case.

I’m still pretty far away from the R polish, but do take a look at the Queryverse story for this. The file IO story is pretty complete and integrates fully with the piping, and the standalone query operators also work this way. For plotting, I think I’m getting there slowly with VegaLite.jl, and at that point we should have the full pipeline that you showed in your post. But that will still take a couple of weeks/months to really work well.


Thanks for your reply. I really appreciate your and the other Julia contributors’ efforts to make Julia better. I will try your Queryverse, and I really like your VS Code Julia add-on. :grinning:

I can’t find the discussions now, but I think this was discussed in DataFramesMeta. One possibility would be to assume that bare variables refer to columns/fields by default, and use a special syntax to refer to ordinary variables. $x is a possible syntax, similar to string interpolation. Or are there technical problems with that approach?

The problem with variables referring to columns/fields by default is that something like @map(log(a)) is now weird: is log referring to the log column?

I should revisit $, thanks for pointing that out. Something like @map(log($a)) would probably work if I were to just rewrite $x into _.x… I think one of the reasons I’ve been hesitating about that kind of change is that I still have some hope that at some point I can get rid of the macros in the Query standalone commands, so I’ve tried to only introduce syntax that I think stands at least some chance of making it into Base at some point. Now, admittedly, I’ve not followed that philosophy very strictly, but it is still in the back of my mind.
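Mechanically the rewrite is straightforward; here is a toy sketch of the idea (not Query.jl’s implementation; @cols and all names are made up):

const DOLLAR = Symbol('$')

# replace every $x in the expression with row.x
rewrite(ex, row) = ex
function rewrite(ex::Expr, row)
    ex.head == DOLLAR && return Expr(:., row, QuoteNode(ex.args[1]))
    return Expr(ex.head, (rewrite(a, row) for a in ex.args)...)
end

# turn the annotated expression into an anonymous function over rows
macro cols(ex)
    body = rewrite(ex, :_row)
    return esc(:(_row -> $body))
end

f = @cols log($a) + $b
f((a = 2.0, b = 1.0))   # log(2.0) + 1.0 ≈ 1.69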

Yes, you’d have to treat function calls differently, but that shouldn’t be a problem. You’d still have to use $ when you pass a function to another function, but hopefully that’s not too common in queries?

I personally think the current behavior where Symbols are interpreted as columns is perfectly nice. $ seems even better, I have seen it used in other contexts so it should be fine here. I don’t really see why being able to refer to column names without specifying that they are columns is a big issue. I would find it annoying if the default assumption within query macros was that everything is a column name (if that could even work).

I also feel that defaulting to “every variable refers to a column” may be a bit too much, but it’d be good to have a standard way to refer to a column in these macro contexts. Right now the field is a bit split (for example, DataFramesMeta and StatPlots use :a versus _.a in Query). $a would probably be nice and reasonably unambiguous. Symbols are really not ideal in some contexts, as one could be using real symbols as well (particularly problematic in Plots calls), not to mention that they make it impossible to pass things programmatically: I can’t say col = :a; @df df plot(col), as the macro couldn’t know before runtime that there is a symbol.
_.a, instead, only makes sense to those familiar with the _ currying syntax that has yet to be added to Julia.

On a related note, I’m wondering whether it’d make sense to have an IndexedTablesMeta package:

  • IndexedTables have all type information about the columns, so no tricks would be required for good performance
  • Iterating over rows is performant already, so one could have an efficient @byrow (or base several things on a row-by-row implementation)
  • The select = ... statements could be filled in automatically by the macros, only selecting columns that were used
  • Some operations are otherwise painful to do: if I want to add a column :c as the difference of :a and :b, I need to do:
pushcol(df, :c, column(df, :a) .- column(df, :b))

as opposed to:

@transform(df, c = :a .- :b)
# or
@transform(df, c = $a .- $b)

Would it be OK with the DataFramesMeta authors if I started playing around with an IndexedTablesMeta package (openly inspired by DataFramesMeta)?