Frustrated using DataFrames

I agree to some extent. At the end of the day, any implementation of across will just construct a vector of source => fun => dest pairs. I thought @xiaodai had a package for that but now I can’t find it.

I think this is in a bit of an uncanny valley. Introductory users who have trouble with the precedence and broadcasting will like across, but once users get the hang of how to construct complicated arrays of pairs, they will just do source => fun => dest directly.
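For readers still on the introductory side of that hump, here is a tiny Base-Julia sketch (no DataFrames required) showing that broadcasting `.=>` just builds an ordinary vector of pairs, nothing more magical:

```julia
# Broadcasting `.=>` over a vector of source names produces a plain
# Vector of Pair objects -- the same kind of object DataFrames.jl
# transformation specs are made of. No special struct is constructed.
srcs = ["temp1", "temp2"]
spec = srcs .=> sum        # one source => fun pair per name

# Each element is an ordinary Pair you can index and inspect:
first(spec)                # "temp1" => sum
```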

The question is whether we should try harder to push introductory users over that hump. I’ve filed an issue here in DataFramesMeta to keep track of this feature.

1 Like

Unless you read the source code, every function is magical to some extent. A function somebody else wrote is a black-box that you can only interact with via its inputs and outputs. Whether a function accepts

names(df, r"Temp") .=> ByRow(t -> (t - 32) * 5/9)

or

Across(contains("Temp"), t -> (t - 32) * 5/9)

is mostly a matter of taste. (I prefer the second one.)

The src => transform => dst syntax has been referred to as a β€œmini-language”. What is a mini-language if not a DSL? The argument that dplyr uses a DSL and DataFrames.jl does not doesn’t really hold water in my opinion.

2 Likes

But

names(df, r"Temp") .=> ByRow(t -> (t - 32) * 5/9)

produces an object which you can inspect.

Whereas the code

across(contains("Temp"), ~ (.x - 32)*(5/9))

can’t be run on its own, right? So there is no way to separate the syntax issues from the actual operation.

EDIT: This argument is less persuasive when people start doing Cols(r"Temp") .=> [f1, f2], which will produce some sort of confusing Broadcasted object rather than a vector of pairs.

1 Like

Yes, the point is that the inner code doesn’t change its meaning through the outer code, because it’s evaluated first. Doing it differently is macro territory in Julia and clearly marked by @, but in R it’s not immediately visible. That was what I was commenting on, not how complex the function is.

1 Like

What you’re commenting on is non-standard evaluation, which does not appear anywhere in this example:

mutate(df, across(contains("Temp"), ~ (.x - 32)*(5/9)))

EDIT: Or at least there is no need for non-standard evaluation here. I can’t say for sure whether or not they are in fact using non-standard evaluation. across could be waiting to see what’s inside df…

Yes, and this is a RECOMMENDED way to debug DataFrames.jl transformation calls. Just extract the transformation specification and check that it is correct (I know it was already commented on, but it is really important).

Especially as in:

mutate(df, across(contains("Temp"), ~ (.x - 32)*(5/9)))

the across function, in order to properly evaluate contains("Temp"), MUST know that it works in the df context (which is only available in the outer mutate call).

That is why we need some special mechanics (I am adding it now) to make select(df, Not(:col) .=> fun) work, as Not(:col) β€œstand alone” is not aware of the df context. Fortunately it is doable. The only problem is that it is more complex than e.g. mat[begin, end], which works because begin/end handling is baked into Julia Base, while Not(:col) cannot be.
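To make the context issue concrete, here is a toy sketch (NOT DataFrames.jl internals) of why a stand-alone selector must be resolved against the table’s columns before concrete pairs can exist; `ToyNot` and `resolve` are made-up names for illustration:

```julia
# Toy sketch: a stand-alone selector like Not(:col) carries no column
# information by itself, so it has to be resolved against the table's
# column names (the "df context") before concrete pairs can be built.
struct ToyNot
    col::Symbol
end

# Resolve the selector in the context of a known set of columns.
resolve(cols::Vector{Symbol}, sel::ToyNot) = [c for c in cols if c != sel.col]

cols = [:a, :b, :c]
spec = resolve(cols, ToyNot(:a)) .=> sum   # [:b => sum, :c => sum]
```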

1 Like

But it still can’t be used outside of dplyr verbs for debugging.

r$> across(contains("Temp"), ~ (.x - 32)*(5/9))                            
Error: `across()` must only be used inside dplyr verbs.
Run `rlang::last_error()` to see where the error occurred.

I get a somewhat different error message on my machine, but same idea:

> contains("temp")
Error: `contains()` must be used within a *selecting* function.
β„Ή See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
Run `rlang::last_error()` to see where the error occurred.

The point is, there is absolutely nothing wrong with this syntax in Julia:

transform(df, Across(contains("Temp"), t -> (t - 32) * 5/9))

Across would create an object of type Across which could be used for dispatch and for special printing. We can’t do special printing on an array of pairs because that would be type piracy.

I agree with you that Across would be nice to have. And think it should probably go in a package that is re-exported by DataFramesMeta.

This is getting into the details a bit, but I think adding a new Across object with a new method for transform would be overkill. transform is complicated enough as-is. It should still construct pairs, since that’s easier to debug inside the transform call.

Then it would have to be in DataFramesMeta.jl (as it must receive the data frame context to construct pairs).

What @CameronBieganek proposes is doable also. If we agree we could add Across to DataFrames.jl. The point is that:

  • if we want it we will have the same level of complexity no matter where we add it
  • it does not add much complexity in DataFrames.jl, as it can be handled in pre-processing (i.e. Across would not slip into the inner processing functions; it would be resolved at the same time when we merge arrays and scalars specifying transformations into one big vector of requested transformations); this is exactly the same place where Not(:a) .=> fun will be handled. Here, as a second thought: if we add Across, maybe we do not need the broadcasting form of Not(:a) .=> fun? (Or do we want both? CC @nalimilan )

readable is in the eye of the beholder. In my opinion this is complete garbage that looks like Perl line noise, and I have no idea what it actually does (and therefore how to compose it with other things etc).

In any case, for transformations like this, I much prefer a simple (and in Julia, highly performant) loop:

for colname in ...
   newname = ...
   df[:,newname] = calcsomething(df[:,colname])
end

It is completely transparent what is going to happen here.
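For completeness, a runnable version of that loop with the blanks filled in; it uses a Dict of column vectors so it needs only Base Julia, but with a DataFrame the loop body is analogous (the temperature conversion and `_celsius` suffix are just example choices):

```julia
# Runnable stand-in for the sketch above: a Dict of column vectors.
# With a DataFrame, df[:, newname] = calcsomething(df[:, colname])
# plays the same role as the Dict assignment below.
df = Dict("temp1" => [70.0, 71.0], "temp2" => [80.0, 81.0])

for colname in filter(contains("temp"), collect(keys(df)))
    newname = colname * "_celsius"
    df[newname] = (df[colname] .- 32) .* (5 / 9)   # Fahrenheit -> Celsius
end
```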

3 Likes

Yeah, it’s easy to get wrapped up in the fancy new syntaxes and forget that a simple for loop works just as well. And as you mentioned, the for loop might be more understandable in many cases.

1 Like

I also want to stress that with the @aside macro, you can do this kind of loop inside the @chain block. This is something you can’t do in dplyr, and is probably the cause of some of its more complicated all-in-one-call syntaxes.

julia> using DataFramesMeta;

julia> df = DataFrame(rand(20, 10), :auto);

julia> @chain df begin 
           @rtransform :y = :x1 + :x2
           @aside begin 
               nms = names(df, Between(:x5, :x9))
               for n in nms
                   @rtransform! _ $n = $n + 100
               end
               _
           end
           @rtransform :z = :x8 + :x9
       end
5 Likes

Thanks for mentioning that; I had completely forgotten that some Base methods still did not have overloads for StructArrays. I’m adding that in #206, so filter! should work from the next release on.

I confess I also use StructArrays as a basic table implementation, but I wouldn’t recommend it for general use. Still, it is probably quite handy when you want to run some operation on a few columns and want to optimize performance.

4 Likes

To be consistent we would actually need

transform(df, Across(contains("Temp"), t -> (t .- 32) * 5/9), renamecols=false)

I think a more natural addition would be Matching as in

transform(df, Matching("Temp") .=> (t -> (t .- 32) * 5/9), renamecols=false)

and we could also extend mapcols as in

mapcols(t -> (t .- 32) * 5/9, df, Matching("Temp"))
2 Likes

That is very helpful and probably also something to add to the documentation. I thought source => fun => dest created some kind of magic custom struct, so the inputs could be equally magic. This example helps illustrate what is going on and guides what the inputs should be. (Although I am surprised that the inner pair is the rightmost one).

2 Likes

I think piecewise inspection of source => fun => dest is the crux of the issue. Any syntax which doesn’t work outside the context of the DataFrame makes debugging harder. If source contains Not(), Cols(), r"col", :col, etc. then you need to call names(df, _) for that to stand alone. Similarly, if fun contains ByRow() or any of the above indexing, then it also cannot be tested alone. I would actually be against adding Across because I think it adds to this problem. Broadcasting is not difficult if I understand the types on both sides. I would rather see the effort go toward easier inspection of source, fun, dest individually and combined.

I could make a counterargument that I shouldn’t have to type df twice in transform(df, names(df, Not(:col)) .=> fun). However, I would rather the base package be composable and the meta package help eliminate redundancy.

3 Likes

This is one key thing that Julia has going for it. There is no magic. Anything not subject to @ is evaluated as base Julia objects. Within a macro the syntax is still base Julia, but the code being evaluated is the output of the macro, so either read the docs for the macro, or use @macroexpand.
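A minimal illustration of that point, using a made-up macro (`@double` is hypothetical, defined here only for the demo):

```julia
# A toy macro: @double rewrites its argument into ordinary Julia code.
macro double(ex)
    :( 2 * $(esc(ex)) )
end

@double(3 + 4)                    # evaluates like 2 * (3 + 4), i.e. 14

# @macroexpand reveals the code the macro generated, so nothing stays hidden:
ex = @macroexpand @double(3 + 4)  # an ordinary Expr you can print and inspect
```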

It would be nice if the DataFrames pairs syntax could handle a renaming function in the third position, so that the names of the new columns can be determined programmatically. It’s currently possible to do this if you calculate the new names beforehand, like this:

cols = names(df, contains("temp"))
new_cols = cols .* "_celsius"
transform(df, cols .=> (t -> (t .- 32) .* 5/9) .=> new_cols)

However, if you have to calculate the source names and the destination names beforehand, it kind of defeats the purpose of the concise pairs syntax. Adding a renaming function is something that an Across type could easily handle. For example:

julia> df = DataFrame(temp1 = 70:71, temp2 = 80:81)
2Γ—2 DataFrame
 Row β”‚ temp1  temp2 
     β”‚ Int64  Int64 
─────┼──────────────
   1 β”‚    70     80
   2 β”‚    71     81

julia> transform(df,
           Across(contains("temp");
               apply = t -> (t - 32) * 5/9,
               renamer = col -> col * "_celsius"
           )
       )
2Γ—4 DataFrame
 Row β”‚ temp1  temp2  temp1_celsius  temp2_celsius 
     β”‚ Int64  Int64  Float64        Float64       
─────┼────────────────────────────────────────────
   1 β”‚    70     80        21.1111        26.6667
   2 β”‚    71     81        21.6667        27.2222

See implementation details below. Note that I’ve made the applied function act by row, just to make things easier on myself.

An additional benefit of Across here is that it can be saved as a reusable object, acr = Across(...), because it makes no reference to the specific column names in df. Note that the cols => (t -> (t - 32) * 5/9) => new_cols object from the first example is not reusable because it refers to specific columns in df.

For fun, I’ve also implemented a preview function that previews what Across will do:

julia> preview(df,
           Across(contains("temp");
               apply = t -> (t - 32) * 5/9,
               renamer = col -> col * "_celsius"
           )
       )
2Γ—3 DataFrame
 Row β”‚ source  transformation  destination   
     β”‚ String  var"#14#16"     String        
─────┼───────────────────────────────────────
   1 β”‚ temp1   #14             temp1_celsius
   2 β”‚ temp2   #14             temp2_celsius

Unfortunately anonymous functions don’t print very nicely, which is why we have #14 in the transformation column. If you use a named function, it prints nicer:

julia> plus_one(x) = x + 1
plus_one (generic function with 1 method)

julia> preview(df, Across(contains("temp"); apply=plus_one))
2Γ—3 DataFrame
 Row β”‚ source  transformation  destination    
     β”‚ String  #plus_one…      String         
─────┼────────────────────────────────────────
   1 β”‚ temp1   plus_one        temp1_plus_one
   2 β”‚ temp2   plus_one        temp2_plus_one
Minimal Implementation
using DataFrames

struct Across{S, F, R}
    selector::S
    f::F
    renamer::R
end

function Across(selector; apply, renamer = col -> col * "_" * string(apply))
    Across(selector, apply, renamer)
end

function DataFrames.transform(df::AbstractDataFrame, across::Across)
    selector, f, renamer = across.selector, across.f, across.renamer
    newdf = copy(df)

    cols = names(newdf, selector)
    for col in cols
        newdf[:, renamer(col)] = f.(newdf[:, col])
    end

    newdf
end

function preview(df::AbstractDataFrame, a::Across)
    cols = names(df, a.selector)
    DataFrame(
        source = cols,
        transformation = a.f,
        destination = a.renamer.(cols)
    )
end
3 Likes

I’m sure there is a good reason, but why not use methods transform(df, source, fun, dest), transform(df, source, fun), etc. instead? Then the error messages could be more specific.