Frustrated using DataFrames

To allow multiple arguments. Plus there are exceptions like nrow, which can stand alone; handling those cases might create a lot of method ambiguities.
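For example, here is how nrow behaves in current DataFrames.jl, both standalone and as the source of a pair:

using DataFrames

df = DataFrame(a=1:3)

combine(df, nrow)         # nrow stands alone; the result column is named :nrow
combine(df, nrow => :n)   # or paired with a destination name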

@Nathan_Boyer makes a good point. If source, fun, and dest were keyword arguments, I would guess that the internal logic in select/transform would be approximately the same as it is right now.

@bkamins @nalimilan Could it be something like

transform(df, r"temp" => ByValue(t->((t-32)*5/9)) => (c->c*"celsius"))
  • ByValue is like ByRow, but the function receives a value instead of a row (a rough sketch follows after this list).
  • The third component of the pair is a renamer function.
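For concreteness, here is a minimal sketch of what such a ByValue wrapper might look like; neither the name nor the type exists in DataFrames.jl, this is purely the proposal spelled out:

# Hypothetical ByValue wrapper, sketched by analogy with ByRow:
# broadcast the wrapped function over the selected column(s), so the
# function sees one value at a time rather than a whole row.
struct ByValue{F}
    fun::F
end

(b::ByValue)(cols...) = b.fun.(cols...)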

With pairs, we can do

transform(df, 
  :a => :b => :c,
  :d => :e => :f,
  :g => :h => :i,
)

which we can’t do with keyword arguments.

1 Like

Good point. Unfortunately I’ve been coding in Matlab and Python lately, so I’m a little rusty on DataFrames.jl details.

However, it could probably be handled with the right API. This may not be elegant, but it would work:

df = DataFrame(a=1:2, b=3:4, c=5:6)

transform(df,
    source = [(:a, ), (:b, :c)],
    fun = [x -> 2x, (x, y) -> x + y],
    dest = [:d, :e]
)

This would apply x -> 2x to column :a and (x, y) -> x + y to columns :b and :c.
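For reference, the same two transformations written in the existing src => fun => dst minilanguage would be (reusing the df defined above):

transform(df,
    :a => (x -> 2x) => :d,                 # one source column
    [:b, :c] => ((x, y) -> x + y) => :e,   # two source columns
)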

One advantage is that it’s a lot more natural to spread keyword arguments over multiple lines than it is to spread a double pair over multiple lines.

I think this would be really cool, but as a function that makes pairs rather than as added keyword arguments to transform.

make_pair(source = [:a, :b], fun = f, dest = AsTable)

etc. Then you can do

transform(df, make_pair(...))

This could probably live in the same package as Across and friends.
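A minimal sketch of such a make_pair helper (the keyword names follow the proposal above; nothing like this exists in DataFrames.jl itself):

# Hypothetical helper that rewrites keyword arguments into the
# standard src => fun => dst pair understood by select/transform.
make_pair(; source, fun, dest) = source => fun => dest

# Usage sketch:
# transform(df, make_pair(source = [:b, :c], fun = (x, y) -> x + y, dest = :e))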

1 Like

Using a for loop is great if I only have one transformation to do, but what if I had ten other transformations I wanted to perform on the data? Doing something like this:

using DataFrames, Chain

df = DataFrame(Time = [3, 4, 5, 6], TopTemp = [70, 73, 100, missing], BottomTemp = [50, 55, 80, 90])

fahrenheit_to_celsius(t) = Int(round((t - 32) * 5 / 9))

result = @chain df begin
    dropmissing
    filter(row -> row.TopTemp < 90, _)
    transform!(names(df, r"Temp") .=> ByRow(fahrenheit_to_celsius), renamecols=false)  # overwrite the Fahrenheit columns in place
    transform!([:TopTemp, :BottomTemp] => (-) => :DiffTemp)
    #etc...
end

is a lot easier to read, eliminates temporary variables, and keeps all of the data wrangling you performed self-contained. I can’t speak to the performance compared to loops, but I would imagine there would only be a minor cost, especially if you’re using in-place transformations.

See my post above. You can use a for loop inside a @chain block easily with the @aside macro-flag.
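For example, a minimal sketch of that pattern (column names reuse the temperature example above):

using DataFrames, Chain

df = DataFrame(TopTemp=[70, 73], BottomTemp=[50, 55])

fahrenheit_to_celsius(t) = (t - 32) * 5 / 9

result = @chain df begin
    copy
    @aside for col in ["TopTemp", "BottomTemp"]  # a plain for loop, run for its side effects
        transform!(_, col => ByRow(fahrenheit_to_celsius) => col)
    end
end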

1 Like

That’s actually really cool; it seems like you can have the best of both worlds in Julia :grin:

3 Likes

ByRow is a fully standalone thing, unrelated to the DataFrame object. Simplifying a bit (reducing to the single-column case), you can think of ByRow(fun) as x -> fun.(x).
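A quick illustration of that mental model; in current DataFrames.jl a ByRow object is itself callable on plain vectors, which is essentially how it is applied internally:

using DataFrames

ByRow(sqrt)([1, 4, 9])      # behaves like sqrt.([1, 4, 9])
ByRow(+)([1, 2], [10, 20])  # with several columns, like (+).([1, 2], [10, 20])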

1 Like

Yes, we are aware of this limitation. It is on the roadmap. If you open an issue for it, it will probably get a higher priority :smiley:. Thank you!

It is certainly doable. The only reservation is how much we want to complicate the minilanguage, since it is already complex (and that is part of the reason this thread was started). Can you please open an issue so we can discuss it there?

It should be relatively easy to make a companion package investigating this option (wrapper functions would rewrite the specifications passed to the standard src => fun => dst style).

2 Likes

Out of curiosity, what is confusing about well-established terms like mutate, across, etc.? As far as I can read it, your argument spirals down to saying that everyone should be coding in assembly. Your : and other special signs are no less magical than the other syntax, right? Higher levels of abstraction are not always a bad thing in my mind, so I just want to understand where you’re coming from here.

The biggest problem with R is that no one ever knows what anything actually means, because of “nonstandard evaluation”.

When I see something like contains("Temp"), I think “that’s a function call on the string "Temp"; what does it evaluate to?” But it’s NOT a function call on the string "Temp". I actually don’t have the slightest idea what it is. It’s really some macro-ish magic whose value depends on the context in which it appears.

Now look at across(…). That also looks like a function called on whatever contains("Temp") returns and whatever ~(.x - 32)*(5/9) means. But of course, it’s not that either.

And what does ~(.x - 32)*(5/9) mean? What is the significance of the symbol .x? Does this evaluate to a thing? And how does that thing work?

So it’s not that I disagree with abstraction, it’s more that I disagree with incredibly obfuscated semantics. The beauty of Julia is that the semantics are usually very clear, and the places where the semantics are different are clearly delineated by @ macro calls.

I think the @chain macro is quite nice, its semantics are clear, and it offers a lot of useful abstraction. I like the functional API with filter, transform, and so forth in Julia, again because the semantics are clear.

13 Likes

I just wanted to add that with the awesome 1.3.0 release of DataFrames, the temperature example can now be written

df = DataFrame(Time=[3, 4, 5], TopTemp=[70, 73, 100], BottomTemp=[50, 55, 80])

transform(df, Cols(r"Temp") .=> (t->(t.-32)*5/9), renamecols=false)

# Output
3×3 DataFrame
 Row │ Time   TopTemp  BottomTemp 
     │ Int64  Float64  Float64    
─────┼────────────────────────────
   1 │     3  21.1111     10.0
   2 │     4  22.7778     12.7778
   3 │     5  37.7778     26.6667

and @CameronBieganek’s request for renaming columns has been implemented, so we can write

transform(df, Cols(r"Temp") .=> (t->(t.-32)*5/9) .=> (n->n*"_celsius"))

# Output
3×5 DataFrame
 Row │ Time   TopTemp  BottomTemp  TopTemp_celsius  BottomTemp_celsius 
     │ Int64  Int64    Int64       Float64          Float64            
─────┼─────────────────────────────────────────────────────────────────
   1 │     3       70          50          21.1111             10.0
   2 │     4       73          55          22.7778             12.7778
   3 │     5      100          80          37.7778             26.6667
18 Likes

Looks like I need to read the release notes. Cool stuff!

Everyone was remarkably patient with this except for some criticism of R.

Regardless of language, there are problems with these ad hoc containers layered onto the real datatypes. APIs for doing the same thing (equivalent semantics) have no similarity. Performance is hard to predict.

It’s really time for a heterogeneous matrix: columns can be of different types as long as every element within a column has the same type. Memory access from such a matrix must be slower than from a homogeneous matrix, but every memory location can still be calculated. Then you could just treat this mythical being like a matrix (2D array).

The closest you can get to array-style manipulation of a collection of dissimilar columns in Julia today is with TypedTables.jl, which no one has mentioned. I use typed tables to hold simulation data with 16 columns and up to 8 million rows. There are some weirdnesses because the structure is immutable, but generally it’s a collection of vectors.

Indeed, type-stable, performant collections are widely useful. Julia is flexible enough to have several reasonably popular implementations of this concept: for example, there is StructArrays in addition to the TypedTables you mention. They basically have the same layout, and both implement the Tables interface and can be used in generic tabular functions. For the “inverse” row-based layout, there is the built-in Vector of NamedTuples, which is also both a table and a regular array.
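For instance, a minimal sketch of the three layouts mentioned:

using TypedTables, StructArrays

# Column-based layouts: one vector per column, rows materialized on demand.
t = Table(a = [1, 2, 3], b = [4.0, 5.0, 6.0])          # TypedTables.jl
s = StructArray((a = [1, 2, 3], b = [4.0, 5.0, 6.0]))  # StructArrays.jl

t[1]   # a NamedTuple row: (a = 1, b = 4.0)
s.b    # the underlying column vector

# Row-based layout: a plain Vector of NamedTuples is also a Tables.jl table.
v = [(a = 1, b = 4.0), (a = 2, b = 5.0)]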

Thanks.

I’ll take a look at StructArrays. The one problem with the named tuple approach of TypedTables is the horrendous type definition that results.
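To illustrate the type-signature point (the printed form may differ slightly across TypedTables.jl versions):

using TypedTables

t = Table(a = [1, 2], b = [3.0, 4.0])
typeof(t)
# Table{NamedTuple{(:a, :b), Tuple{Int64, Float64}}, 1,
#       NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}}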