Frustrated using DataFrames

That is pretty clean to read and reason through. I will have to get used to df just floating there in the middle and the interpolated variables, but that’s not unique to DataFramesMeta.jl. I’m still afraid of BenchmarkTools.jl too.

2 Likes

Does @subset! from DataFramesMeta essentially replace filter! from DataFrames? Why do the names differ?

The names differ for a few reasons

  1. subset acts on full columns, while the operations on filter act on rows. We can’t make filter act on columns because a DataFrame is conceptualized as a collection of rows, where possible. The contract of Base.filter is that the operation f in filter(f, x) will act on elements of x.
  2. Similarly, we want operations on a GroupedDataFrame to be split-apply-combine style transformations. Since a GroupedDataFrame iterates through sub-dataframes, to keep a contract with Base.filter, we would need filter to return only some groups.

Therefore we need a different function with it’s own name. This gives us the flexibility to have consistent sub-setting behavior with transform and select.

@Nathan_Boyer: note that in DataFrames.jl we have both filter! and subset!. In general it would be great if you put your concerns in an issue in DataFrames.jl and I will think how to handle them (probably via docs updates).

I already have opened https://github.com/JuliaData/DataFrames.jl/issues/2866.

2 Likes

Not really,

df = DataFrame(Time = [3., 4., 5.], TopTemp = [70., 73., 100.], BottomTemp = [50., 55., 80.])
df[!, :id] = 1:nrow(df)
sdf = stack(df)

usdf = @sqldf """
           select variable,
           (case when variable like '%Temp' then (value-32)/1.8 else value end) as value,
           id
           from sdf
       """

unstack(usdf)

3Γ—4 DataFrame
 Row β”‚ id     Time      TopTemp   BottomTemp 
     β”‚ Int64  Float64?  Float64?  Float64?   
─────┼───────────────────────────────────────
   1 β”‚     1       3.0   21.1111     10.0
   2 β”‚     2       4.0   22.7778     12.7778
   3 β”‚     3       5.0   37.7778     26.6667

However, in general I would say we should not expect DataFrames.jl or any equivalent package to offer direct solutions to all possible transformations we need.

In my case I follow these guidelines when it comes to transformations:

  1. Simple β†’ DataFrames.jl
  2. Complex but SQL simple β†’ DataFrames.jl with SQLdf.jl
  3. Complex and Beyond SQL simple β†’ Julia Code
1 Like

I want to push back on this.

They key feature of DataFrames is that all of it’s columns are just plain old normal Julia vectors. Everything you can do with Julia vectors you can do in DataFrames. The source => fun => dest syntax (or alternatively, DataFramesMeta.jl) provides type-stable, performant ways to work with these columns.

But there is no different in performance between working with DataFrames and plain Julia vectors. Additionally, as long as transformations go in a function, you can define as complicated a function you want fun without ever thinking about DataFrames.jl, and then use it with source => fun => dest syntax.

This is not necessarily true when working with dplyr or Pandas.

Additionally, the macro @with in DataFrames.jl creates an anonymous function which acts on the DataFrames columns and executes it.

Consider the following

@with df begin 
    z = :x + :y 
    t = mean(z) * 100
end

Though @with acts on a DataFrame, the actual code it produces and executes has nothing to do with any implementation of DataFrames.jl.

3 Likes

I will work on that.

In general:

  1. Docs are hard to read for those whose data does not benefit from split-apply-combine. (My elements are all unique, so there is nothing to split or groupby.)
  2. I have trouble figuring out what is and isn’t needed/allowed inside the function syntax: ByRow() , eachcol() , names() , Cols() , : , .=> , .() , ... etc.
  3. General confusion about whether a function operates on a row, column, elements of a row, or elements of a column.
  4. There is no way to test if an argument is valid or test individual pieces of an argument. I would like to be able to print, check type, save to variable, etc. the input and output from each of x => y => z; maybe even deeper if y becomes complicated with nesting.
  5. Multi-column broadcasting is limited.
julia> filter!(names(df, Not(:node)) .=> ByRow(row -> any(x -> x>0, row)), df)
ERROR: MethodError: no method matching !(::Vector{Pair{String, ByRow{var"#204#206"}}})

julia> transform!(df, r"Temp" .=> ByRow.(fahrenheit_to_celsius), renamecols = false)
ERROR: LoadError: MethodError: no method matching fahrenheit_to_celsius(::Int64, ::Int64)
  1. Function broadcasting is limited.
julia> filter(x -> ismissing(eachcol(x)), df)
ERROR: MethodError: no method matching eachcol(::DataFrameRow{DataFrame,DataFrames.Index})

julia> filter(x -> ismissing.(x), df)
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved

Note:
I would like to emphasize again that my initial assumption was either:

  • I am reasoning about this whole thing wrong.
  • DataFrames is built for a different type of data processing and I should be using something else.

Not DataFrames is broken and I’m mad about it.

1 Like

Oh, I myself I think I would use Julia code in this case, I was just showing an example with no loops. However, when it comes to transformations performance is not always relevant but rather accuracy and simplicity.

I do think a lot of the problems could be solved by changing the way you reason about these calls. More emphasis should be placed on creating a valid source => fun => dest object.

source is a collection of names, i.e. Strings or Symbols

fun, is just a function. ByRow(fun) is a wrapper function, i.e. f is a function and ByRow(f) is a function. Since a function is not iterable, ByRow.(f) makes no sense.

dest is most commonly a single column name, like a String or Symbol. But can also be a collection of Strings or Symbols.

.=>:

Similarly, r"Temp" is a regular expression. Because regular expressions are not iterable, r"Temp" .=> doesn’t make sense.

Since functions are not iterable, we know we have the following behavior

julia> foo(x) = x;

julia> [:a, :b] .=> foo
2-element Vector{Pair{Symbol, typeof(foo)}}:
 :a => foo
 :b => foo

the fun will get repeated.

This is all super complicated and requires a lot of pretty detailed knowledge of Julia’s broadcasting rules. In no way is this meant to answer your specific questions or invalidate your frustrations, which definitely make sense.

I mean only to show that a validation function (which Bogumil has already filed an issue for) might go a long way. If I were to give advice about this, I would say you should put more effort into understanding the way common Julia idioms, like broadcasting and regular expressions, are used in the construction of the source => fun => dest pairs.

3 Likes

This is a good point. DataFrames.jl could benefit from looking at dplyr’s functions for this: Apply a function (or functions) across multiple columns β€” across β€’ dplyr

3 Likes

Does Not(:node) create an iterable vector? That is another element I wish I could have inspected for type, but it only makes sense in the context of the DataFrame call.

No, Not(:node) does not create an iterable vector. And neither do Between or Cols at the moment. Or, it was just fixed on master and I don’t know when the next release will be. For now use names(df, Not(:node)).

This is definitely frustrating. I’m glad the JuliaData maintainers have fixed it though. It will make it so people don’t experience that frustration in the future.

In general, I would say, make the pair first, especially if it’s complicated, and then inspect the result with print. DataFrames is not doing any special evaluation, so if the pair you constructed looks bad, it probably won’t work.

The following is a complicated pair, but you can see pretty clearly that it’s doing what you want.

julia> p1 = [:a, :b] .=> f .=> [:z1, :z2]
2-element Vector{Pair{Symbol, Pair{typeof(f), Symbol}}}:
 :a => (f => :z1)
 :b => (f => :z2)
1 Like

Although there are lots of great macros and other tricks, sometimes it’s easy to just handle stuff using regular iteration and higher order functions like map, filter. Also remember you can create a column by assigning to it

df.hasmissing = map(x->any(ismissing,x),eachrow(df))

Convert temperatures:

df.tempC = map(ftoc, df.tempF)

Filter out rows that are all zero:

filter(x -> all(==(0),x),foo)
4 Likes

I equally find DataFrames frustrating at times, especially coming from R / tidyverse. Take the second example you gave. In R this would be one very readable line of code:

mutate(df, across(contains("Temp"), ~ (.x - 32)*(5/9)))

No need to worry about calling the names function, regular expressions, creating a separate fahrenheit_to_celsius function, or giving a rename option.

3 Likes

Yes it’s readable because it’s not a lot of code, but my questions would be what does contains("Temp") return, is that composable or is it a nonstandard evaluation kind of thing, also what does ~ do in this context and what’s .x. Just pointing out that this is a DSL that’s also not understandable with just R knowledge.

5 Likes

What @jules says above together with

are key to me: DataFrames has built an infinitely composable mini-language based on standard Julia syntax. Admittedly, because it is so powerful and composable, it requires good knowledge of how Julia works and a good mental model of what is happening in the mini-language, but it does not require learning a new language/DSL.

3 Likes

I agree R syntax is usually problematic, but in this case I think it wins. What does it matter what the contains function returns? All I need to know is that it selects columns that contain a string I provide (you could also swap in the endswith function in this case). As for the ~ and .x they are useful but kind of ugly shorthand for an anonymous function. I prefer Julia’s x -> f(x) syntax but you could just as easily provide a named function instead. :man_shrugging:t2:

I would argue that this DSL is more intuitive as I read the code: β€œmutate this data frame across columns that contain β€œTemp” with this provided function”.

2 Likes

I made some more benchmarks with larger vectors and added StructArray:

julia> node = rand(Int8, 10000); x = rand(Int8, 10000); y = rand(Int8,10000);
julia> sa = StructArray(node = node, x = x, y = y)
julia> df = DataFrame(node = node, x = x, y = y)

julia> @benchmark $sa[($sa.x .== 0) .| ($sa.y .== 0),:]
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.542 ΞΌs … 494.832 ΞΌs  β”Š GC (min … max): 0.00% … 94.94%
 Time  (median):     3.756 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.266 ΞΌs Β±  10.912 ΞΌs  β”Š GC (mean Β± Οƒ):  7.69% Β±  3.04%

   β–„β–‡β–ˆβ–ˆβ–‡β–‡β–†β–„β–ƒβ–ƒβ–„β–„β–„β–ƒβ–‚β–‚β–‚β–β–‚β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–                          β–‚
  β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–‡β–…β–†β–…β–…β–…β–†β–†β–†β–„β–…β–…β–…β–…β–…β–† β–ˆ
  3.54 ΞΌs      Histogram: log(frequency) by time      5.82 ΞΌs <

 Memory estimate: 6.02 KiB, allocs estimate: 6.

julia> @benchmark $df[($df.x .== 0) .| ($df.y .== 0),:]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.897 ΞΌs …  3.326 ms  β”Š GC (min … max): 0.00% … 99.04%
 Time  (median):     12.627 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   13.714 ΞΌs Β± 33.332 ΞΌs  β”Š GC (mean Β± Οƒ):  2.40% Β±  0.99%

   β–„β–‡β–ˆβ–ˆβ–‡β–†β–…β–ƒβ–‚β–‚β–‚β– ▁▃▅▅▅▄▄▃▂▁ ▁▂▂▂▂▂▁▁                           β–ƒ
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–†β–†β–„β–β–„β–…β–…β–†β–†β–…β–„β–„β–ƒβ–ƒβ–… β–ˆ
  11.9 ΞΌs      Histogram: log(frequency) by time      20.4 ΞΌs <

 Memory estimate: 7.92 KiB, allocs estimate: 25.

julia> @benchmark @subset($df, ((:x .== 0) .| (:y .== 0)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  64.454 ΞΌs …  5.446 ms  β”Š GC (min … max): 0.00% … 93.48%
 Time  (median):     66.507 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   69.103 ΞΌs Β± 75.999 ΞΌs  β”Š GC (mean Β± Οƒ):  1.48% Β±  1.33%

    β–β–ˆβ–‡β–                                                       
  β–β–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  64.5 ΞΌs         Histogram: frequency by time        89.7 ΞΌs <

 Memory estimate: 21.28 KiB, allocs estimate: 170.

It looks like StructArrays is a clear winner at least for speed.

It’s certainly intuitive to read for a lot of people. Some people value a β€œfrom first principles” approach higher, where hopefully nothing works β€œmagically” but everything can be derived from the simple building blocks of the language. That’s what base DataFrames does, but of course it trades off syntactic simplicity. I agree that the macro packages DataFramesMeta / DataFrameMacros might take another look at this across scenario, maybe there are improvements to be made there.

3 Likes

As others have said, it’s a bit of a special case that applying a function to multiple columns is currently not covered by the macro packages.

But I still think DataFrames.jl wins here because you don’t need to learn all this fancy stuff (whether ByRow and => and cpycols in Julia, or accross, contains, ~ and .x for R) to get the job done. You can just write:

df[:, r"Temp"] = (df[:, r"Temp"] .- 32) .* (5/9)

Sprinkle in some ! and . for performance, but that’s not necessary.

(To be honest there is one non obvious trick to know if the original columns where Int: you cannot put Float64 values in tem so you need to write df[!, r"Temp"] = ... to replace the columns with the right hand side, rather than overwrite them.)

4 Likes