That is pretty clean to read and reason through. I will have to get used to `df` just floating there in the middle and the interpolated variables, but that's not unique to DataFramesMeta.jl. I'm still afraid of BenchmarkTools.jl too.
Does `@subset!` from DataFramesMeta essentially replace `filter!` from DataFrames? Why do the names differ?
The names differ for a few reasons:

- `subset` acts on full columns, while the operations in `filter` act on rows. We can't make `filter` act on columns because a DataFrame is conceptualized as a collection of rows, where possible. The contract of `Base.filter` is that the operation `f` in `filter(f, x)` will act on elements of `x`.
- Similarly, we want operations on a `GroupedDataFrame` to be split-apply-combine style transformations. Since a `GroupedDataFrame` iterates through sub-dataframes, to keep the contract with `Base.filter`, we would need `filter` to return only some groups.

Therefore we need a different function with its own name. This gives us the flexibility to have consistent sub-setting behavior with `transform` and `select`.
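A small illustration of the difference (the column names are invented for the example):

using DataFrames

df = DataFrame(node = 1:4, x = [0, 1, 0, 2])

# filter: the function receives the elements of the DataFrame, i.e. its rows
filter(row -> row.x > 0, df)

# subset: the pair targets a whole column; ByRow applies the predicate
# element-wise down that column
subset(df, :x => ByRow(>(0)))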
@Nathan_Boyer: note that in DataFrames.jl we have both `filter!` and `subset!`. In general it would be great if you put your concerns in an issue in DataFrames.jl and I will think about how to handle them (probably via docs updates).
I already have opened https://github.com/JuliaData/DataFrames.jl/issues/2866.
Not really,
df = DataFrame(Time = [3., 4., 5.], TopTemp = [70., 73., 100.], BottomTemp = [50., 55., 80.])
df[!, :id] = 1:nrow(df)
sdf = stack(df)
usdf = @sqldf """
select variable,
(case when variable like '%Temp' then (value-32)/1.8 else value end) as value,
id
from sdf
"""
unstack(usdf)
3×4 DataFrame
 Row │ id     Time      TopTemp   BottomTemp
     │ Int64  Float64?  Float64?  Float64?
─────┼───────────────────────────────────────
   1 │     1       3.0   21.1111     10.0
   2 │     2       4.0   22.7778     12.7778
   3 │     3       5.0   37.7778     26.6667
However, in general I would say we should not expect DataFrames.jl or any equivalent package to offer direct solutions to all possible transformations we need.
In my case I follow these guidelines when it comes to transformations:
- Simple → DataFrames.jl
- Complex but SQL-simple → DataFrames.jl with SQLdf.jl
- Complex and beyond SQL-simple → Julia code
I want to push back on this.
The key feature of DataFrames is that all of its columns are just plain old normal Julia vectors. Everything you can do with Julia vectors you can do in DataFrames. The `source => fun => dest` syntax (or, alternatively, DataFramesMeta.jl) provides type-stable, performant ways to work with these columns.
But there is no difference in performance between working with DataFrames and plain Julia vectors. Additionally, as long as the transformation goes in a function, you can define as complicated a function `fun` as you want without ever thinking about DataFrames.jl, and then use it with the `source => fun => dest` syntax.
This is not necessarily true when working with dplyr or Pandas.
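To illustrate that last point about plain functions, here is a minimal sketch (the function and column names are invented for the example):

using DataFrames

# a plain Julia function on vectors, written without DataFrames.jl in mind
avg_temp(top, bottom) = (top .+ bottom) ./ 2

df = DataFrame(TopTemp = [70.0, 73.0], BottomTemp = [50.0, 55.0])

# the two source columns are passed to avg_temp as positional arguments
transform(df, [:TopTemp, :BottomTemp] => avg_temp => :AvgTemp)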
Additionally, the macro `@with` in DataFramesMeta.jl creates an anonymous function which acts on the DataFrame's columns and executes it. Consider the following:
@with df begin
    z = :x + :y
    t = mean(z) * 100
end
Though `@with` acts on a `DataFrame`, the actual code it produces and executes has nothing to do with any implementation of DataFrames.jl.
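Roughly, the idea is something like this (a sketch of the concept, not the macro's literal expansion):

using DataFrames, Statistics

df = DataFrame(x = [1, 2, 3], y = [4, 5, 6])

# conceptually, the @with block above becomes an anonymous function over
# the column vectors; the body is ordinary Julia code
f = (x, y) -> begin
    z = x + y
    t = mean(z) * 100
end
f(df.x, df.y)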
I will work on that.
In general:
- Docs are hard to read for those whose data does not benefit from split-apply-combine. (My elements are all unique, so there is nothing to split or groupby.)
- I have trouble figuring out what is and isn't needed/allowed inside the function syntax: `ByRow()`, `eachcol()`, `names()`, `Cols()`, `:`, `.=>`, `.()`, `...`, etc.
- General confusion about whether a function operates on a row, a column, elements of a row, or elements of a column.
- There is no way to test whether an argument is valid, or to test individual pieces of an argument. I would like to be able to print, check the type of, save to a variable, etc. the input and output from each part of `x => y => z`; maybe even deeper if `y` becomes complicated with nesting.
- Multi-column broadcasting is limited.
julia> filter!(names(df, Not(:node)) .=> ByRow(row -> any(x -> x>0, row)), df)
ERROR: MethodError: no method matching !(::Vector{Pair{String, ByRow{var"#204#206"}}})
julia> transform!(df, r"Temp" .=> ByRow.(fahrenheit_to_celsius), renamecols = false)
ERROR: LoadError: MethodError: no method matching fahrenheit_to_celsius(::Int64, ::Int64)
- Function broadcasting is limited.
julia> filter(x -> ismissing(eachcol(x)), df)
ERROR: MethodError: no method matching eachcol(::DataFrameRow{DataFrame,DataFrames.Index})
julia> filter(x -> ismissing.(x), df)
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved
Note:
I would like to emphasize again that my initial assumption was either:
- I am reasoning about this whole thing wrong.
- DataFrames is built for a different type of data processing and I should be using something else.

Not "DataFrames is broken and I'm mad about it."
Oh, I myself think I would use Julia code in this case; I was just showing an example with no loops. However, when it comes to transformations, performance is not always what matters, but rather accuracy and simplicity.
I do think a lot of the problems could be solved by changing the way you reason about these calls. More emphasis should be placed on creating a valid `source => fun => dest` object.
- `source` is a collection of names, i.e. `String`s or `Symbol`s.
- `fun` is just a function. `ByRow(fun)` is a wrapper function, i.e. `f` is a function and `ByRow(f)` is a function. Since a function is not iterable, `ByRow.(f)` makes no sense.
- `dest` is most commonly a single column name, like a `String` or `Symbol`, but it can also be a collection of `String`s or `Symbol`s.
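Putting those three pieces together, a minimal sketch with invented names:

using DataFrames

f(x) = (x - 32) / 1.8               # fun: an ordinary function
p = :TempF => ByRow(f) => :TempC    # source => fun => dest
transform(DataFrame(TempF = [32.0, 212.0]), p)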
`.=>`:

Similarly, `r"Temp"` is a regular expression. Because regular expressions are not iterable, `r"Temp" .=>` doesn't make sense.
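One workaround sketch, assuming a one-argument `fahrenheit_to_celsius` like the one mentioned earlier, is to expand the regular expression into a vector of column names first and broadcast over that:

using DataFrames

fahrenheit_to_celsius(x) = (x - 32) / 1.8
df = DataFrame(TopTemp = [70.0, 100.0], BottomTemp = [50.0, 80.0])

# names(df, r"Temp") is a Vector{String}, so the pair can be broadcast over it
transform(df, names(df, r"Temp") .=> ByRow(fahrenheit_to_celsius), renamecols = false)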
Since functions are not iterable, we know we have the following behavior
julia> foo(x) = x;
julia> [:a, :b] .=> foo
2-element Vector{Pair{Symbol, typeof(foo)}}:
:a => foo
:b => foo
the `fun` will get repeated.
This is all super complicated and requires a lot of pretty detailed knowledge of Julia's broadcasting rules. In no way is this meant to answer your specific questions or invalidate your frustrations, which definitely make sense.
I mean only to show that a validation function (which Bogumil has already filed an issue for) might go a long way. If I were to give advice about this, I would say you should put more effort into understanding the way common Julia idioms, like broadcasting and regular expressions, are used in the construction of the `source => fun => dest` pairs.
This is a good point. DataFrames.jl could benefit from looking at dplyr's functions for this: Apply a function (or functions) across multiple columns — across • dplyr
Does `Not(:node)` create an iterable vector? That is another element I wish I could have inspected for type, but it only makes sense in the context of the DataFrame call.
No, `Not(:node)` does not create an iterable vector. And neither do `Between` or `Cols` at the moment. Or rather, it was just fixed on `master` and I don't know when the next release will be. For now use `names(df, Not(:node))`.
This is definitely frustrating. I'm glad the JuliaData maintainers have fixed it, though. It will make it so people don't experience that frustration in the future.
In general, I would say, make the pair first, especially if it's complicated, and then inspect the result with `print`. DataFrames is not doing any special evaluation, so if the pair you constructed looks bad, it probably won't work.
The following is a complicated pair, but you can see pretty clearly that it's doing what you want.
julia> p1 = [:a, :b] .=> f .=> [:z1, :z2]
2-element Vector{Pair{Symbol, Pair{typeof(f), Symbol}}}:
:a => (f => :z1)
:b => (f => :z2)
Although there are lots of great macros and other tricks, sometimes it's easy to just handle stuff using regular iteration and higher-order functions like `map` and `filter`. Also remember you can create a column by assigning to it:
df.hasmissing = map(x->any(ismissing,x),eachrow(df))
Convert temperatures:
df.tempC = map(ftoc, df.tempF)
Filter out rows that are all zero:
filter(x -> all(==(0),x),foo)
I equally find DataFrames frustrating at times, especially coming from R / tidyverse. Take the second example you gave. In R this would be one very readable line of code:
mutate(df, across(contains("Temp"), ~ (.x - 32)*(5/9)))
No need to worry about calling the `names` function, regular expressions, creating a separate `fahrenheit_to_celsius` function, or giving a rename option.
Yes, it's readable because it's not a lot of code, but my questions would be: what does `contains("Temp")` return? Is that composable, or is it a nonstandard-evaluation kind of thing? Also, what does `~` do in this context, and what is `.x`? Just pointing out that this is a DSL that's also not understandable with just R knowledge.
What @jules says above together with
are key to me: DataFrames has built an infinitely composable mini-language based on standard Julia syntax. Admittedly, because it is so powerful and composable, it requires good knowledge of how Julia works and a good mental model of what is happening in the mini-language, but it does not require learning a new language/DSL.
I agree R syntax is usually problematic, but in this case I think it wins. What does it matter what the `contains` function returns? All I need to know is that it selects columns that contain a string I provide (you could also swap in the `endswith` function in this case). As for `~` and `.x`, they are useful but kind of ugly shorthand for an anonymous function. I prefer Julia's `x -> f(x)` syntax, but you could just as easily provide a named function instead.
I would argue that this DSL is more intuitive as I read the code: "mutate this data frame across columns that contain 'Temp' with this provided function".
I made some more benchmarks with larger vectors and added StructArray:
julia> node = rand(Int8, 10000); x = rand(Int8, 10000); y = rand(Int8,10000);
julia> sa = StructArray(node = node, x = x, y = y)
julia> df = DataFrame(node = node, x = x, y = y)
julia> @benchmark $sa[($sa.x .== 0) .| ($sa.y .== 0),:]
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.542 μs … 494.832 μs  ┊ GC (min … max): 0.00% … 94.94%
 Time  (median):     3.756 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.266 μs ±  10.912 μs  ┊ GC (mean ± σ):  7.69% ± 3.04%

 [histogram omitted]
 3.54 μs        Histogram: log(frequency) by time        5.82 μs <

 Memory estimate: 6.02 KiB, allocs estimate: 6.
julia> @benchmark $df[($df.x .== 0) .| ($df.y .== 0),:]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.897 μs …  3.326 ms  ┊ GC (min … max): 0.00% … 99.04%
 Time  (median):     12.627 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.714 μs ± 33.332 μs  ┊ GC (mean ± σ):  2.40% ± 0.99%

 [histogram omitted]
 11.9 μs        Histogram: log(frequency) by time        20.4 μs <

 Memory estimate: 7.92 KiB, allocs estimate: 25.
julia> @benchmark @subset($df, ((:x .== 0) .| (:y .== 0)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  64.454 μs …  5.446 ms  ┊ GC (min … max): 0.00% … 93.48%
 Time  (median):     66.507 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   69.103 μs ± 75.999 μs  ┊ GC (mean ± σ):  1.48% ± 1.33%

 [histogram omitted]
 64.5 μs         Histogram: frequency by time          89.7 μs <

 Memory estimate: 21.28 KiB, allocs estimate: 170.
It looks like StructArrays is a clear winner at least for speed.
It's certainly intuitive to read for a lot of people. Some people place a higher value on a "from first principles" approach, where hopefully nothing works "magically" but everything can be derived from the simple building blocks of the language. That's what base DataFrames does, but of course it trades off syntactic simplicity. I agree that the macro packages DataFramesMeta / DataFrameMacros might take another look at this `across` scenario; maybe there are improvements to be made there.
As others have said, it's a bit of a special case that applying a function to multiple columns is currently not covered by the macro packages.
But I still think DataFrames.jl wins here because you don't need to learn all this fancy stuff (whether `ByRow`, `=>`, and `copycols` in Julia, or `across`, `contains`, `~`, and `.x` in R) to get the job done. You can just write:
df[:, r"Temp"] = (df[:, r"Temp"] .- 32) .* (5/9)
Sprinkle in some `!` and `.` for performance, but that's not necessary.
(To be honest, there is one non-obvious trick to know if the original columns were `Int`: you cannot put `Float64` values in them, so you need to write `df[!, r"Temp"] = ...` to replace the columns with the right-hand side, rather than overwrite them in place.)
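A minimal sketch of that `Int`-column case, with data invented for illustration:

using DataFrames

df = DataFrame(TopTemp = [70, 73, 100], BottomTemp = [50, 55, 80])  # Int columns

# writing Float64 results into the Int columns in place would error,
# so `!` replaces the columns with the freshly computed Float64 ones
df[!, r"Temp"] = (df[:, r"Temp"] .- 32) .* (5/9)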