To allow multiple arguments. Plus there are exceptions like `nrow`, which can stand alone; including these cases might create a lot of method ambiguities.
@Nathan_Boyer makes a good point. If `source`, `fun`, and `dest` were keyword arguments, I would guess that the internal logic in `select`/`transform` would be approximately the same as it is right now.
@bkamins @nalimilan Could it be

transform(df, r"temp" => ByValue(t->((t-32)*5/9)) => (c->c*"celsius"))

- `ByValue` is like `ByRow`, but the function receives a value instead of a row.
- The third component of the pair is a renamer function.
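`ByValue` does not exist in DataFrames.jl; the following is a minimal sketch of what it could look like, relying on the fact that `transform` accepts any callable as the middle element of a pair (the name `ByValue` is the hypothetical one from the proposal):

```julia
using DataFrames

# Hypothetical ByValue wrapper (not part of DataFrames.jl):
# applies f to each value of the column, i.e. broadcasts it.
struct ByValue{F}
    f::F
end
(b::ByValue)(col) = b.f.(col)

df = DataFrame(temp = [32.0, 212.0])
out = transform(df, :temp => ByValue(t -> (t - 32) * 5 / 9) => :celsius)
```

Because `ByValue(f)` is an ordinary callable, `transform` treats it like any other transformation function, so no changes to DataFrames.jl internals would be required for this part of the proposal.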
With pairs, we can do

transform(df,
    :a => :b => :c,
    :d => :e => :f,
    :g => :h => :i,
)

which we can't do with keyword arguments.
Good point. Unfortunately I've been coding in Matlab and Python lately, so I'm a little rusty on DataFrames.jl details.
However, it could probably be handled with the right API. This may not be elegant, but it would work:
df = DataFrame(a=1:2, b=3:4, c=5:6)
transform(df,
    source = [(:a, ), (:b, :c)],
    fun = [x -> 2x, (x, y) -> x + y],
    dest = [:d, :e]
)
This would apply `x -> 2x` to column `:a` and `(x, y) -> x + y` to columns `:b` and `:c`.
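For comparison, the two transformations described above can already be expressed in the current pair minilanguage (the destination names `:d` and `:e` follow the proposal):

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4, c = 5:6)
# x -> 2x applied to :a, and (x, y) -> x + y applied to :b and :c:
out = transform(df,
    :a => (x -> 2x) => :d,
    [:b, :c] => ((x, y) -> x + y) => :e,
)
```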
One advantage is that it's a lot more natural to spread keyword arguments over multiple lines than it is to spread a double pair over multiple lines.
I think this would be really cool, but not as an added keyword argument to `transform`; rather as a function that makes pairs:

make_pair(source = [:a, :b], fun = f, dest = AsTable)

etc. Then you can do

transform(df, make_pair(...))

This could probably live in the same package as `Across` and friends.
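A minimal sketch of such a helper, assuming the name `make_pair` and its keyword signature (both hypothetical):

```julia
using DataFrames

# Hypothetical helper: rewrites keyword arguments into the standard
# src => fun => dst pair accepted by select/transform.
make_pair(; source, fun, dest) = source => fun => dest

df = DataFrame(a = 1:2, b = 3:4)
out = transform(df, make_pair(source = [:a, :b], fun = +, dest = :sum))
```

`transform` then just receives an ordinary `[:a, :b] => (+) => :sum` pair, so no changes to DataFrames.jl itself would be needed.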
Using a for loop is great if I only have 1 transformation to do, but what if I had 10 other transformations I wanted to do with the data? Doing something like this:
using DataFrames, Chain
df = DataFrame(Time = [3, 4, 5, 6], TopTemp = [70, 73, 100, missing], BottomTemp = [50, 55, 80, 90])
fahrenheit_to_celsius(t) = Int(round((t - 32) * 5 / 9))
result = @chain df begin
    dropmissing
    filter(row -> row.TopTemp < 90, _)
    transform!(names(df, r"Temp") .=> ByRow(fahrenheit_to_celsius)) # renamecols=false?
    transform!([:TopTemp, :BottomTemp] => (-) => :DiffTemp)
    # etc...
end
is a lot easier to read, eliminates the use of temporary variables, and self-contains all the data wrangling you performed. I can't speak to the performance compared to loops, but I would imagine there would only be a minor cost, especially if you're using in-place transformations.
See my post above. You can use a `for` loop inside a `@chain` block easily with the `@aside` macro-flag.
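A small sketch of that pattern (the data and column names are illustrative):

```julia
using DataFrames, Chain

df = DataFrame(a = 1:3, b = 4:6)
result = @chain df begin
    transform(:a => ByRow(x -> 10x) => :a10)
    @aside for col in names(_)   # a plain for loop, run mid-chain for side effects
        println(col)
    end
    select(Not(:b))
end
```

The `@aside` expression sees the current chain value as `_` but does not change it, so the loop runs purely for its side effects while the chain continues unaffected.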
That's actually really cool; it seems like you can have the best of both worlds in Julia.
`ByRow` is a fully standalone thing, unrelated to the `DataFrame` object. Simplifying a bit (reducing to the single-column case), you can think of `ByRow(fun)` as `x -> fun.(x)`.
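Concretely, for a single source column the two specifications below produce the same result:

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])
f(v) = v + 10

# ByRow(f) applies f to each value, just like broadcasting f over the column:
a = transform(df, :x => ByRow(f) => :y)
b = transform(df, :x => (x -> f.(x)) => :y)
```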
Yes - we are aware of this limitation. It is on the roadmap. If you open an issue for this, it will probably get a higher priority. Thank you!
It is certainly doable. The only reservation is how much we want to complicate the minilanguage (it is already complex, which is part of the reason this thread was started). Can you please open an issue so we can discuss it there?
It should be relatively easy to make a companion package investigating this option (wrapper functions would rewrite the specifications passed to the standard `src => fun => dst` style).
Out of curiosity, what is confusing about well-established terms like mutate, across, etc.? As far as I can read it, your argument spirals down to everyone having to code in assembly. Your `:` and other special signs are no less magical than the other syntax, right? Higher levels of abstraction are not always a bad thing in my mind, so I just want to understand where you're coming from here.
The biggest problem with R is that no one ever knows what anything actually means because of "nonstandard evaluation".
When I see something like `contains("Temp")`, I think "that's a function call on the string "Temp"; what does it evaluate to?" But it's NOT a function call on the string Temp. I actually don't have the slightest idea what it is. It's really some macroish magic whose value depends on the context in which it appears.
Now look at `across(...)`: that also looks like a function called on whatever `contains("Temp")` returns and whatever `~ (.x - 32)*(5/9)` means. But of course, it's not that either.
And what does `~(.x - 32)*(5/9)` mean? What is the significance of the symbol `.x`? Does this evaluate to a thing? And how does that thing work?
So it's not that I disagree with abstraction; it's more that I disagree with incredibly obfuscated semantics. The beauty of Julia is that the semantics are usually very clear, and the places where the semantics are different are clearly delineated by `@` macro calls.
I think the `@chain` macro is quite nice, its semantics are clear, and it offers a lot of useful abstraction. I like the functional API with filter, transform, and so forth in Julia, again, because the semantics are clear.
I just wanted to add that with the awesome 1.3.0 release of DataFrames the temperature example can now be written
df = DataFrame(Time=[3, 4, 5], TopTemp=[70, 73, 100], BottomTemp=[50, 55, 80])
transform(df, Cols(r"Temp") .=> (t->(t.-32)*5/9), renamecols=false)
# Output
3×3 DataFrame
 Row │ Time   TopTemp  BottomTemp
     │ Int64  Float64  Float64
─────┼───────────────────────────
   1 │     3  21.1111     10.0
   2 │     4  22.7778     12.7778
   3 │     5  37.7778     26.6667
and @CameronBieganek's request for renaming columns has been implemented, so we can write
transform(df, Cols(r"Temp") .=> (t->(t.-32)*5/9) .=> (n->n*"_celsius"))
# Output
3×5 DataFrame
 Row │ Time   TopTemp  BottomTemp  TopTemp_celsius  BottomTemp_celsius
     │ Int64  Int64    Int64       Float64          Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │     3       70          50          21.1111             10.0
   2 │     4       73          55          22.7778             12.7778
   3 │     5      100          80          37.7778             26.6667
Looks like I need to read the release notes. Cool stuff!
Everyone was remarkably patient with this except for some criticism of R.
Regardless of language, there are problems with these ad hoc containers layered onto the real datatypes. APIs for doing the same thing (equivalent semantics) have no similarity. Performance is hard to predict.
It's really time for a heterogeneous matrix: columns can be different types as long as every element in a column is the same type. Memory access from such a matrix must be slower than from a homogeneous matrix, but every memory location can be calculated. Now, just treat this mythical being like a matrix (2D array).
The closest way to get to array manipulation of a collection of dissimilar columns in Julia today is with TypedTables, which no one mentioned. I use typed tables to hold simulation data with 16 columns and up to 8 million rows. There are some weirdnesses because the structure is immutable, but generally it's a collection of vectors.
Indeed, type-stable performant collections are widely useful. Julia is flexible enough to have several reasonably popular implementations of this concept: for example, there is StructArrays in addition to the TypedTables you mention. They basically have the same layout, and both implement the Tables interface and can be used in generic tabular functions. For the "inverse" row-based layout, there is the built-in Vector-of-NamedTuples, which is also both a table and a regular array.
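For instance, a `StructArray` behaves both as a table and as an ordinary array of `NamedTuple`s (a sketch assuming the StructArrays.jl package is installed):

```julia
using StructArrays

# Column-based storage: one vector per field.
sa = StructArray(a = [1, 2, 3], b = ["x", "y", "z"])

sa.a    # the underlying column vector
sa[2]   # element access yields a NamedTuple row
```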
Thanks.
I'll take a look at StructArrays. The one problem with the named tuple approach of TypedTables is the horrendous type definition that results.