Hi all, I want to share a cool upcoming feature of DataFrameMacros with you (currently in this PR https://github.com/jkrumbiegel/DataFrameMacros.jl/pull/15) and get some thoughts on two specific issues.
The problem I wanted to solve was transformations over multiple columns. As you probably know, DataFrameMacros and DataFramesMeta, which have so far worked in relatively similar ways, only handle columns that are named one by one. So something like :x + :y works, but there is nothing like dplyr's across functionality. I badly wanted to have this, so I thought about how to get it.
First I wanted to include an additional construct like @across, but then it hit me: why not make better use of the existing DataFrames machinery? So far, I've constructed src => function => sink expressions in my macros, but why not just change that to srcs .=> function[s] .=> sinks? The nice thing is that the new functionality doesn't break the old one at all, as all the single-column specifiers (Symbol, String, Int) just broadcast to single results. But now you can use any multi-column specifier in a functional expression, and the resulting mini-language construct is automatically broadcast over all columns.
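To make the broadcasting idea concrete, here is a minimal sketch in plain Julia, without DataFrameMacros or DataFrames. The column names and the use of sqrt are made up for illustration; the sketch only shows how .=> behaves and is not the code the macros actually generate.

```julia
# Symbols, Strings and functions act as scalars under broadcasting,
# so the dotted form collapses to one ordinary nested Pair.
single = :x .=> sqrt .=> :x_sqrt
# single is :x => (sqrt => :x_sqrt)

# A vector of sources broadcasts elementwise against a vector of sinks,
# giving one src => function => sink Pair per column.
srcs = ["a", "b", "c"]                  # hypothetical column names
multi = srcs .=> sqrt .=> srcs .* "_sqrt"

println(single)
foreach(println, multi)
```

DataFrames' select accepts both a single Pair and a vector of Pairs, which is why the same dotted expression can serve the single-column and the multi-column case.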
Here I'll copy-paste a few examples without context. For more explanation, have a look at the preview docs at Tutorial · DataFrameMacros.jl
Selection with a Function
julia> @select(df, $(endswith("e")))
891×3 DataFrame
 Row │ Name                               Age       Fare
     │ String                             Float64?  Float64
─────┼─────────────────────────────────────────────────────
   1 │ Braund, Mr. Owen Harris                22.0   7.25
   2 │ Cumings, Mrs. John Bradley (Flor…      38.0  71.2833
   3 │ Heikkinen, Miss. Laina                 26.0   7.925
   4 │ Futrelle, Mrs. Jacques Heath (Li…      35.0  53.1
  ⋮  │                 ⋮                     ⋮         ⋮
 889 │ Johnston, Miss. Catherine Helen …   missing  23.45
 890 │ Behr, Mr. Karl Howell                  26.0  30.0
 891 │ Dooley, Mr. Patrick                    32.0   7.75
                                            884 rows omitted
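For comparison, roughly the same selection can be written in plain DataFrames.jl by passing a predicate to names. This is a sketch on a small made-up table, not the Titanic data used above, and it assumes a DataFrames 1.x release in which names accepts a function of the column name.

```julia
using DataFrames

# Small stand-in table (hypothetical data, not the Titanic dataset).
df = DataFrame(Name = ["Ann", "Bob"], Age = [22.0, 38.0],
               Sex = ["f", "m"], Fare = [7.25, 71.28])

# names(df, predicate) keeps the columns whose name satisfies the predicate.
cols = names(df, endswith("e"))

selected = select(df, cols)
println(names(selected))
```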
Transformation over multiple columns selected by Type
julia> @select(df, Float32($Real))
891×6 DataFrame
 Row │ PassengerId_Float32  Survived_Float32  Pclass_Float32  SibSp_Float32  Parch_Float32  Fare_Float32
     │ Float32              Float32           Float32         Float32        Float32        Float32
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │                 1.0               0.0             3.0            1.0            0.0        7.25
   2 │                 2.0               1.0             1.0            1.0            0.0       71.2833
   3 │                 3.0               1.0             3.0            0.0            0.0        7.925
   4 │                 4.0               1.0             1.0            1.0            0.0       53.1
  ⋮  │          ⋮                  ⋮                ⋮               ⋮              ⋮             ⋮
 889 │               889.0               0.0             3.0            1.0            2.0       23.45
 890 │               890.0               1.0             1.0            0.0            0.0       30.0
 891 │               891.0               0.0             3.0            0.0            0.0        7.75
                                                                                         884 rows omitted
Transformation of sink column names
julia> @select(df, lowercase.($Real) .* "_32" = Float32($Real))
891×6 DataFrame
 Row │ passengerid_32  survived_32  pclass_32  sibsp_32  parch_32  fare_32
     │ Float32         Float32      Float32    Float32   Float32   Float32
─────┼────────────────────────────────────────────────────────────────────
   1 │            1.0          0.0        3.0       1.0       0.0   7.25
   2 │            2.0          1.0        1.0       1.0       0.0  71.2833
   3 │            3.0          1.0        3.0       0.0       0.0   7.925
   4 │            4.0          1.0        1.0       1.0       0.0  53.1
  ⋮  │       ⋮              ⋮           ⋮         ⋮         ⋮         ⋮
 889 │          889.0          0.0        3.0       1.0       2.0  23.45
 890 │          890.0          1.0        1.0       0.0       0.0  30.0
 891 │          891.0          0.0        3.0       0.0       0.0   7.75
                                                           884 rows omitted
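A rough plain-DataFrames counterpart of the last two examples looks like the sketch below. The table is made up, and the explicit anonymous function is just one way to write the conversion; I am not claiming this is the exact expression the macro produces.

```julia
using DataFrames

# Hypothetical stand-in data with one non-numeric column.
df = DataFrame(Id = [1, 2], Name = ["Ann", "Bob"], Fare = [7.25, 71.28])

srcs = names(df, Real)               # columns whose element type is a Real
sinks = lowercase.(srcs) .* "_32"    # one sink name per source column

# One src => function => sink Pair per selected column.
result = select(df, srcs .=> ByRow(x -> Float32(x)) .=> sinks)
println(result)
```

Note that a column with missing values, such as Age in the Titanic table, has element type Union{Missing, Float64}, which is not a subtype of Real. That would explain why Age does not show up in the outputs above.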
Multi-dimensional broadcasting
julia> @select(df, ["a" "c"; "b" "d"] = $[:Survived, :Pclass] * $(permutedims([:Survived, :Pclass])))
891×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼───────────────────────────
   1 │     0      0      0      9
   2 │     1      1      1      1
   3 │     1      3      3      9
   4 │     1      1      1      1
  ⋮  │   ⋮      ⋮      ⋮      ⋮
 889 │     0      0      0      9
 890 │     1      1      1      1
 891 │     0      0      0      9
                  884 rows omitted
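The column order a, b, c, d follows from ordinary Julia broadcasting of a column vector against a row vector. The sketch below uses only Base Julia, with plain tuples standing in for the column combinations; it illustrates the shape logic and is not what the macro expands to.

```julia
cols = ["Survived", "Pclass"]

# A column vector against a row vector broadcasts to a 2×2 matrix of combinations.
combos = tuple.(cols, permutedims(cols))

sinks = ["a" "c"; "b" "d"]

# Pair each combination with its sink name; vec walks the matrix column by column.
mapping = vec(combos .=> sinks)
foreach(println, mapping)
```

Under this reading, a and d are the squares of Survived and Pclass, while b and c both hold their product, which matches the rows shown above.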
RFC part
There are two things I'm currently thinking about. One is that $(All()) is not the most concise syntax, and theoretically the macro could treat All() as a column identifier, because that's DataFrames syntax and shouldn't mean anything else. So should All(), Between(), etc. be special-cased so that they don't need $?
The other thing is that I would like an option to broadcast parts of the functional expression itself as well. Currently, only one function is broadcast across all column combinations. But let's say you wanted to add increasing numbers to the first 100 columns. My idea is to use the @b flag macro to signal that a part of the expression should be taken out and broadcast along with the rest. Something like this:
@select(df, $(1:100) + @b(1:100))
# would be equivalent to
select(df, names(df, 1:100) .=> broadcast(i -> ByRow(x -> x + i), 1:100))
It may look a bit cryptic at first, but it could be quite powerful. If you have an opinion, I'd be interested in hearing it so I can make a good decision.
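The plain DataFrames form of that proposal can already be tried on a small table. The sketch below shrinks the idea to three made-up columns and writes the results back under the original column names, which is chosen here only for readability; the @b syntax itself is still just a proposal and is not used.

```julia
using DataFrames

# Hypothetical three-column table standing in for the "first 100 columns".
df = DataFrame(a = [1, 2], b = [10, 20], c = [100, 200])

srcs = names(df, 1:3)

# One ByRow closure per column, each capturing a different offset i.
funcs = broadcast(i -> ByRow(x -> x + i), 1:3)

# Sources, functions and sinks are broadcast together, elementwise.
result = select(df, srcs .=> funcs .=> srcs)
println(result)
```

Each column gets its own offset: 1 is added to a, 2 to b and 3 to c.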