Overview
DataFrames.jl seems to be the preferred way to work with tabular data in Julia, but it never does what I want the first time. The syntax I have to contrive for things I think should be simple is mind-boggling, especially when I try to explain what my code is doing to non-coders. It makes me wonder whether there is a different package I should be using for my typical workflow, since the things I do most often are so hard. I am usually joining CSVs together, filtering, sorting, and computing some new columns. These all sound like functions provided by DataFrames, but I don't think I do them as explained in The Split-Apply-Combine Strategy.
I have read the documentation and done the tutorial on GitHub. I nodded along as I went through, but using DataFrames on my own projects has been a different experience. I always think "No problem, I will whip this up in 10 minutes," but an hour later I am still sifting through error messages and documentation.
Examples
- Filter out rows which are all zeros.
Code for Example 3
julia> using DataFrames
julia> df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
4×3 DataFrame
 Row │ node   x      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      0
   2 │     2      0      0
   3 │     3      9     12
   4 │     4      5      8
Trying to use the previous StackOverflow answers as a guide:
julia> filter!(names(df, Not(:node)) .=> ByRow(row -> any(x -> x>0, row)), df)
ERROR: MethodError: no method matching !(::Vector{Pair{String, ByRow{var"#204#206"}}})
Closest candidates are:
!(::Bool) at bool.jl:33
!(::Function) at operators.jl:968
!(::Missing) at missing.jl:101
Stacktrace:
[1] filter!(f::Vector{Pair{String, ByRow{var"#204#206"}}}, df::DataFrame)
@ DataFrames C:\Users\nboyer.AIP\.julia\packages\DataFrames\vuMM8\src\abstractdataframe\abstractdataframe.jl:1127
[2] top-level scope
@ REPL[106]:1
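Reading the stack trace, filter! received the whole vector of pairs as its predicate f. It takes either a plain function or a single `columns => function` pair, and it already calls that function once per row, so ByRow is not needed there. Below is a minimal sketch of two forms that accept this style, assuming a DataFrames.jl 1.x release; the names kept and kept2 are only for illustration.

```julia
using DataFrames

df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])

# filter takes one `columns => function` pair. AsTable hands the function
# a NamedTuple holding the selected columns of a single row.
kept = filter(AsTable(Not(:node)) => nt -> any(>(0), nt), df)

# subset follows the select/transform-style syntax, where the function sees
# whole columns unless it is wrapped in ByRow.
kept2 = subset(df, AsTable(Not(:node)) => ByRow(nt -> any(>(0), nt)))
```

Both calls are non-mutating, so df itself is left untouched while experimenting.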
This works but destroys df:
julia> filter!(row -> any(x -> x>0, row), df[!,Not(:node)])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     9     12
   3 │     5      8
julia> df
Error showing value of type DataFrame:
ERROR: AssertionError: Data frame is corrupt: length of column :x (3) does not match length of column 1 (4). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).
Stacktrace:
[1] _check_consistency(df::DataFrame)
@ DataFrames C:\Users\nboyer.AIP\.julia\packages\DataFrames\vuMM8\src\dataframe\dataframe.jl:447
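The corruption message hints at the cause: df[!, Not(:node)] builds a new DataFrame that reuses df's x and y vectors rather than copying them, so filter! shortens those two vectors while node keeps its four elements. Here is a minimal sketch, assuming DataFrames.jl 1.x indexing semantics, of checking that sharing and of working on a copy instead. The names shared and copied are only for illustration.

```julia
using DataFrames

df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])

# `!` selects columns without copying: both data frames hold the same vectors.
shared = df[!, Not(:node)]
is_shared = shared.x === df.x

# `:` copies the columns, so mutating the result cannot touch df.
copied = df[:, Not(:node)]
is_copy = !(copied.x === df.x)

# In-place filtering is safe on the copy; df keeps all four rows.
filter!(row -> any(>(0), row), copied)
```

This keeps df intact, but the filtered copy no longer has the node column, which is why the pair form that follows is the better fit for this task.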
Finally, after much frustration, this works as expected (nested anonymous functions with a tuple splat that is part of a pair using no broadcasting dots):
julia> filter!(Not(:node) => (row...) -> any(x -> x>0, row), df)
3×3 DataFrame
 Row │ node   x      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      0
   2 │     3      9     12
   3 │     4      5      8
Final Thoughts
Perhaps the most frustrating part is that I am unable to reason my way through the error messages. It is usually impossible to pare the syntax problem down to something smaller and check that the pieces that make up args are passing the types and data structures I expect (row, column, vector, scalar, etc.). You have to get it all right in one go, or you will just get a seemingly random method error somewhere.
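One way to get inspectable pieces for the zero-row example is to build the row mask by hand and index with it, rather than packing everything into a single filter! call. This is only a sketch under the assumption of DataFrames.jl 1.x; the intermediate names (vals, rows, first_row, mask, result) are made up for illustration.

```julia
using DataFrames

df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])

# Step 1: a copy of just the value columns (a DataFrame).
vals = df[:, Not(:node)]

# Step 2: an iterator over its rows; each element is a DataFrameRow.
rows = eachrow(vals)
first_row = first(rows)   # look at one element to see what the predicate gets

# Step 3: one Bool per row, true when any value in that row is positive.
mask = [any(>(0), r) for r in rows]

# Step 4: keep the matching rows of the original data frame, all columns.
result = df[mask, :]
```

Each intermediate value can be printed or passed to typeof at the REPL before moving to the next step, at the cost of being more verbose than the one-line filter! form.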
I am not trying to complain that the package is bad. It seems very powerful, and obviously many people use it successfully. I'm just not sure what I am doing wrong. After troubleshooting syntax for hours, I usually ask myself why I didn't just use Excel (and I hate Excel). Do you all have any tips?