Frustrated using DataFrames

Nathan_Boyer · September 7, 2021, 9:38pm

Overview

DataFrames.jl seems to be the preferred way to work with tabular data in Julia, but it never does what I want the first time. The syntax I need to contrive to do things I think would be simple is mind-boggling, especially when I try to explain what my code is doing to non-coders. It makes me wonder if there is a different package I should be using for my typical workflow, since the things I want to do often are so hard. I am usually joining CSVs together, filtering, sorting, and computing some new columns. These all sound like functions provided by DataFrames, but I don’t think I do it as explained in The Split-Apply-Combine Strategy.

I have read the documentation and done the tutorial on Github. I nodded along as I went through, but using DataFrames on my own projects has been a different experience. I always think “No problem, I will whip this up in 10 minutes.” but an hour later am still sifting through error messages and documentation.

Examples

Code for Example 3

julia> using DataFrames

julia> df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
4×3 DataFrame
 Row │ node   x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      0
   2 │     2      0      0
   3 │     3      9     12
   4 │     4      5      8

Trying to use the previous StackOverflow answers as a guide:

julia> filter!(names(df, Not(:node)) .=> ByRow(row -> any(x -> x>0, row)), df)
ERROR: MethodError: no method matching !(::Vector{Pair{String, ByRow{var"#204#206"}}})
Closest candidates are:
  !(::Bool) at bool.jl:33
  !(::Function) at operators.jl:968
  !(::Missing) at missing.jl:101
Stacktrace:
 [1] filter!(f::Vector{Pair{String, ByRow{var"#204#206"}}}, df::DataFrame)
   @ DataFrames C:\Users\nboyer.AIP\.julia\packages\DataFrames\vuMM8\src\abstractdataframe\abstractdataframe.jl:1127
 [2] top-level scope
   @ REPL[106]:1

This works but destroys df:

julia> filter!(row -> any(x -> x>0, row), df[!,Not(:node)])
3×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      0
   2 │     9     12
   3 │     5      8

julia> df
Error showing value of type DataFrame:
ERROR: AssertionError: Data frame is corrupt: length of column :x (3) does not match length of column 1 (4). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).
Stacktrace:
 [1] _check_consistency(df::DataFrame)
   @ DataFrames C:\Users\nboyer.AIP\.julia\packages\DataFrames\vuMM8\src\dataframe\dataframe.jl:447

Finally, after much frustration, this works as expected (nested anonymous functions with a tuple splat that is part of a pair using no broadcasting dots):

julia> filter!(Not(:node) => (row...) -> any(x -> x>0, row), df)
3×3 DataFrame
 Row │ node   x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      0
   2 │     3      9     12
   3 │     4      5      8

Final Thoughts

Perhaps the most frustrating part is that I am unable to reason my way through the error messages. It is usually impossible to pare the syntax problem down to something smaller and make sure the pieces that make up args are passing the types and data structures I expect (row, column, vector, scalar, etc.). You have to get it all right in one go, or you will just get a random method error somewhere.

I am not trying to complain that the package is bad. It seems very powerful and obviously many people use it successfully. I’m just not sure what I am doing wrong. After troubleshooting syntax for hours, I usually ask myself why I didn’t just use Excel (and I hate Excel). Do you all have any tips?

pdeffebach · September 7, 2021, 9:45pm

These are valid concerns, but why not use DataFramesMeta.jl? It aims to solve exactly these problems. The syntax for DataFrames transformations is definitely at times, but that is why DataFramesMeta exists and is overseen by the core contributors of DataFrames.jl.

Admittedly, some of these problems are not easily solved by DataFramesMeta, particularly cases where you need to work with many columns at once. But there is nothing stating you have to do everything in a single transform call. Some of these can be done in a for loop. Instead of

transform(df, names(df, r"Temp") .=> fun; rename = false)

do

celsius_names = names(df, r"Temp")
for cname in celsius_name
    transform!(...)
end

juliohm · September 7, 2021, 9:45pm

I rarely use DataFrames.jl directly because it is too low-level, I second the suggestion by @pdeffebach. DataFramesMeta.jl is clean, specially now that it has row-based operations like @rtransform etc.

Nathan_Boyer · September 7, 2021, 9:47pm

… because it sounds scary. Is that not Advanced DataFrames?

jzr · September 7, 2021, 9:48pm

I prefer to avoid macro libraries myself (DataFramesMeta, DFMacros) because I feel they confuse more than they help if the functional API is well designed.

The solution to this one is filter(row->(any(ismissing,row), df, view=true). DataFrames.jl differs from every other part of Julia in that it copies by default, because it is aimed at beginners who might not understand the difference. Personally I think that’s not a good approach for a high-performance library. I agree with you there.

I would say

transform(df, :top => ByRow(ftoc) => :topc, :bottom => ByRow(ftoc) => :bottomc)

This seems fine to me, unless you want to retain each column name, in which case it’s annoying. Another function could do that job more easily, specifically for changing the value while retaining the name.

filter(row->all(==(0), row), df)
filter(row->all(==(0), row), df, view=true)

Seems ok, with the caveat about view.

The only macro I do want is Chain.jl, because it’s so simple and universal.

pdeffebach · September 7, 2021, 9:50pm

No. It’s designed to be simpler. My ideal audience is probably a first year MPP student doing data analysis for the first time. Maybe work through this tutorial and see if it’s for you.

But as noted above in an edit, for some of these programmatic ones, working with many rows at once, a for loop with DataFramesMeta is what I would turn to.

One other thought, if you are using Chain.jl for piping, which is exported by DataFramesMeta, the @aside macro is suuuuuppper (super!) useful for mixing dplyr-esque split-apply-combine and more programming-based operations.

pdeffebach · September 7, 2021, 9:53pm

I would disagree with this, take a look at the example OP provided of a scary error

julia> filter!(row -> any(x -> x>0, row), df[!,Not(:node)])
3×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      0
   2 │     9     12
   3 │     5      8

julia> df
Error showing value of type DataFrame:
ERROR: AssertionError: Data frame is corrupt: length of column :x (3) does not match length of column 1 (4). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

This error arises because df[!, Not(:node)] is not the same object as df, but shares the same columns. The filter! operation modifies those arrays, leading to corruption of df. This is a direct consequence of need-for-speed non-copying.

@Nathan_Boyer my advice: only use : when indexing, instead of ! and only use copying versions of functions, i.e. filter instead of filter, @rtransform instead of @rtransform! (from DataFramesMeta).

jzr · September 7, 2021, 9:54pm

I don’t use mutating calls in dataframes except when forced by some particular unusual constraint. They confuse the program state, introducing inconsistencies as shown above.

pdeffebach · September 7, 2021, 9:57pm

I agree to some extent (though if you are coming from Stata, having a single object which is modified is more natural). This certainly seems to be the case for OP. But DataFrames.jl provides the option to have maximum performance. It caters to both demographics.

jzr · September 7, 2021, 9:58pm

Copying-style functions are so different in semantics and performance from view-style functions that they really shouldn’t be arguments on the same function, but two families of functions.

Having to write view=true and copycols=false everywhere doesn’t really suit my preference for concision either.

johnmyleswhite · September 7, 2021, 10:05pm

This claim is unclear to me – I would say one of the most striking differences between Julia and Python/NumPy is that Julia almost always copies arrays by default whereas Python/NumPy almost never does. What did you have in mind?

jzr · September 7, 2021, 10:20pm

Thinking of dataframes as a collection of collections, it would be unusual for a function that replaces the outer collection to also replace the inner collections. For example, consider a function that adds a key to a dict of vectors. transform(df, :a => ByRow(identity) => :b) allocates a new vector for each column a:z. It would be more typical to reuse the unchanged columns by reference. Likewise, filtering a vector of dicts wouldn’t allocate new dicts.

xiaodai · September 7, 2021, 11:41pm

Perhaps this video would help https://www.youtube.com/watch?v=rDvpLFxcL84

oxinabox · September 8, 2021, 12:07am

Sliced indexing on arrays in Julia always returns a copy.

Similar filter on arrays returns a newly allocated array not a view (unless you use Iterators.filter)

jzr · September 8, 2021, 12:14am

That’s not what happens here. DataFrames.jl filter allocates new contents, not just new DataFrames. Likewise, transform allocates new column vectors for unchanged columns.

Edit: I see the source of confusion. Some people are thinking of dataframes as similar to matrices of values while I’m thinking of them as collections of collections. Copying a matrix is different from copying a vector of dicts or a dict of vectors: copying a matrix copies deeper. See the copycols argument of transform for detail.

oxinabox · September 8, 2021, 9:41am

This was once a matter of opinion. It is not any more.
One of the things done during the DataFrame stabalization process was basically to cement into dataframe’s concept that it was its own 2D type.
It wasn’t a collection of columns (or a collection of rows).
It’s also not a “matrix” because the semantics of a matrix largely only make sense for homogeneous types.
(Other Table.jl table types exist and more can be created, which happily can hold the opinion that they are just a collecton of columns or of rows. And they will work happily with e.g. Query.jl, CSV.jl, LibPQ.jl etc, and be trivially convertable to and from DataFrames.)

Once one takes the view that it is its own collection type, and not a wrapper,
it is now deducable that it should copy:
From the docs on mutable collection constructors (in this case in contrast to convert):

… if T is a mutable collection type then T(x) should always make a new collection (copying elements from x ).

A lot (all) of DataFrame’s copies are boiling down to ensuring they are safe.
It used to be very easy to corrupt a DataFrame by ending up sharing columns with another DataFrame and then call unique! and now ended up with a corrupted dataframe with columns of different lengths.

mcreel · September 8, 2021, 10:27am

I am no DataFrames expert, but I have come to like it, as it allows me to reproduce my data preparation routines, rather than hacking together a solution. The manipulation tools take a little learning, but once a few important ones are learned, I find I can get done the things I need to do.

For Example3, a solution is

using DataFrames, DataFramesMeta
df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
@subset!(df, ((:x .!= 0) .| (:y .!= 0)))

For my own learning, I also looked at Example2, and came up with

using DataFrames, DataFramesMeta

fahrenheit_to_celsius(x) = Int(round((x - 32) * 5 / 9))

df= DataFrame(Time = [3, 4, 5], TopTemp = [70, 73, 100], BottomTemp = [50, 55, 80])

[df[!,v] = fahrenheit_to_celsius.(df[!,v]) for v in (:TopTemp, :BottomTemp)]

KajWiik · September 8, 2021, 12:43pm

I noticed that DataFramesMeta @subset is about ten times slower than direct indexing (OK, this is a small DataFrame):

julia> using DataFrames, DataFramesMeta
julia> df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
julia> @benchmark @subset(df, ((:x .!= 0) .| (:y .!= 0)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.613 μs …  6.984 ms  ┊ GC (min … max): 0.00% … 99.12%
 Time  (median):     47.490 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.393 μs ± 69.472 μs  ┊ GC (mean ± σ):  1.40% ±  0.99%

     ▄█▆▂                                                      
  ▁▃▇████▆▄▃▂▂▂▂▂▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  45.6 μs         Histogram: frequency by time        65.2 μs <

 Memory estimate: 9.67 KiB, allocs estimate: 167.

julia> @benchmark df[(df.x .!= 0) .| (df.y .!= 0), :]
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.154 μs … 604.375 μs  ┊ GC (min … max): 0.00% … 98.70%
 Time  (median):     3.387 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.724 μs ±  10.293 μs  ┊ GC (mean ± σ):  4.76% ±  1.71%

      ▄██▆▁                                                    
  ▁▁▂▆█████▆▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  3.15 μs         Histogram: frequency by time           5 μs <

 Memory estimate: 1.70 KiB, allocs estimate: 22.

I guess it is OK to use direct indexing?

CameronBieganek · September 8, 2021, 1:11pm

This seems to be an unreliable heuristic, even in Base.

Counterexample 1

julia> x = "a" => [1, 2];

julia> d = Dict(x)
Dict{String, Vector{Int64}} with 1 entry:
  "a" => [1, 2]

julia> x[2][2] = 100
100

julia> d
Dict{String, Vector{Int64}} with 1 entry:
  "a" => [1, 100]

Counterexample 2

julia> using LinearAlgebra

julia> m = [1 2
            3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> sm = Symmetric(m)
2×2 Symmetric{Int64, Matrix{Int64}}:
 1  2
 2  4

julia> m[1, 2] = 100;

julia> sm
2×2 Symmetric{Int64, Matrix{Int64}}:
   1  100
 100    4

I agree with @jzr that it seems that Julia leans more toward not copying than toward copying.

Palli · September 8, 2021, 1:13pm

I admit I’ve never used DataFramesMeta.jl either, overlooked, thought not needed. I hope it resolves all your issues. If not there’s also Pandas.jl. Are there other conpeting packages? Maybe Query[verse].jl I’m not up-to-speed on.

I would hate for people to migrate to Python totally, based on a misunderstanding, or even partially to Python’s Pandas (it used PyCall.jl) if not needed. Many people are happy with Pandas, it’s an “all-in-one” package I understand, more of a think in Python ecosystem.

Should DataFramesMeta.jl be merged into DataFrames.jl (or added as its dependency)? I don’t know the history of it, but I at least assumed DataFrames.jl to be Julia’s go to package.

My PR to Pandas.jl got recently merged (after a long time, making it slightly better), maybe it’s not updated often, at least not as much as DF.jl, could be because it’s a wrapper, but I’m still not sure it covers all that Pandas can do.