Frustrated using DataFrames

bkamins · September 8, 2021, 1:48pm

If you want an in-place operation I would write it:

filter!(row -> !all(x -> x == 0, row[Not(:node)]), df)

and I find it a natural way to do it, as it says:

for each row drop :node column and for all other columns if they pass the all test drop the row

Two additional comments:

As @pdeffebach said - DataFramesMeta.jl is intended as an entry level package covering most of simple operations more easily.
DataFrames.jl gives one a full control over all natural options what you could want to do with data (copy, in-place update, view), but since we give the user full control this requires learning. Therefore the “default mode” is to perform a copy as it is safest as it was commented in the answers above.

Having said that I think the crucial thing is what @Nathan_Boyer started with - it would be great if you would expand on what you would find helpful to learn to handle the DataFrames.jl efficiently (as then we would add it to the package).

Nathan_Boyer · September 8, 2021, 1:50pm

It seems like Pandas.jl tries to replicate Python style and DataFrames.jl tries to replicate R style. I don’t have much experience with Python or R, so those similarities are of no benefit to me.

I typically have some CSV data files which are joined into one DataFrame by some ID column. A given ID may have many columns of data associated with it, so a lot of my operations are multi-column e.g. operate on all columns but ID or operate on all columns with “temp” in the header. This seems to complicate the required syntax.

I do prefer to mutate a single DataFrame in place, adding columns as necessary. It keeps my workspace clean and avoids me having to rename stuff and keep track of multiple DataFrames. I like it mostly for organization, but speed is a great benefit too.

mbauman · September 8, 2021, 1:50pm

Oooh, that’s a tricky one. I’d say that your counterexamples 1 ~~and 3~~ aren’t actually counter examples. The text of the docs is fairly confusing: we should probably strike that “copying the elements” parenthetical. It’s not clear if that means calling copy(elt) on each element or if it’s just putting the elements into a new collection? The intention is the latter.

Julia doesn’t deep copy, almost ever. And creating new names for the same object doesn’t make copies, either. And putting that object inside new collections doesn’t copy. But taking one collection of type T and calling T(collection) on it should make a copy (of the outer collection).

bkamins · September 8, 2021, 1:57pm

Indeed currently in some cases it does. Add support for All, Between and Not broadcating · Issue #2171 · JuliaData/DataFrames.jl · GitHub and Fast row aggregation in DataFrames.jl · Issue #2768 · JuliaData/DataFrames.jl · GitHub are aimed to fix this case (one is for simplicity, the other is for performance). They will both land in 1.3 release.

Doing in-place operations is fully supported. It just requires more care when columns are aliased between data frames.

oxinabox · September 8, 2021, 1:57pm

That is fair, and does kinda disgree with my claim that this behavour is natural.
but not entirely.

I think in particular the reason is that a dataframe is its own collection.
How it represent that data it its business.
And the new allocations are made to hold the elements of the passed in vectors.
The fact that those allocations happen to be copies of the original vectors is an implementation detail (or at least an advanced API feature).
Just like if i call Set([1, 2, 3]) A vector will infact be allocated behind the hood (two infact) which will be filled with the contents of the [1,2,3] – an even betteer example might be OrderedSet which will have multiple vectors behind it, one of which will in this case be exactly the same content as [1,2,3] in the same order.

pdeffebach · September 8, 2021, 2:00pm

Chain.jl (again, re-exported by DataFramesMeta.jl) might give you the convenience of mutating operations without some of the safety pitfalls you’ve been seeing.

bkamins · September 8, 2021, 2:02pm

This is exactly the intended design:

entry level user: DataFrame is a black box that is designed to be safe to be used; just remember not to ever use ! on your keyboard while you are an entry level user
advanced user: you can do all you want (and there is an API for this) however at your own risk

Nathan_Boyer · September 8, 2021, 2:26pm

I am sure DataFrames is capable of everything I need it to do. My frustration comes from figuring out what is and isn’t needed/allowed inside the function syntax: ByRow(), eachcol(), names(), Cols(), :, .=>, .(), ... etc.

For instance, I did not realize that you can index rows with column selectors inside the function definition.

Nathan_Boyer · September 8, 2021, 2:27pm

I will look into Chains.jl. It should be noted that Example 3b (which caused the data corruption) was not my first instinct. I just started getting creative once 3a wasn’t working.

dlakelan · September 8, 2021, 2:40pm

For rather complicated things, sometimes a special purpose language is useful. Such as SQL. In which case you can try SQLiteDF which is a package I wrote. It will let you hand it a few data frames, and then load them into SQLite and run a particular query against them. It’s not going to be super fast necessarily, but it’ll get the job done.

pdeffebach · September 8, 2021, 2:49pm

We could maybe make tools to help users check the inputs. If you do, for example,

eachcol(df) => fun => AsTable

this won’t work, but it’s not obvious why (eachcol is an iterator of actual columns, not column names).

But maybe we can do something like check_valid_transform to separate the construction of the source => fun => dest call from the actual operation.

mthelm85 · September 8, 2021, 3:12pm

Another vote for this. I use it in a really large portion of my projects and I’ve never really felt frustrated with syntax at all. Quite the opposite actually - I use the @chain macro (previously @linq) all the time and find that it makes my code very clean and easy to understand by others as well as by myself when I come back to it at a later date since I usually perform one operation per line.

aplavin · September 8, 2021, 3:32pm

I was long frustrated by dataframes - not DataFrames.jl specifically, just dataframes in general, in any language. It’s often much more natural and convenient to represent data as something other than a flat 2d table of plain values.

Languages like Python are slow and working with arbitrary objects quickly becomes infeasible - that’s one reason dataframes are required there. But not in Julia! Built-in data structures like vectors, tuples, dicts, and custom problem-specific structs work together without any issues and slowdowns. They smoothly operate with the Tables.jl ecosystem, and with other packages such as SplitApplyCombine.jl.

Depending on your usecases, built-in structures can outperform DataFrames.jl. Syntax for common operations is reasonably convenient as-is, and can be made even smoother with a thin macro package - see [ANN] DataPipes.jl - #41 by aplavin for an example.

Nathan_Boyer · September 8, 2021, 3:57pm

I was wondering about this myself: if I would be better off saving each column as a vector variable or creating some custom struct instead of a DataFrame. However, sorting, filtering, and printing as a table all seem pretty inherent to the DataFrame structure. Can you show how you would implement some of my examples?

aplavin · September 8, 2021, 4:10pm

It’s easy to work with tables using only Base julia structures - vectors of namedtuples. Examples from your first post, untested, typos and silly errors are likely.

Select rows with missings.

using DataPipes

tbl = [(a=…, b=…), …]
@p tbl |> filter(any(ismissing, _))

Convert temperatures.

tbl = [(top=1.0, bottom=0.5, label=…), …]
@p tbl |> mutate(top=farenheit_to_celsius(_.top), bottom=farenheit_to_celsius(_.bottom))

This is an example where a nested structure is more natural:

tbl = [(label=…, temps=(0.5, 1.0)), …]
@p tbl |> mutate(temps=farenheit_to_celsius.(_.temps))

Select all-zero rows: same as above, @p tbl |> filter(all(==(0), _)).

lmiq · September 8, 2021, 4:20pm

StructArrays is also a neat possibility in many cases:

julia> using StructArrays

julia> struct Data
           name::String
           weight::Float64
       end

julia> data = StructArray([ Data("Maguila",130.0), Data("Tyson",120.0) ])
2-element StructArray(::Vector{String}, ::Vector{Float64}) with eltype Data:
 Data("Maguila", 130.0)
 Data("Tyson", 120.0)

julia> kg_to_g(x) = 1000*x
kg_to_g (generic function with 1 method)

julia> data.weight .= kg_to_g.(data.weight)
2-element Vector{Float64}:
 130000.0
 120000.0

julia> data
2-element StructArray(::Vector{String}, ::Vector{Float64}) with eltype Data:
 Data("Maguila", 130000.0)
 Data("Tyson", 120000.0)

julia> data = filter(x -> x.weight < 125000, data)
1-element StructArray(::Vector{String}, ::Vector{Float64}) with eltype Data:
 Data("Tyson", 120000.0)

(inplace filter! does not work, though).

viraltux · September 8, 2021, 5:34pm

Answers

1. What rows are missing data?

using DataFrames, SQLdf

df = DataFrame(node=1:4, x=[1,missing,9,5], y=[0,0,missing,8])

@sqldf "select * from df where x is NULL or y is NULL"

2×3 DataFrame
 Row │ node   x        y       
     │ Int64  Int64?   Int64?  
─────┼─────────────────────────
   1 │     2  missing        0
   2 │     3        9  missing

2. Convert temperature data to Celsius.

df = DataFrame(node=1:4, x=[1,0,9,5], F=[0,0,12,8])

@sqldf "select * , (F-32)/1.8 as Celsius from df"

4×4 DataFrame
 Row │ node   x      F      Celsius  
     │ Int64  Int64  Int64  Float64  
─────┼───────────────────────────────
   1 │     1      1      0  -17.7778
   2 │     2      0      0  -17.7778
   3 │     3      9     12  -11.1111
   4 │     4      5      8  -13.3333

3. Filter out rows which are all zeros.

df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])

@sqldf "select * from df where x !=0 and y != 0"

2×3 DataFrame
 Row │ node   x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     3      9     12
   2 │     4      5      8

pdeffebach · September 8, 2021, 5:55pm

One thing to note in all of these replies is OP specifically asked for working with many columns programatically, i.e. transform(df, names(df, r"Temp") .=> ...).

As far as I know, none of the more specialized packages listed head make that easy, including DataFramesMeta.jl. DataFrames’s mini-language is the best at making many modifications in one call.

Nathan_Boyer · September 8, 2021, 6:13pm

Yes, it seems like all the suggestions so far would still require a for loop.

for name in names(df, r"Temp")
  # Transform one column at a time (corrected)
end

It would be nice to be able to broadcast the transformations instead. (I’m not sure if even the former is possible with SQL since it parses a big string.) However, I appreciate seeing these other packages in action nonetheless.

pdeffebach · September 8, 2021, 6:15pm

Not one row at a time, one column at a time.

I don’t see any big issue with using a for loop. The only issue is that you have to use mutating functions, i.e. @rtransform!, which are less safe.

But in general, something like

for name in names(df, r"Temp")
    @rtransform! df $name = fahrenheit_to_celsius($name)
end

is very Stata-esque. And Stata is a battle tested data-cleaning framework that a lot of people enjoy, so it’s not that bad.

Topic		Replies	Views
Problems about dealing with missing values, maybe connected to DataFrames.jl Data question	4	759	December 4, 2018
Query.jl with filtering by missing values doesn't seem to work? General Usage	8	1650	January 23, 2018
Problem with package TableView / Julia usage New to Julia	12	1466	May 27, 2020
What have we learned from DataFrames in Julia? Community poll	4	1649	November 29, 2017
Nullables - why? and how? New to Julia	6	2454	December 19, 2017

Frustrated using DataFrames

Related topics