Accessing DataFrames - is there a simpler way?

dkzwart · April 23, 2021, 8:13pm

Matlab user here wondering if I am making Julia overcomplicated. I am importing numeric and string values from a database and looking for certain values to use. In Matlab, using Tables, my code looks like this:

Nsta = max([SP{"KPI_1",par}, SP{"KPI_2",par}])+1;

In Julia, my code is longer and more complex:

Nsta = maximum(filter(!isnan,[SP[findfirst(occursin.(["KPI_1"], SP.Name)),par], SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]])) + 1

I am accessing the DataFrame row using occursin to find the row name. I use findfirst to return a value rather than a vector with one value. And finally I filter out NaNs because if there is a NaN, maximum returns NaN, and I don’t want that. Is there a simpler, clearer way of accessing the DataFrame?

contradict · April 23, 2021, 10:12pm

Assuming your data looks like this:

julia> SP = DataFrame(:KPI_1=>[5,17,23,33,NaN],
                      :KPI_2=>[11, NaN, 22, 4,NaN])
5×2 DataFrame
 Row │ KPI_1    KPI_2
     │ Float64  Float64
─────┼──────────────────
   1 │     5.0     11.0
   2 │    17.0    NaN
   3 │    23.0     22.0
   4 │    33.0      4.0
   5 │   NaN      NaN

You could evaluate Nsta for a single row.

julia> max_skipnan(v...; allnans=0) = maximum(x->isnan(x) ? -Inf : x, v;
                                              init=allnans)
max_skipnan (generic function with 1 method)

julia> max_skipnan(SP.KPI_1[par], SP.KPI_2[par]) + 1
18.0

Or add a new column for Nsta like this:

julia> transform(SP,
 [:KPI_1, :KPI_2]=>((kpi1, kpi2)->max_skipnan.(kpi1, kpi2) .+ 1)=>:Nsta)
5×3 DataFrame
 Row │ KPI_1    KPI_2    Nsta
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     5.0     11.0     12.0
   2 │    17.0    NaN       18.0
   3 │    23.0     22.0     24.0
   4 │    33.0      4.0     34.0
   5 │   NaN      NaN        1.0

pdeffebach · April 23, 2021, 10:40pm

You can index the row directly row["KPI_1"], so the findfirst stuff is definitely overkill.

I think it’s weird that Matlab skips NaNs here. One questions to ask is whether you want to work with NaN values or if you should use missing values instead. But granted, skipping over NaN (or missing) is a pain.

DataFramesMeta isn’t well suited for this use-case at the moment. But here is an alternate DataFrames version. You still have to make secondary function

julia> function maximum_nonan(t)
           t = filter(!isnan, collect(t))
           isempty(t) && return NaN
           return maximum(t)
           end
maximum_nonan (generic function with 1 method)

julia> transform(SP, AsTable([:KPI_1, :KPI_2]) => ByRow(maximum_nonan) => :h)
5×3 DataFrame
 Row │ KPI_1    KPI_2    h
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     5.0     11.0     11.0
   2 │    17.0    NaN       17.0
   3 │    23.0     22.0     23.0
   4 │    33.0      4.0     33.0
   5 │   NaN      NaN      NaN

rafael.guerra · April 24, 2021, 12:17am

Using pipes:

using DataFrames

SP = DataFrame(:KPI_1=>[5,17,23,33,NaN], :KPI_2=>[11,NaN,22,4,NaN])

SP[!,[:KPI_1, :KPI_2]] |>
   x -> ifelse.(isnan.(x),-Inf,x) |>
   x -> maximum([x[:KPI_1] x[:KPI_2]], dims=2)[:] .+ 1 |>
   x -> SP.max_KPI12_plus1 = ifelse.(isinf.(x),NaN,x)

julia> SP
5×3 DataFrame
│ Row │ KPI_1   │ KPI_2   │ max_KPI12_plus1 │
│     │ Float64 │ Float64 │ Float64         │
├─────┼─────────┼─────────┼─────────────────┤
│ 1   │ 5.0     │ 11.0    │ 12.0            │
│ 2   │ 17.0    │ NaN     │ 18.0            │
│ 3   │ 23.0    │ 22.0    │ 24.0            │
│ 4   │ 33.0    │ 4.0     │ 34.0            │
│ 5   │ NaN     │ NaN     │ NaN             │

dkzwart · April 24, 2021, 11:15am

Thanks for the replies! I’m going to have to study them to understand what you are saying. But to clarify, KPI_1 and KPI_2 are ‘row names’ and par is a variable containing the column name. An example of the DataFrame is shown below, though there are many more columns than shown.
A simplified description of what I am trying to do is to return 2 if there is a 2 in the KPI_2 row, otherwise return 1. So I suppose I could simplify my code to:

if isnan(SP[findfirst(occursin.(["KPI_1"], SP.Name)),par]) 2 else 3 end

Example table:

 Row │ Item                               Name       Vel_B1_FTF  Vel_B1_BSF 
     │ String                             String     Float64     Float64    
─────┼──────────────────────────────────────────────────────────────────────
   1 │ Order of Component                 Order         NaN         NaN
   2 │ Search Range for highest compone…  searchLow      -0.025      -0.025
   3 │ Search Range for highest compone…  searchUpp       0.01        0.01
   4 │ Order wrt the component            KPI_1           1.0         1.0
   5 │ Order wrt the component            KPI_2         NaN           2.0
   6 │ Order wrt the component (Prev+1 …  KPI_N           5.0         5.0

I’ll go though what you guys have said, and see if I can get rid of that findfirst function.

pdeffebach · April 24, 2021, 12:10pm

This is an odd block of code. Why make a rowvector on the second block?

Why use the |> and anonymous functions when you could use Chain.jl?

pdeffebach · April 24, 2021, 12:31pm

Okay thanks for this. We were definitely mis-understanding what you wanted.

I think you want combine or transform with a GroupedDataFrame. I’m assuming you don’t just want the maximum of the KPI_1 and KPI_2 values, but rather want to make a transformation by group. This will do that

julia> using DataFramesMeta, Chain;

julia> df = DataFrame(
           Item = ["a", "a", "b", "b"],
           Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
           Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4×3 DataFrame
 Row │ Item    Name       Vel_B1_FTF 
     │ String  String     Float64    
─────┼───────────────────────────────
   1 │ a       searchLow       NaN
   2 │ a       searchUpp         0.1
   3 │ b       KPI_1             0.4
   4 │ b       KPI_2           NaN

julia> par = :Vel_B1_FTF;

julia> @chain df begin 
           groupby(:Item)
           @transform max_in_group = begin 
               x = filter(!isnan, cols(par))
               isempty(x) && return NaN
               maximum(x)
           end
       end
4×4 DataFrame
 Row │ Item    Name       Vel_B1_FTF  max_in_group 
     │ String  String     Float64     Float64      
─────┼─────────────────────────────────────────────
   1 │ a       searchLow       NaN             0.1
   2 │ a       searchUpp         0.1           0.1
   3 │ b       KPI_1             0.4           0.4
   4 │ b       KPI_2           NaN             0.4

If you really just want the maximum of KPI_1 and KPI_2, then yeah that’s tough.

julia> function get_max(df, vals, namevar, keys)
           x = df[!, vals]
           n = df[!, namevar]
           inds = findall(t -> t in keys, n)
           # handle empty collection as needed
           maximum(filter(!isnan, x[inds]))
       end;

julia> get_max(df, par, :Name, ["KPI_1", "KPI_2"])
0.4

So there are two things that make this tough in Julia, unfortunately. The first is the lack of row indexing, which you can do in Matlab I guess. The second is that Julia propagates NaNs in maximum while Matlab does not.

Propagation of NaNs in Julia is presumably governed by IEEE rules, so no changing that.

Additionally, maximum errors on an empty collection. This might actually be changeable. Maybe we should add a keyword argument for ifempty in maximum, similar to init in sum.

dkzwart · April 24, 2021, 1:14pm

KPI_1 will always be 1, KPI_2 will always be 2 or NaN. I want an output of 3 if KPI_2 is 2 or an output of 2 if KPI_2 is NaN. Because of how Matlab processes NaN’s, using max was convenient.

if isnan(SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]) 2 else 3 end does what I want. Is there any benefit to creating a function like your get_max above rather than using an if-else statement?

The reason for using the findfirst function is that the following line returns a 1-element Vector.

SP[occursin.(["KPI_2"], SP.Name),par]

isnan throws an error on that, and comparators don’t like comparing 1-element Vectors with numbers not in a vector.

Thinking about it, indexing to the only value in the response removes the need for the findfirst function. This may be about as simple a solution as I can come up with without row indexing in DataFrames.

if isnan(SP[occursin.(["KPI_2"], SP.Name),par][1]) Nsta = 2 else Nsta = 3 end

Or, more readable:

KPI_2_exists = !isnan(SP[occursin.(["KPI_2"], SP.Name),par][1])

if KPI_2_exists == 1
    Nsta = 3
else
    Nsta = 2
end

Looking at all the responses, it seems that the go-to solution for everything in Julia is to create a function. I’m used to minimize functions in code to keep things all together. Should I be looking to move towards using functions more frequently? Is a function more efficient than the above code?

pdeffebach · April 24, 2021, 1:21pm

Yes, definitely use a function. There are performance benefits to using functions in that Julia can infer types better when things are inside a function. Defining a function and then using it in right there is costless, and makes it easier to write code.

Looking at this code, I’m more convinced you want to be using missing instead of NaN. missing is designed to mean “unknown value”, whereas NaN is something different.

Yes, findfirst is better than occursin because it returns an element of the array. (findfirst returns an integer index, but occursin returns an array of Bools.

Here is another way of writing the above

julia> SP = DataFrame(
           Item = ["a", "a", "b", "b"],
           Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
           Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4×3 DataFrame
 Row │ Item    Name       Vel_B1_FTF 
     │ String  String     Float64    
─────┼───────────────────────────────
   1 │ a       searchLow       NaN
   2 │ a       searchUpp         0.1
   3 │ b       KPI_1             0.4
   4 │ b       KPI_2           NaN

julia> KPI_2_exists = let
           ind = findfirst(==("KPI_2"), SP.Name)
           !isnan(SP[!,par][ind])
       end;

julia> Nsta = KPI_2_exists ? 3 : 2
2

dkzwart · April 26, 2021, 3:14pm

Whether the unknown values are missing or NaN is a function of how Julia imports blanks, but I can modify that after I import it. But Julia treats both similarly in comparisons, as far as I can tell.

julia> KPI_2_exists = let
           ind = findfirst(==("KPI_2"), SP.Name)
           !isnan(SP[!,par][ind])
       end;

would be equivalent to

ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[!,par][ind])

or even

ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[ind,par])

except the ind variable disappears in the “let” case?

pdeffebach · April 26, 2021, 3:19pm

They should all be equivalent, yeah. Be careful about !isnan(SP[!,par][ind], though. That’s fast, but !isnan(SP[:, par][ind]) will be slow (the : allocates a new vector).

It’s good practice to use let for these multi-line blocks so you don’t clutter your namespace with temporary variables.

dkzwart · April 26, 2021, 3:39pm

Thanks! I was a little confused about the difference between the ! and : punctuations for DataFrames. And the difference between creating new variables and working on existing ones. I’m a little closer now

Topic		Replies	Views
Index of max values of N columns in a DataFrame without manually entering each column New to Julia question , indexing , dataframes , maxima	15	1512	January 7, 2023
Simplifying this dataframe aggregation operation General Usage	19	2265	September 9, 2018
Row wise median for julia dataframes Data dataframes	18	642	November 30, 2023
Get the maximum value by row in a DataFrame containing missing values Data statistics , benchmark , dataframes , missing-values	10	170	August 30, 2024
Iterate over all numeric columns in DataFrames Data	21	4857	February 11, 2018

Accessing DataFrames - is there a simpler way?

Related topics