Accessing DataFrames - is there a simpler way?

Matlab user here wondering if I am making Julia overcomplicated. I am importing numeric and string values from a database and looking for certain values to use. In Matlab, using Tables, my code looks like this:

Nsta = max([SP{"KPI_1",par}, SP{"KPI_2",par}])+1;

In Julia, my code is longer and more complex:

Nsta = maximum(filter(!isnan,[SP[findfirst(occursin.(["KPI_1"], SP.Name)),par], SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]])) + 1

I am accessing the DataFrame row using occursin to find the row name. I use findfirst to return a value rather than a vector with one value. And finally I filter out NaNs because if there is a NaN, maximum returns NaN, and I don’t want that. Is there a simpler, clearer way of accessing the DataFrame?

Assuming your data looks like this:

julia> SP = DataFrame(:KPI_1=>[5,17,23,33,NaN],
                      :KPI_2=>[11, NaN, 22, 4,NaN])
5Γ—2 DataFrame
 Row β”‚ KPI_1    KPI_2
     β”‚ Float64  Float64
─────┼──────────────────
   1 β”‚     5.0     11.0
   2 β”‚    17.0    NaN
   3 β”‚    23.0     22.0
   4 β”‚    33.0      4.0
   5 β”‚   NaN      NaN

You could evaluate Nsta for a single row.

julia> max_skipnan(v...; allnans=0) = maximum(x->isnan(x) ? -Inf : x, v;
                                              init=allnans)
max_skipnan (generic function with 1 method)

julia> max_skipnan(SP.KPI_1[par], SP.KPI_2[par]) + 1
18.0

Or add a new column for Nsta like this:

julia> transform(SP,
 [:KPI_1, :KPI_2]=>((kpi1, kpi2)->max_skipnan.(kpi1, kpi2) .+ 1)=>:Nsta)
5Γ—3 DataFrame
 Row β”‚ KPI_1    KPI_2    Nsta
     β”‚ Float64  Float64  Float64
─────┼───────────────────────────
   1 β”‚     5.0     11.0     12.0
   2 β”‚    17.0    NaN       18.0
   3 β”‚    23.0     22.0     24.0
   4 β”‚    33.0      4.0     34.0
   5 β”‚   NaN      NaN        1.0
2 Likes

You can index the row directly row["KPI_1"], so the findfirst stuff is definitely overkill.

I think it’s weird that Matlab skips NaNs here. One questions to ask is whether you want to work with NaN values or if you should use missing values instead. But granted, skipping over NaN (or missing) is a pain.

DataFramesMeta isn’t well suited for this use-case at the moment. But here is an alternate DataFrames version. You still have to make secondary function

julia> function maximum_nonan(t)
           t = filter(!isnan, collect(t))
           isempty(t) && return NaN
           return maximum(t)
           end
maximum_nonan (generic function with 1 method)

julia> transform(SP, AsTable([:KPI_1, :KPI_2]) => ByRow(maximum_nonan) => :h)
5Γ—3 DataFrame
 Row β”‚ KPI_1    KPI_2    h
     β”‚ Float64  Float64  Float64
─────┼───────────────────────────
   1 β”‚     5.0     11.0     11.0
   2 β”‚    17.0    NaN       17.0
   3 β”‚    23.0     22.0     23.0
   4 β”‚    33.0      4.0     33.0
   5 β”‚   NaN      NaN      NaN
2 Likes

Using pipes:

using DataFrames

SP = DataFrame(:KPI_1=>[5,17,23,33,NaN], :KPI_2=>[11,NaN,22,4,NaN])

SP[!,[:KPI_1, :KPI_2]] |>
   x -> ifelse.(isnan.(x),-Inf,x) |>
   x -> maximum([x[:KPI_1] x[:KPI_2]], dims=2)[:] .+ 1 |>
   x -> SP.max_KPI12_plus1 = ifelse.(isinf.(x),NaN,x)

julia> SP
5Γ—3 DataFrame
β”‚ Row β”‚ KPI_1   β”‚ KPI_2   β”‚ max_KPI12_plus1 β”‚
β”‚     β”‚ Float64 β”‚ Float64 β”‚ Float64         β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 5.0     β”‚ 11.0    β”‚ 12.0            β”‚
β”‚ 2   β”‚ 17.0    β”‚ NaN     β”‚ 18.0            β”‚
β”‚ 3   β”‚ 23.0    β”‚ 22.0    β”‚ 24.0            β”‚
β”‚ 4   β”‚ 33.0    β”‚ 4.0     β”‚ 34.0            β”‚
β”‚ 5   β”‚ NaN     β”‚ NaN     β”‚ NaN             β”‚
1 Like

Thanks for the replies! I’m going to have to study them to understand what you are saying. But to clarify, KPI_1 and KPI_2 are β€˜row names’ and par is a variable containing the column name. An example of the DataFrame is shown below, though there are many more columns than shown.
A simplified description of what I am trying to do is to return 2 if there is a 2 in the KPI_2 row, otherwise return 1. So I suppose I could simplify my code to:

if isnan(SP[findfirst(occursin.(["KPI_1"], SP.Name)),par]) 2 else 3 end

Example table:

 Row β”‚ Item                               Name       Vel_B1_FTF  Vel_B1_BSF 
     β”‚ String                             String     Float64     Float64    
─────┼──────────────────────────────────────────────────────────────────────
   1 β”‚ Order of Component                 Order         NaN         NaN
   2 β”‚ Search Range for highest compone…  searchLow      -0.025      -0.025
   3 β”‚ Search Range for highest compone…  searchUpp       0.01        0.01
   4 β”‚ Order wrt the component            KPI_1           1.0         1.0
   5 β”‚ Order wrt the component            KPI_2         NaN           2.0
   6 β”‚ Order wrt the component (Prev+1 …  KPI_N           5.0         5.0

I’ll go though what you guys have said, and see if I can get rid of that findfirst function.

This is an odd block of code. Why make a rowvector on the second block?

Why use the |> and anonymous functions when you could use Chain.jl?

Okay thanks for this. We were definitely mis-understanding what you wanted.

I think you want combine or transform with a GroupedDataFrame. I’m assuming you don’t just want the maximum of the KPI_1 and KPI_2 values, but rather want to make a transformation by group. This will do that

julia> using DataFramesMeta, Chain;

julia> df = DataFrame(
           Item = ["a", "a", "b", "b"],
           Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
           Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4Γ—3 DataFrame
 Row β”‚ Item    Name       Vel_B1_FTF 
     β”‚ String  String     Float64    
─────┼───────────────────────────────
   1 β”‚ a       searchLow       NaN
   2 β”‚ a       searchUpp         0.1
   3 β”‚ b       KPI_1             0.4
   4 β”‚ b       KPI_2           NaN

julia> par = :Vel_B1_FTF;

julia> @chain df begin 
           groupby(:Item)
           @transform max_in_group = begin 
               x = filter(!isnan, cols(par))
               isempty(x) && return NaN
               maximum(x)
           end
       end
4Γ—4 DataFrame
 Row β”‚ Item    Name       Vel_B1_FTF  max_in_group 
     β”‚ String  String     Float64     Float64      
─────┼─────────────────────────────────────────────
   1 β”‚ a       searchLow       NaN             0.1
   2 β”‚ a       searchUpp         0.1           0.1
   3 β”‚ b       KPI_1             0.4           0.4
   4 β”‚ b       KPI_2           NaN             0.4

If you really just want the maximum of KPI_1 and KPI_2, then yeah that’s tough.

julia> function get_max(df, vals, namevar, keys)
           x = df[!, vals]
           n = df[!, namevar]
           inds = findall(t -> t in keys, n)
           # handle empty collection as needed
           maximum(filter(!isnan, x[inds]))
       end;

julia> get_max(df, par, :Name, ["KPI_1", "KPI_2"])
0.4

So there are two things that make this tough in Julia, unfortunately. The first is the lack of row indexing, which you can do in Matlab I guess. The second is that Julia propagates NaNs in maximum while Matlab does not.

Propagation of NaNs in Julia is presumably governed by IEEE rules, so no changing that.

Additionally, maximum errors on an empty collection. This might actually be changeable. Maybe we should add a keyword argument for ifempty in maximum, similar to init in sum.

2 Likes

KPI_1 will always be 1, KPI_2 will always be 2 or NaN. I want an output of 3 if KPI_2 is 2 or an output of 2 if KPI_2 is NaN. Because of how Matlab processes NaN’s, using max was convenient.

if isnan(SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]) 2 else 3 end does what I want. Is there any benefit to creating a function like your get_max above rather than using an if-else statement?

The reason for using the findfirst function is that the following line returns a 1-element Vector.

SP[occursin.(["KPI_2"], SP.Name),par]

isnan throws an error on that, and comparators don’t like comparing 1-element Vectors with numbers not in a vector.

Thinking about it, indexing to the only value in the response removes the need for the findfirst function. This may be about as simple a solution as I can come up with without row indexing in DataFrames.

if isnan(SP[occursin.(["KPI_2"], SP.Name),par][1]) Nsta = 2 else Nsta = 3 end

Or, more readable:

KPI_2_exists = !isnan(SP[occursin.(["KPI_2"], SP.Name),par][1])

if KPI_2_exists == 1
    Nsta = 3
else
    Nsta = 2
end

Looking at all the responses, it seems that the go-to solution for everything in Julia is to create a function. I’m used to minimize functions in code to keep things all together. Should I be looking to move towards using functions more frequently? Is a function more efficient than the above code?

Yes, definitely use a function. There are performance benefits to using functions in that Julia can infer types better when things are inside a function. Defining a function and then using it in right there is costless, and makes it easier to write code.

Looking at this code, I’m more convinced you want to be using missing instead of NaN. missing is designed to mean β€œunknown value”, whereas NaN is something different.

Yes, findfirst is better than occursin because it returns an element of the array. (findfirst returns an integer index, but occursin returns an array of Bools.

Here is another way of writing the above

julia> SP = DataFrame(
           Item = ["a", "a", "b", "b"],
           Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
           Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4Γ—3 DataFrame
 Row β”‚ Item    Name       Vel_B1_FTF 
     β”‚ String  String     Float64    
─────┼───────────────────────────────
   1 β”‚ a       searchLow       NaN
   2 β”‚ a       searchUpp         0.1
   3 β”‚ b       KPI_1             0.4
   4 β”‚ b       KPI_2           NaN

julia> KPI_2_exists = let
           ind = findfirst(==("KPI_2"), SP.Name)
           !isnan(SP[!,par][ind])
       end;

julia> Nsta = KPI_2_exists ? 3 : 2
2
1 Like

Whether the unknown values are missing or NaN is a function of how Julia imports blanks, but I can modify that after I import it. But Julia treats both similarly in comparisons, as far as I can tell.

julia> KPI_2_exists = let
           ind = findfirst(==("KPI_2"), SP.Name)
           !isnan(SP[!,par][ind])
       end;

would be equivalent to

ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[!,par][ind])

or even

ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[ind,par])

except the ind variable disappears in the β€œlet” case?

They should all be equivalent, yeah. Be careful about !isnan(SP[!,par][ind], though. That’s fast, but !isnan(SP[:, par][ind]) will be slow (the : allocates a new vector).

It’s good practice to use let for these multi-line blocks so you don’t clutter your namespace with temporary variables.

1 Like

Thanks! I was a little confused about the difference between the ! and : punctuations for DataFrames. And the difference between creating new variables and working on existing ones. I’m a little closer now :slight_smile:

1 Like