# Accessing DataFrames - is there a simpler way?

Matlab user here wondering if I am making Julia overcomplicated. I am importing numeric and string values from a database and looking for certain values to use. In Matlab, using Tables, my code looks like this:

`Nsta = max([SP{"KPI_1",par}, SP{"KPI_2",par}])+1;`

In Julia, my code is longer and more complex:

`Nsta = maximum(filter(!isnan,[SP[findfirst(occursin.(["KPI_1"], SP.Name)),par], SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]])) + 1`

I am accessing the DataFrame row using occursin to find the row name. I use findfirst to return a value rather than a vector with one value. And finally I filter out NaNs because if there is a NaN, maximum returns NaN, and I don’t want that. Is there a simpler, clearer way of accessing the DataFrame?

Assuming your data looks like this:

``````julia> SP = DataFrame(:KPI_1=>[5,17,23,33,NaN],
:KPI_2=>[11, NaN, 22, 4,NaN])
5×2 DataFrame
Row │ KPI_1    KPI_2
│ Float64  Float64
─────┼──────────────────
1 │     5.0     11.0
2 │    17.0    NaN
3 │    23.0     22.0
4 │    33.0      4.0
5 │   NaN      NaN
``````

You could evaluate `Nsta` for a single row.

``````julia> max_skipnan(v...; allnans=0) = maximum(x->isnan(x) ? -Inf : x, v;
init=allnans)
max_skipnan (generic function with 1 method)

julia> max_skipnan(SP.KPI_1[par], SP.KPI_2[par]) + 1
18.0
``````

Or add a new column for `Nsta` like this:

``````julia> transform(SP,
[:KPI_1, :KPI_2]=>((kpi1, kpi2)->max_skipnan.(kpi1, kpi2) .+ 1)=>:Nsta)
5×3 DataFrame
Row │ KPI_1    KPI_2    Nsta
│ Float64  Float64  Float64
─────┼───────────────────────────
1 │     5.0     11.0     12.0
2 │    17.0    NaN       18.0
3 │    23.0     22.0     24.0
4 │    33.0      4.0     34.0
5 │   NaN      NaN        1.0
``````
2 Likes

You can index the `row` directly `row["KPI_1"]`, so the `findfirst` stuff is definitely overkill.

I think it’s weird that Matlab skips `NaN`s here. One questions to ask is whether you want to work with `NaN` values or if you should use `missing` values instead. But granted, skipping over `NaN` (or `missing`) is a pain.

DataFramesMeta isn’t well suited for this use-case at the moment. But here is an alternate DataFrames version. You still have to make secondary function

``````julia> function maximum_nonan(t)
t = filter(!isnan, collect(t))
isempty(t) && return NaN
return maximum(t)
end
maximum_nonan (generic function with 1 method)

julia> transform(SP, AsTable([:KPI_1, :KPI_2]) => ByRow(maximum_nonan) => :h)
5×3 DataFrame
Row │ KPI_1    KPI_2    h
│ Float64  Float64  Float64
─────┼───────────────────────────
1 │     5.0     11.0     11.0
2 │    17.0    NaN       17.0
3 │    23.0     22.0     23.0
4 │    33.0      4.0     33.0
5 │   NaN      NaN      NaN
``````
2 Likes

Using pipes:

``````using DataFrames

SP = DataFrame(:KPI_1=>[5,17,23,33,NaN], :KPI_2=>[11,NaN,22,4,NaN])

SP[!,[:KPI_1, :KPI_2]] |>
x -> ifelse.(isnan.(x),-Inf,x) |>
x -> maximum([x[:KPI_1] x[:KPI_2]], dims=2)[:] .+ 1 |>
x -> SP.max_KPI12_plus1 = ifelse.(isinf.(x),NaN,x)

julia> SP
5×3 DataFrame
│ Row │ KPI_1   │ KPI_2   │ max_KPI12_plus1 │
│     │ Float64 │ Float64 │ Float64         │
├─────┼─────────┼─────────┼─────────────────┤
│ 1   │ 5.0     │ 11.0    │ 12.0            │
│ 2   │ 17.0    │ NaN     │ 18.0            │
│ 3   │ 23.0    │ 22.0    │ 24.0            │
│ 4   │ 33.0    │ 4.0     │ 34.0            │
│ 5   │ NaN     │ NaN     │ NaN             │
``````
1 Like

Thanks for the replies! I’m going to have to study them to understand what you are saying. But to clarify, KPI_1 and KPI_2 are ‘row names’ and par is a variable containing the column name. An example of the DataFrame is shown below, though there are many more columns than shown.
A simplified description of what I am trying to do is to return 2 if there is a 2 in the KPI_2 row, otherwise return 1. So I suppose I could simplify my code to:

`if isnan(SP[findfirst(occursin.(["KPI_1"], SP.Name)),par]) 2 else 3 end`

Example table:

`````` Row │ Item                               Name       Vel_B1_FTF  Vel_B1_BSF
│ String                             String     Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────
1 │ Order of Component                 Order         NaN         NaN
2 │ Search Range for highest compone…  searchLow      -0.025      -0.025
3 │ Search Range for highest compone…  searchUpp       0.01        0.01
4 │ Order wrt the component            KPI_1           1.0         1.0
5 │ Order wrt the component            KPI_2         NaN           2.0
6 │ Order wrt the component (Prev+1 …  KPI_N           5.0         5.0
``````

I’ll go though what you guys have said, and see if I can get rid of that `findfirst` function.

This is an odd block of code. Why make a rowvector on the second block?

Why use the `|>` and anonymous functions when you could use `Chain.jl`?

Okay thanks for this. We were definitely mis-understanding what you wanted.

I think you want `combine` or `transform` with a `GroupedDataFrame`. I’m assuming you don’t just want the maximum of the `KPI_1` and `KPI_2` values, but rather want to make a transformation by group. This will do that

``````julia> using DataFramesMeta, Chain;

julia> df = DataFrame(
Item = ["a", "a", "b", "b"],
Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4×3 DataFrame
Row │ Item    Name       Vel_B1_FTF
│ String  String     Float64
─────┼───────────────────────────────
1 │ a       searchLow       NaN
2 │ a       searchUpp         0.1
3 │ b       KPI_1             0.4
4 │ b       KPI_2           NaN

julia> par = :Vel_B1_FTF;

julia> @chain df begin
groupby(:Item)
@transform max_in_group = begin
x = filter(!isnan, cols(par))
isempty(x) && return NaN
maximum(x)
end
end
4×4 DataFrame
Row │ Item    Name       Vel_B1_FTF  max_in_group
│ String  String     Float64     Float64
─────┼─────────────────────────────────────────────
1 │ a       searchLow       NaN             0.1
2 │ a       searchUpp         0.1           0.1
3 │ b       KPI_1             0.4           0.4
4 │ b       KPI_2           NaN             0.4
``````

If you really just want the maximum of `KPI_1` and `KPI_2`, then yeah that’s tough.

``````julia> function get_max(df, vals, namevar, keys)
x = df[!, vals]
n = df[!, namevar]
inds = findall(t -> t in keys, n)
# handle empty collection as needed
maximum(filter(!isnan, x[inds]))
end;

julia> get_max(df, par, :Name, ["KPI_1", "KPI_2"])
0.4
``````

So there are two things that make this tough in Julia, unfortunately. The first is the lack of row indexing, which you can do in Matlab I guess. The second is that Julia propagates `NaN`s in maximum while Matlab does not.

Propagation of `NaN`s in Julia is presumably governed by IEEE rules, so no changing that.

Additionally, `maximum` errors on an empty collection. This might actually be changeable. Maybe we should add a keyword argument for `ifempty` in `maximum`, similar to `init` in `sum`.

2 Likes

KPI_1 will always be 1, KPI_2 will always be 2 or NaN. I want an output of 3 if KPI_2 is 2 or an output of 2 if KPI_2 is NaN. Because of how Matlab processes NaN’s, using max was convenient.

`if isnan(SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]) 2 else 3 end` does what I want. Is there any benefit to creating a function like your `get_max` above rather than using an if-else statement?

The reason for using the findfirst function is that the following line returns a 1-element Vector.

`SP[occursin.(["KPI_2"], SP.Name),par]`

`isnan` throws an error on that, and comparators don’t like comparing 1-element Vectors with numbers not in a vector.

Thinking about it, indexing to the only value in the response removes the need for the `findfirst` function. This may be about as simple a solution as I can come up with without row indexing in DataFrames.

`if isnan(SP[occursin.(["KPI_2"], SP.Name),par]) Nsta = 2 else Nsta = 3 end`

``````KPI_2_exists = !isnan(SP[occursin.(["KPI_2"], SP.Name),par])

if KPI_2_exists == 1
Nsta = 3
else
Nsta = 2
end
``````

Looking at all the responses, it seems that the go-to solution for everything in Julia is to create a function. I’m used to minimize functions in code to keep things all together. Should I be looking to move towards using functions more frequently? Is a function more efficient than the above code?

Yes, definitely use a function. There are performance benefits to using functions in that Julia can infer types better when things are inside a function. Defining a function and then using it in right there is costless, and makes it easier to write code.

Looking at this code, I’m more convinced you want to be using `missing` instead of `NaN`. `missing` is designed to mean “unknown value”, whereas `NaN` is something different.

Yes, `findfirst` is better than `occursin` because it returns an element of the array. (`findfirst` returns an integer index, but `occursin` returns an array of `Bool`s.

Here is another way of writing the above

``````julia> SP = DataFrame(
Item = ["a", "a", "b", "b"],
Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4×3 DataFrame
Row │ Item    Name       Vel_B1_FTF
│ String  String     Float64
─────┼───────────────────────────────
1 │ a       searchLow       NaN
2 │ a       searchUpp         0.1
3 │ b       KPI_1             0.4
4 │ b       KPI_2           NaN

julia> KPI_2_exists = let
ind = findfirst(==("KPI_2"), SP.Name)
!isnan(SP[!,par][ind])
end;

julia> Nsta = KPI_2_exists ? 3 : 2
2
``````
1 Like

Whether the unknown values are missing or NaN is a function of how Julia imports blanks, but I can modify that after I import it. But Julia treats both similarly in comparisons, as far as I can tell.

``````julia> KPI_2_exists = let
ind = findfirst(==("KPI_2"), SP.Name)
!isnan(SP[!,par][ind])
end;
``````

would be equivalent to

``````ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[!,par][ind])
``````

or even

``````ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[ind,par])
``````

except the ind variable disappears in the “let” case?

They should all be equivalent, yeah. Be careful about `!isnan(SP[!,par][ind]`, though. That’s fast, but `!isnan(SP[:, par][ind])` will be slow (the `:` allocates a new vector).

It’s good practice to use `let` for these multi-line blocks so you don’t clutter your namespace with temporary variables.

1 Like

Thanks! I was a little confused about the difference between the ! and : punctuations for DataFrames. And the difference between creating new variables and working on existing ones. I’m a little closer now 1 Like