# Accessing DataFrames - is there a simpler way?

Matlab user here wondering if I am making Julia overcomplicated. I am importing numeric and string values from a database and looking for certain values to use. In Matlab, using Tables, my code looks like this:

`Nsta = max([SP{"KPI_1",par}, SP{"KPI_2",par}])+1;`

In Julia, my code is longer and more complex:

`Nsta = maximum(filter(!isnan,[SP[findfirst(occursin.(["KPI_1"], SP.Name)),par], SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]])) + 1`

I am accessing the DataFrame row using occursin to find the row name. I use findfirst to return a value rather than a vector with one value. And finally I filter out NaNs because if there is a NaN, maximum returns NaN, and I donβt want that. Is there a simpler, clearer way of accessing the DataFrame?

Assuming your data looks like this:

``````julia> SP = DataFrame(:KPI_1=>[5,17,23,33,NaN],
:KPI_2=>[11, NaN, 22, 4,NaN])
5Γ2 DataFrame
Row β KPI_1    KPI_2
β Float64  Float64
ββββββΌββββββββββββββββββ
1 β     5.0     11.0
2 β    17.0    NaN
3 β    23.0     22.0
4 β    33.0      4.0
5 β   NaN      NaN
``````

You could evaluate `Nsta` for a single row.

``````julia> max_skipnan(v...; allnans=0) = maximum(x->isnan(x) ? -Inf : x, v;
init=allnans)
max_skipnan (generic function with 1 method)

julia> max_skipnan(SP.KPI_1[par], SP.KPI_2[par]) + 1
18.0
``````

Or add a new column for `Nsta` like this:

``````julia> transform(SP,
[:KPI_1, :KPI_2]=>((kpi1, kpi2)->max_skipnan.(kpi1, kpi2) .+ 1)=>:Nsta)
5Γ3 DataFrame
Row β KPI_1    KPI_2    Nsta
β Float64  Float64  Float64
ββββββΌβββββββββββββββββββββββββββ
1 β     5.0     11.0     12.0
2 β    17.0    NaN       18.0
3 β    23.0     22.0     24.0
4 β    33.0      4.0     34.0
5 β   NaN      NaN        1.0
``````
2 Likes

You can index the `row` directly `row["KPI_1"]`, so the `findfirst` stuff is definitely overkill.

I think itβs weird that Matlab skips `NaN`s here. One questions to ask is whether you want to work with `NaN` values or if you should use `missing` values instead. But granted, skipping over `NaN` (or `missing`) is a pain.

DataFramesMeta isnβt well suited for this use-case at the moment. But here is an alternate DataFrames version. You still have to make secondary function

``````julia> function maximum_nonan(t)
t = filter(!isnan, collect(t))
isempty(t) && return NaN
return maximum(t)
end
maximum_nonan (generic function with 1 method)

julia> transform(SP, AsTable([:KPI_1, :KPI_2]) => ByRow(maximum_nonan) => :h)
5Γ3 DataFrame
Row β KPI_1    KPI_2    h
β Float64  Float64  Float64
ββββββΌβββββββββββββββββββββββββββ
1 β     5.0     11.0     11.0
2 β    17.0    NaN       17.0
3 β    23.0     22.0     23.0
4 β    33.0      4.0     33.0
5 β   NaN      NaN      NaN
``````
2 Likes

Using pipes:

``````using DataFrames

SP = DataFrame(:KPI_1=>[5,17,23,33,NaN], :KPI_2=>[11,NaN,22,4,NaN])

SP[!,[:KPI_1, :KPI_2]] |>
x -> ifelse.(isnan.(x),-Inf,x) |>
x -> maximum([x[:KPI_1] x[:KPI_2]], dims=2)[:] .+ 1 |>
x -> SP.max_KPI12_plus1 = ifelse.(isinf.(x),NaN,x)

julia> SP
5Γ3 DataFrame
β Row β KPI_1   β KPI_2   β max_KPI12_plus1 β
β     β Float64 β Float64 β Float64         β
βββββββΌββββββββββΌββββββββββΌββββββββββββββββββ€
β 1   β 5.0     β 11.0    β 12.0            β
β 2   β 17.0    β NaN     β 18.0            β
β 3   β 23.0    β 22.0    β 24.0            β
β 4   β 33.0    β 4.0     β 34.0            β
β 5   β NaN     β NaN     β NaN             β
``````
1 Like

Thanks for the replies! Iβm going to have to study them to understand what you are saying. But to clarify, KPI_1 and KPI_2 are βrow namesβ and par is a variable containing the column name. An example of the DataFrame is shown below, though there are many more columns than shown.
A simplified description of what I am trying to do is to return 2 if there is a 2 in the KPI_2 row, otherwise return 1. So I suppose I could simplify my code to:

`if isnan(SP[findfirst(occursin.(["KPI_1"], SP.Name)),par]) 2 else 3 end`

Example table:

`````` Row β Item                               Name       Vel_B1_FTF  Vel_B1_BSF
β String                             String     Float64     Float64
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β Order of Component                 Order         NaN         NaN
2 β Search Range for highest componeβ¦  searchLow      -0.025      -0.025
3 β Search Range for highest componeβ¦  searchUpp       0.01        0.01
4 β Order wrt the component            KPI_1           1.0         1.0
5 β Order wrt the component            KPI_2         NaN           2.0
6 β Order wrt the component (Prev+1 β¦  KPI_N           5.0         5.0
``````

Iβll go though what you guys have said, and see if I can get rid of that `findfirst` function.

This is an odd block of code. Why make a rowvector on the second block?

Why use the `|>` and anonymous functions when you could use `Chain.jl`?

Okay thanks for this. We were definitely mis-understanding what you wanted.

I think you want `combine` or `transform` with a `GroupedDataFrame`. Iβm assuming you donβt just want the maximum of the `KPI_1` and `KPI_2` values, but rather want to make a transformation by group. This will do that

``````julia> using DataFramesMeta, Chain;

julia> df = DataFrame(
Item = ["a", "a", "b", "b"],
Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4Γ3 DataFrame
Row β Item    Name       Vel_B1_FTF
β String  String     Float64
ββββββΌβββββββββββββββββββββββββββββββ
1 β a       searchLow       NaN
2 β a       searchUpp         0.1
3 β b       KPI_1             0.4
4 β b       KPI_2           NaN

julia> par = :Vel_B1_FTF;

julia> @chain df begin
groupby(:Item)
@transform max_in_group = begin
x = filter(!isnan, cols(par))
isempty(x) && return NaN
maximum(x)
end
end
4Γ4 DataFrame
Row β Item    Name       Vel_B1_FTF  max_in_group
β String  String     Float64     Float64
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββ
1 β a       searchLow       NaN             0.1
2 β a       searchUpp         0.1           0.1
3 β b       KPI_1             0.4           0.4
4 β b       KPI_2           NaN             0.4
``````

If you really just want the maximum of `KPI_1` and `KPI_2`, then yeah thatβs tough.

``````julia> function get_max(df, vals, namevar, keys)
x = df[!, vals]
n = df[!, namevar]
inds = findall(t -> t in keys, n)
# handle empty collection as needed
maximum(filter(!isnan, x[inds]))
end;

julia> get_max(df, par, :Name, ["KPI_1", "KPI_2"])
0.4
``````

So there are two things that make this tough in Julia, unfortunately. The first is the lack of row indexing, which you can do in Matlab I guess. The second is that Julia propagates `NaN`s in maximum while Matlab does not.

Propagation of `NaN`s in Julia is presumably governed by IEEE rules, so no changing that.

Additionally, `maximum` errors on an empty collection. This might actually be changeable. Maybe we should add a keyword argument for `ifempty` in `maximum`, similar to `init` in `sum`.

2 Likes

KPI_1 will always be 1, KPI_2 will always be 2 or NaN. I want an output of 3 if KPI_2 is 2 or an output of 2 if KPI_2 is NaN. Because of how Matlab processes NaNβs, using max was convenient.

`if isnan(SP[findfirst(occursin.(["KPI_2"], SP.Name)),par]) 2 else 3 end` does what I want. Is there any benefit to creating a function like your `get_max` above rather than using an if-else statement?

The reason for using the findfirst function is that the following line returns a 1-element Vector.

`SP[occursin.(["KPI_2"], SP.Name),par]`

`isnan` throws an error on that, and comparators donβt like comparing 1-element Vectors with numbers not in a vector.

Thinking about it, indexing to the only value in the response removes the need for the `findfirst` function. This may be about as simple a solution as I can come up with without row indexing in DataFrames.

`if isnan(SP[occursin.(["KPI_2"], SP.Name),par][1]) Nsta = 2 else Nsta = 3 end`

``````KPI_2_exists = !isnan(SP[occursin.(["KPI_2"], SP.Name),par][1])

if KPI_2_exists == 1
Nsta = 3
else
Nsta = 2
end
``````

Looking at all the responses, it seems that the go-to solution for everything in Julia is to create a function. Iβm used to minimize functions in code to keep things all together. Should I be looking to move towards using functions more frequently? Is a function more efficient than the above code?

Yes, definitely use a function. There are performance benefits to using functions in that Julia can infer types better when things are inside a function. Defining a function and then using it in right there is costless, and makes it easier to write code.

Looking at this code, Iβm more convinced you want to be using `missing` instead of `NaN`. `missing` is designed to mean βunknown valueβ, whereas `NaN` is something different.

Yes, `findfirst` is better than `occursin` because it returns an element of the array. (`findfirst` returns an integer index, but `occursin` returns an array of `Bool`s.

Here is another way of writing the above

``````julia> SP = DataFrame(
Item = ["a", "a", "b", "b"],
Name = ["searchLow", "searchUpp", "KPI_1", "KPI_2"],
Vel_B1_FTF = [NaN, 0.1, 0.4, NaN])
4Γ3 DataFrame
Row β Item    Name       Vel_B1_FTF
β String  String     Float64
ββββββΌβββββββββββββββββββββββββββββββ
1 β a       searchLow       NaN
2 β a       searchUpp         0.1
3 β b       KPI_1             0.4
4 β b       KPI_2           NaN

julia> KPI_2_exists = let
ind = findfirst(==("KPI_2"), SP.Name)
!isnan(SP[!,par][ind])
end;

julia> Nsta = KPI_2_exists ? 3 : 2
2
``````
1 Like

Whether the unknown values are missing or NaN is a function of how Julia imports blanks, but I can modify that after I import it. But Julia treats both similarly in comparisons, as far as I can tell.

``````julia> KPI_2_exists = let
ind = findfirst(==("KPI_2"), SP.Name)
!isnan(SP[!,par][ind])
end;
``````

would be equivalent to

``````ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[!,par][ind])
``````

or even

``````ind = findfirst(==("KPI_2"), SP.Name)
KPI_2_exists = !isnan(SP[ind,par])
``````

except the ind variable disappears in the βletβ case?

They should all be equivalent, yeah. Be careful about `!isnan(SP[!,par][ind]`, though. Thatβs fast, but `!isnan(SP[:, par][ind])` will be slow (the `:` allocates a new vector).

Itβs good practice to use `let` for these multi-line blocks so you donβt clutter your namespace with temporary variables.

1 Like

Thanks! I was a little confused about the difference between the ! and : punctuations for DataFrames. And the difference between creating new variables and working on existing ones. Iβm a little closer now

1 Like