Apply interpolation functions on columns of a dataframe

Hello,

I’m a beginner in Julia.
I have:

  1. a dataframe with 4 columns and 10 rows
  2. three functions u1, u2, u3
  3. a vector of 3 weights

I want to write a function that takes the dataframe, the three functions, and the weights as arguments and returns a vector built as follows:

  • we apply the function u1 to the values of column 2, the function u2 to the values of column 3, and the function u3 to the values of column 4
  • then, for each row, we take the weighted sum of the three results using the weight vector w
  • so the result is a vector with 10 values
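
A minimal sketch of the computation described above. Plain vectors stand in for columns 2 to 4 of the dataframe, and u1, u2, u3 and the helper weighted_scores are hypothetical stand-ins, not the actual functions from the question:

```julia
# Hypothetical stand-ins for u1, u2, u3 and for columns 2 to 4 of the dataframe.
u1(x) = 2x
u2(x) = x + 1
u3(x) = x^2

col2 = [1.0, 2.0, 3.0]
col3 = [0.0, 1.0, 2.0]
col4 = [1.0, 1.0, 2.0]
w = [0.5, 0.2, 0.3]

# Apply each function elementwise to its column, then add up the
# weighted columns, which gives one value per row.
function weighted_scores(cols, fs, w)
    return sum(w[i] .* fs[i].(cols[i]) for i in eachindex(w))
end

scores = weighted_scores((col2, col3, col4), (u1, u2, u3), w)
```

With a real dataframe, the columns could be passed as df[:, 2], df[:, 3], df[:, 4] instead of the stand-in vectors.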

I tried using transform (with help from the community), but it doesn’t work with functions obtained by linear interpolation:

using Interpolations
x1=sort(vf[1][:,2])
y1=reverse(vf[1][:,1])
f1=LinearInterpolation(x1,y1)

where vf[1] is a 3×2 matrix.

Do you have any ideas?
Thanks for your help.

What does the error say?

Can you please provide a minimum working example?

LinearInterpolation does not produce a Function but a functor (a callable object). To turn f1 into a function, use an anonymous function wrapper x -> f1(x) or the composition identity∘f1.
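
A small sketch of the difference, using only Base Julia. The struct Scale is a hypothetical stand-in for the interpolation object, not the type that Interpolations.jl returns:

```julia
# Hypothetical callable struct standing in for an interpolation object.
struct Scale
    factor::Float64
end

# Making the struct callable turns its instances into functors.
(s::Scale)(x) = s.factor * x

s = Scale(2.0)

wrapped  = x -> s(x)      # anonymous function wrapper
composed = identity ∘ s   # composition

# The functor can be called, but it is not a subtype of Function,
# while both wrappers are.
println(s(3.0))
println(s isa Function)
println(wrapped isa Function)
println(composed isa Function)
```

Both wrappers call the original object underneath, so they return the same values as s itself.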

Two things could be done:

  1. Allow functors in DataFrames.jl (I am hesitant, as it would make it even harder for users to reason about the transformation minilanguage).
  2. Ask the maintainers of Interpolations.jl to make the object returned by LinearInterpolation a Function (I do not know the details of why it is not one).

This is interesting. I will see if there is something I can do in DataFramesMeta that helps this.

In @transform you would do something along the lines of

@transform df :y = identity(f1(...))

It doesn’t work with identity

using DataFrames, DataFramesMeta, Interpolations

df = DataFrame(u1 = rand(10), u2 = rand(10), u3 = rand(10))

vf = [[0 25000; 0.5 10000; 1 8000],
    [0 32; 0.5 29; 1 26],
    [0 45; 0.5 37; 1 30],
    [0 0; 0.5 2; 1 4],
    [0 0; 0.5 3; 1 4]];

x1=sort(vf[1][:,2]);
y1=reverse(vf[1][:,1]);
f1=LinearInterpolation(x1,y1);

x2=sort(vf[2][:,2]);
y2=reverse(vf[2][:,1]);
f2=LinearInterpolation(x2,y2);

x3=sort(vf[3][:,2]);
y3=reverse(vf[3][:,1]);
f3=LinearInterpolation(x3,y3);

w = [.5, .2, .3];

@transform df :z = w[1] * identity(f1(:u1)) + w[2] * identity(f2(:u2)) + w[3] * identity(f3(:u3))

Here is the error

BoundsError: attempt to access 3-element extrapolate(interpolate((::Vector{Float64},), ::Vector{Float64}, Gridded(Linear())), Throw()) with element type Float64 at index [0.9017848022215782]

Thanks in advance.

Sorry, the Functor issue was not the problem. (We couldn’t tell because you did not provide an MWE at first).

It seems there is something about LinearInterpolation which you don’t understand, and I don’t either. The error has nothing to do with DataFrames.

julia> f1(df.u1)
ERROR: BoundsError: attempt to access 3-element extrapolate(interpolate((::Vector{Float64},), ::Vector{Float64}, Gridded(Linear())), Throw()) with element type Float64 at index [0.2256458588860888]

Maybe someone with better knowledge of linear interpolations can help.

You’re just evaluating the interpolant outside the grid provided:

julia> x1
3-element Vector{Float64}:
  8000.0
 10000.0
 25000.0

julia> y1
3-element Vector{Float64}:
 1.0
 0.5
 0.0

julia> f1=LinearInterpolation(x1,y1);

julia> f1(8_000)
1.0

julia> f1(10_000)
0.5

julia> f1(9_000)
0.75

julia> f1(1)
ERROR: BoundsError: attempt to access 3-element extrapolate(interpolate((::Vector{Float64},), ::Vector{Float64}, Gridded(Linear())), Throw()) with element type Float64 at index [1]

If you want to extrapolate, you need to be explicit about how:

julia> f2 = LinearInterpolation(x1, y1, extrapolation_bc = Line());

julia> f2(1)
2.99975

(this is all covered in the first example of the docs here)
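
To connect this to the MWE: the values in df.u1 come from rand and lie between 0 and 1, while the grid x1 spans 8000 to 25000, so every evaluation falls outside the grid. The sketch below compares the default behaviour with two extrapolation choices, Flat() and Line(). The names f_throw, f_flat, f_line and the helper inside are made up for this illustration:

```julia
using Interpolations

x = [8000.0, 10000.0, 25000.0]   # knots from the example above
y = [1.0, 0.5, 0.0]

f_throw = LinearInterpolation(x, y)                              # default: error outside the grid
f_flat  = LinearInterpolation(x, y, extrapolation_bc = Flat())   # hold the edge value
f_line  = LinearInterpolation(x, y, extrapolation_bc = Line())   # extend the edge segment

# Hypothetical helper: is v inside the range covered by the knots?
inside(v) = first(x) <= v <= last(x)

println(inside(0.9))      # a value like those in df.u1
println(inside(9000.0))
println(f_flat(1.0))
println(f_line(1.0))
```

Whether extrapolating is the right fix depends on the data. If the column values were meant to lie on the other axis of the interpolation, building the interpolant with the knots and values swapped may be what is actually needed.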


You’re right!
Thanks!

Why doesn’t it work inside a function?

function values(data, w)
    data_temp = deepcopy(data)
    @transform data_temp :value=w[1]*f1(:u1)+w[2]*f2(:u2)+w[3]*f3(:u3)
    return data_temp
end

There is no error but nothing is returned.
Thanks for your help.

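You don’t need the deepcopy. @transform already makes a copy: it returns a new data frame rather than modifying data_temp, so its result has to be assigned or returned. You probably want: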

function values(data, w)
    data_temp = @transform data :value=w[1]*f1(:u1)+w[2]*f2(:u2)+w[3]*f3(:u3)
    return data_temp
end
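
The same distinction exists in plain DataFrames.jl between transform, which returns a new data frame, and transform!, which modifies its argument in place (DataFramesMeta has a matching @transform! macro). A minimal sketch with a made-up single-column data frame:

```julia
using DataFrames

df = DataFrame(u1 = [1.0, 2.0, 3.0])

# transform returns a new data frame; df itself is left untouched.
out = transform(df, :u1 => ByRow(x -> 2x) => :value)

# transform! adds the column to its argument in place.
df2 = copy(df)
transform!(df2, :u1 => ByRow(x -> 2x) => :value)

println(names(df))    # the original has no :value column
println(names(out))
println(names(df2))
```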

Thanks a lot for your patience and your help !

On Stack Overflow I have explained the functor vs. function issue I mentioned above, with an MWE:
https://stackoverflow.com/questions/69925933/linearinterpolation-not-working-with-transform-in-dataframes-jl

Objects such as li are called functors in Julia and sometimes their authors opt-out of making them a subtype of Function

Why do you say that authors opt out of subtyping Function? As I understand it, it’s exactly the opposite, i.e. opt-in: one has to explicitly write struct F <: Function; this subtyping is not automatic whenever function (f::F)(args) is defined.

Functors behave the same as equivalent functions in the vast majority of places in Julia, so there are typically few reasons to subtype Function. There are exceptions of course; also see a recent discussion of subtyping and its possible drawbacks: “Consider subtyping `Function`”, Issue #37 in JuliaObjects/Accessors.jl on GitHub.

You are right. I should have said “do not opt in, which is required by DataFrames.jl”.

Your comments are valid, but the question was specifically in the DataFrames.jl context. In this context functions like transform use dispatch to determine their behavior.
In particular, as explained in the Stack Overflow answer “LinearInterpolation not working with transform in DataFrames.jl”, only Base.Callable objects are considered to be transformation functions. We cannot change this rule, as dispatch is the only way to decide how an arbitrary object passed to a function like transform should be handled.
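
A rough illustration of this kind of dispatch, using only Base Julia. The function classify and the structs Shift and ShiftFn are hypothetical and are not DataFrames.jl’s actual code; the sketch only shows how a method restricted to Base.Callable (which is Union{Function, Type}) treats a plain functor differently from one that opts in to subtyping Function:

```julia
# Hypothetical functor that does NOT subtype Function.
struct Shift
    offset::Float64
end
(s::Shift)(x) = x + s.offset

# Hypothetical functor that explicitly opts in.
struct ShiftFn <: Function
    offset::Float64
end
(s::ShiftFn)(x) = x + s.offset

# Dispatch decides the role of the argument: only Base.Callable objects
# are treated as transformation functions in this sketch.
classify(f::Base.Callable) = :transformation
classify(x) = :something_else

println(classify(sin))
println(classify(x -> x + 1))
println(classify(Shift(1.0)))
println(classify(ShiftFn(1.0)))
```

Both structs can be called in exactly the same way; only the declared supertype changes which method is selected.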

An alternative would be to have a set of traits like e.g. potentially having iscallable, but there is no such thing currently.

Also related to:

the compiler de-optimizes a higher-order function that does not call the argument

This is exactly what we want in DataFrames.jl (but it is a secondary consideration; the primary one is that we need dispatch to decide the behavior). We want it because despecialization reduces compilation latency, and since DataFrames.jl is used interactively in the majority of cases, people want despecialization (technically: we despecialize everything except the functions that do heavy computations, which are specialized).


Thanks!
Now I know.

I have a use case where I use functors as pre-trained feature transformations. In that context, defining those structs as subtypes of Function doesn’t seem like a natural design choice.

Here’s a functor that applies learned normalization:

using DataFrames
using Statistics: mean, std

struct Normalizer
    μ
    σ
end

Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

function (m::Normalizer)(x::Real)
    return (x - m.μ) / m.σ
end

function (m::Normalizer)(x::AbstractVector)
    return (x .- m.μ) ./ m.σ
end

df = DataFrame(:v1 => rand(5), :v2 => rand(5))
feat_names = names(df)
norms = map((feat) -> Normalizer(df[:, feat]), feat_names)

As discussed earlier, the following doesn’t work:

transform(df, feat_names .=> norms .=> feat_names)
ERROR: LoadError: ArgumentError: Unrecognized column selector: "v1" => (Normalizer(0.5407170762469404, 0.1599492895436335) => "v1")

However, somewhat surprisingly, using ByRow does work:

transform(df, feat_names .=> ByRow.(norms) .=> feat_names)
5×2 DataFrame
 Row │ v1          v2        
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

So to use the vectorized form, it seems like a mapping of the Functors into Functions is required:

norms_f = map(f -> (x) -> f(x), norms)
transform(df, feat_names .=> norms_f .=> feat_names)
5×2 DataFrame
 Row │ v1          v2        
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

I can see that there’s a not too complicated way to circumvent the functor limitation through that remapping. Yet isn’t it counterintuitive that the functor works with ByRow but not in the vectorized case? In my opinion, recognizing functors as functions in transform would be the most natural way to handle them.

This is a completely different dispatch path internally.

Can you open an issue for this so we can discuss what can be done about it?


Issue opened: https://github.com/JuliaData/DataFrames.jl/issues/2984