Functor preprocessing pattern

ctrebbau · August 23, 2022, 2:29pm

I’m trying to adapt the preprocessing pattern showcased by https://github.com/jeremiedb here: https://www.youtube.com/watch?v=dzChtn9WWT8
He basically creates a struct that stores the transformation functions, plus a functor that applies them and returns the transformed DataFrame:

struct Preproc <: Function
    layers::Vector
end

function (p::Preproc)(df::DataFrame, ids=nothing)
    df = copy(df)
    ids = isnothing(ids) ? range(1, length(p.layers), step=1) : ids
    if length(ids) == 1
        transform!(df, p.layers[ids[1]])
    else
        for layer in p.layers[ids]
            transform!(df, layer)
        end
    end
    return df
end

I want to Box-Cox transform my target variable while storing its power parameter so I can apply it later to new observations, so I’ve used BoxCoxTrans and defined the following:

struct BoxCox{T} <: Function
    λ::T
end

BoxCox(x::AbstractVector) = BoxCox(BoxCoxTrans.lambda(x).value)

function (m::BoxCox)(x::Union{Real,Missing})
    return BoxCoxTrans.transform(x, m.λ)
end

However, when I push and “fit” with

function build_preproc(df)

    df_fit = copy(df) 
    preproc = Preproc([])

    push!(preproc.layers, :wait => BoxCox => :wait)
    df_fit = preproc(df_fit)

    return preproc
end

I get this in every cell of df

instead of the transformed values and I just cannot, for the life of me, get it to return the transformed variable.

Here are the links to Jeremie’s work:

github.com

JuliaComputing/ScoringEngineDemo.jl/blob/main/src/preproc-utils.jl

using CSV
using DataFrames
using Statistics: mean, std
using StatsBase: sample
using BSON


load_data(path) = CSV.File(path) |> DataFrame

function data_splits(df, train_perc)
    train_id = sample(1:nrow(df), Int(floor(train_perc * nrow(df))), replace = false, ordered = false)
    df_train = df[train_id, :]
    df_eval = df[InvertedIndex(train_id), :]
    return df_train, df_eval
end

"""
Preproc
Preproc functor. Holds a vector of transform operations
"""

This file has been truncated.

github.com

JuliaComputing/ScoringEngineDemo.jl/blob/main/src/preproc.jl

using CSV
using DataFrames
using Statistics: mean, std
using StatsBase: sample
using BSON

function build_preproc(df; norm_feats)

    df_fit = copy(df) # should not mutate the input df
    preproc = Preproc([])

    # density transformation
    push!(preproc.layers, ["population", "town_surface_area"] => density => "density")
    push!(preproc.layers, "density" => (x -> log.(max.(x, 0.01))) => "density")
    df_fit = preproc(df_fit, 1:2)

    push!(preproc.layers, cov_mapping)
    df_fit = preproc(df_fit, 3)

    push!(preproc.layers, drv_sex1)

This file has been truncated.

Thank you in advance.

Benny · August 23, 2022, 5:56pm

I don’t really understand the math(?) being done here, but let me ask if I’m reading this right. By transformed value, you mean the expression BoxCoxTrans.transform(x, m.λ) returned by the (m::BoxCox) method right? So the issue is that you’re reaching the callable BoxCox{Float64} instances and need that extra step of calling them, given some input for x::Union{Real,Missing}?

ctrebbau · August 23, 2022, 6:19pm

I’m no expert on cats either but I think you got it right, it’s that last step that eludes me

Benny · August 23, 2022, 7:10pm

My hunch is that the fix has to be in the :wait => BoxCox => :wait part. I’ve only used DataFrames rarely to limited degrees, so hopefully someone who knows it better can help you out. Maybe it would help them if you told them what the :wait data is (or a simulated example data), and what inputs you expect BoxCox and BoxCox{Float64} to be called on.

ctrebbau · August 24, 2022, 7:21am

Indeed, I’ve tried variations of :wait => BoxCox => :wait, such as :wait => ByRow(BoxCox) => :wait, which I think got me a step closer, since this is what I’m now getting

whereas before, the 0.25... coincided with the BoxCox lambda.

Also, :wait => (w -> BoxCox) => :wait gets me

On the other hand :wait => (w -> BoxCox(w)) => :wait returns the same as :wait => BoxCox => :wait

As for the interpretation of :wait, it’s a slightly skewed, roughly normally distributed waiting time stored as a <:Number. Hope that helps.

aplavin · August 25, 2022, 11:06am

Looks like you don’t apply your functor after creating it, so you end up with the functor object itself in the result. I never use DataFrames and am not familiar with how functors interact with its custom transformation syntax, but for regular arrays the transformation would look like this:

wait = ... # your vector of numbers
b = BoxCox(wait)
wait = b.(wait)  # vector of transformed values

ctrebbau · August 25, 2022, 12:41pm

Trying to broadcast with :wait => (w -> BoxCox.(w)) => :wait returns the same as the ByRow(BoxCox) version above.

ctrebbau · September 8, 2022, 1:18pm

:wait => BoxCox(df_fit[:, :wait]) => :wait
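
For anyone landing here later, a minimal sketch of why this seems to work: the middle element of the pair has to be the fitted functor instance, not the BoxCox type. Passing the type makes DataFrames call the constructor on the column, and the single object it returns is repeated in every row. The names below (BoxCoxSketch, the fixed λ of 0.5, the toy data) are hypothetical stand-ins so the example runs without BoxCoxTrans. ByRow is used only because the functor method here is scalar; if the functor has a method that accepts a whole column, it may not be needed.

```julia
using DataFrames

# Hypothetical stand-in for the BoxCox functor in this thread.
struct BoxCoxSketch{T} <: Function
    λ::T
end

# "Fitting" constructor. The thread estimates λ with BoxCoxTrans.lambda(x).value;
# a fixed 0.5 is assumed here so the sketch has no extra dependency.
BoxCoxSketch(x::AbstractVector) = BoxCoxSketch(0.5)

# Scalar methods, mirroring (m::BoxCox)(x::Union{Real,Missing}) above.
(m::BoxCoxSketch)(x::Real) = iszero(m.λ) ? log(x) : (x^(m.λ) - 1) / m.λ
(m::BoxCoxSketch)(::Missing) = missing

train = DataFrame(wait = [1.0, 4.0, 9.0])

# Passing the type: the constructor is called on the whole column, and the
# returned functor object ends up in every cell.
constructed = transform(train, :wait => BoxCoxSketch => :wait)

# Passing a fitted instance: the functor itself is the transformation.
fitted = BoxCoxSketch(train.wait)
transformed = transform(train, :wait => ByRow(fitted) => :wait)

# The stored λ can be reused on new observations.
new_obs = DataFrame(wait = [16.0])
scored = transform(new_obs, :wait => ByRow(fitted) => :wait)
```

Because the fitted instance lives inside the layer, pushing it into preproc.layers keeps λ together with the rest of the pipeline.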
