I train my MLJ model (EvoTrees.jl) on a Julia DataFrame with the scalars x1, x2, and x3 as features and y as the label. However, at inference time I have higher-dimensional arrays instead of tabular data. Any suggestions on how to code this nicely?
using MLJ
using DataFrames

df = DataFrame()
# features: nearest-neighbor values taken from a grid
df.x1 = rand(100)
df.x2 = rand(100)
df.x3 = rand(100)
# target: measured at a point within the grid
df.y = rand(100)
ETR = MLJ.@load EvoTreeRegressor pkg=EvoTrees
evotree = ETR()
mach = machine(evotree, df[:, Not(:y)], df.y)
fit!(mach)
######## inference ########
X1_inf = rand(100, 100, 10)
X2_inf = rand(100, 100, 10)
X3_inf = rand(100, 100, 10)
# this is what I want to do;
# I can come up with ways that work but feel wrong,
# like three nested for loops and casting, but how can I do it in a smart way?
Y_inf = MLJ.predict(mach, [X1_inf, X2_inf, X3_inf])
The simplest way would be to train on arrays instead of a DataFrame and then broadcast, or, during inference, to loop over all indices and cast the result to a DataFrame (roughly the sketch below). But is there a smarter way?
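To make that concrete, here is roughly the kind of workaround I mean, assuming the inference table has to carry the same column names x1, x2, x3 used in training (the variable names here are just illustrative):

# flatten the grids into one row per grid point, predict, then reshape back
rows = [(x1 = X1_inf[i], x2 = X2_inf[i], x3 = X3_inf[i]) for i in eachindex(X1_inf)]
df_inf = DataFrame(rows)                      # 100*100*10 rows, columns x1, x2, x3
Y_inf = reshape(MLJ.predict(mach, df_inf), size(X1_inf))

Linear indexing and reshape both follow Julia's column-major order, so Y_inf[i, j, k] should line up with X1_inf[i, j, k]. It works, but it still feels like fighting the tabular interface.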
For context, if it helps: I have a gridded climate model (latitude, longitude, height) and a plane trajectory going through the grid. For training, I take the weather (e.g. temperature, air pressure, and wind direction) at the nearest grid neighbor as features and the measurements from the plane (e.g. relative humidity) as the target, so that is tabular data. For inference, I want to feed in all grid points of the weather model and get a grid back out.