No model matching my data in MLJ: how should I prepare it?

Hi,

I’m taking one of my first steps into MLJ by writing a simple script that can predict diverse fields for my research. I know what I want to do can be achieved using simpler interpolation methods or Flux.jl, but I want to do it using MLJ, just for me to learn and explore the ecosystem.

Here is my goal: I have a set of Array{Union{Missing, Float32}, 2}, each one associated with a triplet [p1, p2, p3] of parameters. I want to train the model on this set, and obtain a prediction function that takes any triplet of parameters as input and returns “the best matching” array.
My issue is that there is no model matching my data, according to models(matching(X)), where X is a Vector{Vector{Float64}} (I removed the missing values from the original arrays).
In summary, my dataset has the following schema:

┌────────┬────────────────────────────┬─────────────────┐
│ names  │ scitypes                   │ types           │
├────────┼────────────────────────────┼─────────────────┤
│ param  │ AbstractVector{Continuous} │ Vector{Float64} │
│ field  │ AbstractVector{Continuous} │ Vector{Float64} │
└────────┴────────────────────────────┴─────────────────┘

I could easily split param into three separate columns, but I want to keep field as a whole.

How else would you format such data so that MLJ proposes compatible models?

Thanks a lot,
L.

I’d suggest doing this step in a custom preprocessing function that converts the Vector{Vector{Float64}} into a table (e.g. a DataFrame) that can be given as input to any model. With this you could build a pipeline my_custom_preprocessor |> some_MLJ_model (see e.g. here).
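
For concreteness, here is a minimal sketch of such a preprocessor. The function name flatten_to_table is illustrative (not part of MLJ), and the exact way a plain function is wrapped into a pipeline may differ; consult the MLJ composition docs:

```julia
using MLJ
import DataFrames as DF

# Hypothetical preprocessor: turns a Vector{Vector{Float64}} into a table
# with one Continuous column per vector entry, a shape MLJ models accept.
flatten_to_table(v) = DF.DataFrame(permutedims(reduce(hcat, v)), :auto)

# Example: 27 parameter triplets become a 27×3 table with columns x1, x2, x3
X = [rand(3) for _ in 1:27]
Xtable = flatten_to_table(X)
schema(Xtable)  # each column has scitype Continuous
```

A pipeline would then compose this function with a model, so the flattening happens automatically at fit and predict time.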

Unfortunately, this does not solve my issue: the output of the following code is an empty vector, indicating that no model can be used. Is there a workaround that would let me use MLJ?

using MLJ
import DataFrames as DF
using Random: seed!
seed!(0) # for reproducibility

design_ = [rand(3) for _ in 1:27]
field_ = [rand(1359) for _ in 1:27]

df = DF.DataFrame(param=design_, field=field_)

schema(df) |> display

df, df_test = partition(df, 26.0/27.0);

y, X = unpack(df, ==(:field));
y_test, X_test = unpack(df_test, ==(:field));

m = models(matching(X, y))

Leads to

┌───────┬────────────────────────────┬─────────────────┐
│ names │ scitypes                   │ types           │
├───────┼────────────────────────────┼─────────────────┤
│ param │ AbstractVector{Continuous} │ Vector{Float64} │
│ field │ AbstractVector{Continuous} │ Vector{Float64} │
└───────┴────────────────────────────┴─────────────────┘
NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}[]

You should match on the processed data, e.g.

m = models(matching(my_custom_preprocessor(X), my_custom_targetprocessor(y)))

with, for example,

my_custom_preprocessor(x) = DataFrame(hcat(x...)', :auto)

and either my_custom_targetprocessor = my_custom_preprocessor or

my_custom_targetprocessor(y) = getindex.(y, 1)
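
Putting the pieces together, a sketch of the full matching step (untested; the helper names are just the suggested ones, and taking the first component of each field is only one way to get a scalar target):

```julia
using MLJ
import DataFrames as DF

# Flatten a Vector{Vector{Float64}} into a table of Continuous columns
my_custom_preprocessor(x) = DF.DataFrame(permutedims(reduce(hcat, x)), :auto)

# Reduce each 1359-element field to a single scalar target, here its first entry
my_custom_targetprocessor(y) = getindex.(y, 1)

X = [rand(3) for _ in 1:27]
y = [rand(1359) for _ in 1:27]

# Matching on the processed data should now return a non-empty list of
# supervised models accepting Continuous features and a Continuous target
m = models(matching(my_custom_preprocessor(X), my_custom_targetprocessor(y)))
```

To predict the whole field rather than one component, one option is to fit a separate such model per output component (i.e. per column of the flattened target table).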