Alternative to map() for grouped dataframe?

I’m trying to do one quadratic fit per sub-dataframe of a grouped dataframe (using two of its columns, obviously), but I can’t use map() because it’s reserved, and a list comprehension throws an error because I’m working with views. How should I do this?

It would be easier if you posted a semi-runnable snippet of what you’re trying to do.

I thought that would be too difficult for some reason… heh
Working on a MWE right now

using DataFrames
using EasyFit

data = DataFrame(throw = repeat(1:5, inner=10), t = repeat(1:10, 5), x = repeat((1:10).^2, 5))
data_gdf = groupby(data, :throw)

fits = map(data_gdf) do sdf
    time, distance = sdf[!, :t], sdf[!, :x]
    fitquad(time, distance)
end

Output:

ArgumentError: using map over `GroupedDataFrame`s is reserved

DataFrames no longer allows map over GroupedDataFrames. Also, the data in the MWE wasn’t usable (ERROR: Could not obtain any successful fit, probably the data is not well posed).

So, use a comprehension.

using DataFrames, EasyFit, Random

Random.seed!(1)                    # for reproducibility

a = repeat(1:5, inner=20)          # 5 groups, 20 points each
t = repeat(range(0, 10; length=20), 5)  # same t grid per group

# true quadratic y = 2t^2 - 3t + 1 plus small noise
x = 2 .* t.^2 .- 3 .* t .+ 1 .+ 0.1 .* randn(length(t))

df = DataFrame(a = a, t = t, x = x)
data_gdf = groupby(df, :a)

fits = [fitquad(sdf.t, sdf.x) for sdf in data_gdf]
summary_fits = [(a = fit.a, b = fit.b, c = fit.c) for fit in fits]
5-element Vector{@NamedTuple{a::Float64, b::Float64, c::Float64}}:
 (a = 2.000993698655384, b = -3.016720982558867, c = 1.0089476616480098)
 (a = 1.9991770713827897, b = -2.994078426802373, c = 0.9818210102989037)
 (a = 1.9994283879614307, b = -2.998494794197106, c = 1.0155535104300808)
 (a = 2.0023161744639304, b = -3.029790779480457, c = 1.095258850137451)
 (a = 2.002728192866932, b = -3.0281152790428805, c = 1.0897536120125266)
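
For anyone who wants to check the difference without EasyFit installed, here is a minimal sketch using only DataFrames.jl. The toy columns `g` and `v` are made up for illustration, and it assumes a DataFrames.jl version that raises the `ArgumentError` quoted above.

```julia
using DataFrames

# Toy data: two groups with 2 rows and 1 row.
gdf = groupby(DataFrame(g = [1, 1, 2], v = [1, 2, 3]), :g)

# map over a GroupedDataFrame is reserved, so this call is expected to throw.
map_is_reserved = try
    map(nrow, gdf)
    false
catch err
    err isa ArgumentError
end

# Plain iteration is still allowed, so a comprehension works.
sizes = [nrow(sdf) for sdf in gdf]
```

Each `sdf` in the comprehension is a `SubDataFrame`, so `sdf.t`-style column access returns a view into the parent column.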

I think map supports pairs, so things like this should work:

julia> fits = map(pairs(data_gdf)) do (k, sdf)
           fitquad(sdf.t, sdf.x)
       end;

julia> summary_fits = [(a = fit.a, b = fit.b, c = fit.c) for fit in fits]
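
The `pairs` form also hands you the group key, which is handy if you want to keep it next to the coefficients. A rough sketch of that idea follows; `quadfit_ls` is a hypothetical stand-in for `fitquad` (plain least squares with `\`, not EasyFit's algorithm) so that the snippet runs with DataFrames.jl alone, and the column name `g` is made up.

```julia
using DataFrames

# Hypothetical stand-in for EasyFit's fitquad: ordinary least squares
# for x ≈ a*t^2 + b*t + c, solved with the backslash operator.
function quadfit_ls(t, x)
    A = [t .^ 2 t ones(length(t))]   # design matrix with columns t^2, t, 1
    a, b, c = A \ x
    return (a = a, b = b, c = c)
end

# Noise-free quadratic x = 2t^2 - 3t + 1, repeated for 3 groups.
t = repeat(collect(range(0, 10; length = 20)), 3)
df = DataFrame(g = repeat(1:3, inner = 20), t = t, x = 2 .* t .^ 2 .- 3 .* t .+ 1)
gdf = groupby(df, :g)

# Each element of pairs(gdf) is a `GroupKey => SubDataFrame` pair.
fits_by_group = [merge((g = k.g,), quadfit_ls(sdf.t, sdf.x)) for (k, sdf) in pairs(gdf)]
```

The result is a vector of named tuples, so `DataFrame(fits_by_group)` turns it into a table if needed.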

I’d probs just go for it in one shot with either base DataFrames.jl:

julia> summary_fits = combine(groupby(df, :a)) do sdf
           fits = fitquad(sdf.t, sdf.x)
           (fits_a = fits.a, fits_b = fits.b, fits_c = fits.c)
       end
5×4 DataFrame
 Row │ a      fits_a   fits_b    fits_c
     │ Int64  Float64  Float64   Float64
─────┼────────────────────────────────────
   1 │     1  2.00099  -3.01672  1.00895
   2 │     2  1.99918  -2.99408  0.981821
   3 │     3  1.99943  -2.99849  1.01555
   4 │     4  2.00232  -3.02979  1.09526
   5 │     5  2.00273  -3.02812  1.08975

or DataFramesMeta.jl though:

julia> summary_fits = @by df :a @astable begin
           fits = fitquad(:t, :x)
           :fits_a = fits.a
           :fits_b = fits.b
           :fits_c = fits.c
       end
5×4 DataFrame
 Row │ a      fits_a   fits_b    fits_c
     │ Int64  Float64  Float64   Float64
─────┼────────────────────────────────────
   1 │     1  2.00099  -3.01672  1.00895
   2 │     2  1.99918  -2.99408  0.981821
   3 │     3  1.99943  -2.99849  1.01555
   4 │     4  2.00232  -3.02979  1.09526
   5 │     5  2.00273  -3.02812  1.08975

I would help, but I get an error with EasyFit:


julia> fits = [fitquad(sdf.t, sdf.x) for sdf in data_gdf]
5-element Vector{EasyFit.Quadratic{Float64, Float64, Float64, Float64, Float64}}:
Error showing value of type Vector{EasyFit.Quadratic{Float64, Float64, Float64, Float64, Float64}}:

SYSTEM (REPL): showing an error caused an error
ERROR: 1-element ExceptionStack:
UndefVarError: `f` not defined in `EasyFit`
Suggestion: check for spelling errors or missing imports.
MethodError: no method matching fitquadratic(::SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}, ::SubArray{Union{Missing, Float64}, 1, Vector{Union{Missing, Float64}}, Tuple{Vector{Int64}}, false})
The function `fitquadratic` exists, but no method is defined for this combination of argument types.

I think this error has to do with the dataframe display, because if you unnest the column or use ; to suppress the output, the error should go away.

UndefVarError: `f` not defined in `EasyFit`

In addition to the many solutions above, you could use TidierData.jl (the main branch at this time, still unreleased) and @unnest_wider:

julia> @chain df begin
           @group_by(a)
           @summarize(model = fitquad(t, x))
           @unnest_wider(model)
       end
5×9 DataFrame
 Row │ a      model_a  model_b   model_c   model_R2  model_x                          ⋯
     │ Int64  Float64  Float64   Float64   Float64   Array…                           ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │     1  2.00099  -3.01672  1.00895   0.999996  [0.0, 0.10101, 0.20202, 0.30303, ⋯
   2 │     2  1.99918  -2.99408  0.981821  0.999996  [0.0, 0.10101, 0.20202, 0.30303, ⋯
   3 │     3  1.99943  -2.99849  1.01555   0.999998  [0.0, 0.10101, 0.20202, 0.30303, ⋯
   4 │     4  2.00232  -3.02979  1.09526   0.999997  [0.0, 0.10101, 0.20202, 0.30303, ⋯
   5 │     5  2.00273  -3.02812  1.08975   0.999998  [0.0, 0.10101, 0.20202, 0.30303, ⋯
                                                                      4 columns omitted


This is because EasyFit apparently does not support arrays whose element type allows missing values, even if they do not contain any, due to type parameter constraints. The problem is not that it’s a view, as you’d said in the top post. You can call dropmissing! to remove any rows with missing values before passing the data to fitquad, or just disallowmissing! if the columns don’t actually contain any missing values.

For the sake of a reproducible example, I added an allowmissing!(df, [:t, :x]) to a setup similar to what @technocrat provided above:

julia> disallowmissing!(df);

julia> combine(groupby(df, :throw), [:t, :x] => function (t, x)
                   fit = fitquad(t, x)
                   return (; fit.a, fit.b, fit.c)
               end => AsTable)
5×4 DataFrame
 Row │ throw  a        b         c
     │ Int64  Float64  Float64   Float64
─────┼────────────────────────────────────
   1 │     1  2.00099  -3.01672  1.00895
   2 │     2  1.99918  -2.99408  0.981821
   3 │     3  1.99943  -2.99849  1.01555
   4 │     4  2.00232  -3.02979  1.09526
   5 │     5  2.00273  -3.02812  1.08975
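
To see the element-type point in isolation, here is a small sketch that needs only DataFrames.jl. The check against `AbstractVector{<:Real}` is an assumption standing in for whatever constraint EasyFit actually declares (guessed from the MethodError above), and the toy data is made up.

```julia
using DataFrames

df = DataFrame(throw = repeat(1:2, inner = 3), t = collect(1.0:6.0))

# After allowmissing!, the column eltype is Union{Missing, Float64},
# even though no value is actually missing.
allowmissing!(df, :t)
sdf = groupby(df, :throw)[1]
loose = sdf.t isa AbstractVector{<:Real}    # Union{Missing, Float64} is not <: Real

# disallowmissing! narrows the eltype back to Float64
# (it throws if a missing value is actually present).
disallowmissing!(df, :t)
sdf = groupby(df, :throw)[1]
strict = sdf.t isa AbstractVector{<:Real}
```

In both cases `sdf.t` is a view (`SubArray`), so the view itself is not what decides whether a method restricted to real-valued vectors applies; the element type is.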