Map over combinations of parameters and group results as a DataFrame

Coming from the R world, I’m wondering how to translate the following task into more idiomatic Julia. I often need to run a function, say my_model(x, a, b, c), for various combinations of parameters a, b, c, where the output of my_model will typically be a DataFrame with length(x) rows. Here’s an R version of this workflow:

library(purrr)
library(dplyr)
library(tidyr)

my_model <- function(x=seq(0,10, length=100), a=1, b=1, c=1, fun = cos){

  # dummy example here
  data.frame(x=x, y = a*sin(b*x) + c, z = a*fun(b*x) + c)
  
}

head(my_model())
# x        y        z
# 1 0.0000000 1.000000 2.000000
# 2 0.1010101 1.100838 1.994903
# 3 0.2020202 1.200649 1.979663
# 4 0.3030303 1.298414 1.954437
# 5 0.4040404 1.393137 1.919480
# 6 0.5050505 1.483852 1.875150

params <- expand.grid(a=c(0.1,0.2,0.3), b = c(1,2,3), c = c(0,0.5))
head(params)
# a b c
# 1 0.1 1 0
# 2 0.2 1 0
# 3 0.3 1 0
# 4 0.1 2 0
# 5 0.2 2 0
# 6 0.3 2 0

all <- pmap_df(params, my_model, fun = tanh, .id = 'id')
str(all)
# 'data.frame':	1800 obs. of  4 variables:
# $ id: chr  "1" "1" "1" "1" ...
# $ x: num  0 0.101 0.202 0.303 0.404 ...
# $ y: num  0 0.0101 0.0201 0.0298 0.0393 ...
# $ z: num  0.1 0.111 0.122 0.135 0.15 ...


# join with the 'metadata'
params$id <- as.character(1:nrow(params))

d <- left_join(params, all, by='id')
str(d)
# 'data.frame':	1800 obs. of  7 variables:
# $ a : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
# $ b : num  1 1 1 1 1 1 1 1 1 1 ...
# $ c : num  0 0 0 0 0 0 0 0 0 0 ...
# $ id: chr  "1" "1" "1" "1" ...
# $ x : num  0 0.101 0.202 0.303 0.404 ...
# $ y : num  0 0.0101 0.0201 0.0298 0.0393 ...
# $ z : num  0.1 0.111 0.122 0.135 0.15 ...

# optional: reshape to long format for visualisation 
m <- pivot_longer(d, cols = c('y','z'))
library(ggplot2)
ggplot(m, aes(x, value, colour=a, linetype=factor(c), group=interaction(a,c))) +
  facet_grid(name~b, scales='free_y', labeller = label_both) +
  geom_line()

I find this workflow very handy and extensible, and because it’s typically for interactive analyses the raw efficiency isn’t too much of a concern (a more realistic my_model may be slow for each iteration, so any slight overhead of manipulating the data this way is negligible).

In Julia, I would likely use a comprehension to loop over the combinations of parameters, e.g.

all = [my_model(x, a,b,c) for a=..., b = ..., c = ...]

and then splat the results together and add repeated copies of the parameters a, b, c, but that’s less streamlined. Am I missing an equivalent to purrr::pmap_df() and dplyr::left_join()?
I saw some uses of Base.Cartesian, but it doesn’t feel super intuitive to me.

Many thanks.

This is a complicated question. DataFrames.jl can certainly handle this kind of operation.

I think one thing that’s missing in your workflow is an easy Julia equivalent of expand.grid. This function (using DataFrames.jl) should suffice:

using DataFrames

function expand_grid(; kws...)
    names, vals = keys(kws), values(kws)
    return DataFrame(NamedTuple{names}(t) for t in Iterators.product(vals...))
end
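(For anyone curious, the core of the trick is pairing the keyword names back up with each tuple that Iterators.product yields; stripped of the DataFrame wrapper, a Base-only sketch looks like:)

```julia
# Base-only core of expand_grid: NamedTuple{names} stitches the
# keyword names back onto each tuple from Iterators.product.
vals = (a = [0.1, 0.2], b = [1, 2])
grid = [NamedTuple{keys(vals)}(t) for t in Iterators.product(values(vals)...)]
first(grid)  # (a = 0.1, b = 1); the first factor varies fastest
```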

Your my_model function can be ported almost directly

function mymodel(;x = range(0, 10, length = 100), a = 1, b = 1, c = 2, fun = sin)
    DataFrame(x = x, y = a .* sin.(b .* x) .+ c, z = a .* fun.(b .* x) .+ c)
end

So here is the whole thing so far:

julia> function mymodel(;x = range(0, 10, length = 100), a = 1, b = 1, c = 2, fun = sin)
           DataFrame(x = x, y = a .* sin.(b .* x) .+ c, z = a .* fun.(b .* x) .+ c)
       end
mymodel (generic function with 6 methods)

julia> params = expand_grid(a=[0.1,0.2,0.3], b = [1,2,3], c = [0,0.5]);

julia> map(eachrow(params)) do r
           mymodel(;fun = tanh, pairs(r)...)
       end;

If I understand you correctly then what I tend to do is:

julia> using DataFrames, IterTools

julia> param1 = [1,2,3]; param2 = ["a", "b"];

julia> df = rename!(DataFrame(IterTools.product(param1, param2)), ["param1", "param2"])

julia> df[!, :result] .= 0.0; df
6×3 DataFrame
 Row │ param1  param2  result  
     │ Int64   String  Float64 
─────┼─────────────────────────
   1 │      1  a           0.0
   2 │      2  a           0.0
   3 │      3  a           0.0
   4 │      1  b           0.0
   5 │      2  b           0.0
   6 │      3  b           0.0

julia> for r ∈ eachrow(df)
           r.result = mymodel(r.param1, r.param2)
       end

(of course result could be anything, i.e. it could hold a NamedTuple with multiple results)


Thanks! I’d used Iterators.product in the past (but forgotten what it was called), but I doubt I would have managed to wrap it as neatly as expand_grid here.

With the map over rows, what would be a good way to store the results and combine them all together with the parameters?

DataFrames.jl has leftjoin, much like R’s left_join, so your R approach will also work here.
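A minimal sketch, with made-up tables and column names:

```julia
using DataFrames

# hypothetical tables: params holds one row per parameter combination,
# results holds the stacked model output tagged with the same id
params  = DataFrame(id = [1, 2], a = [0.1, 0.2])
results = DataFrame(id = [1, 1, 2, 2], x = [0.0, 1.0, 0.0, 1.0])
d = leftjoin(params, results, on = :id)  # each result row picks up its a
```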

Thanks, I’m curious how you’d handle the case where mymodel returns a DataFrame (with 100 rows here). Preallocating the ‘result’ column seems problematic because it’s the wrong container type and dimension. And if I can store an arbitrary object in it instead, what can I use to “unnest” the DataFrame into a flat format at the end? (In R’s tidyverse, that would be:

params$results <- pmap(params, my_model, fun = tanh)
unnest(params, cols='results')
# A tibble: 1,800 × 6
       a     b     c     x      y      z
   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
 1   0.1     1     0 0     0      0     
 2   0.1     1     0 0.101 0.0101 0.0101
 3   0.1     1     0 0.202 0.0201 0.0199
 4   0.1     1     0 0.303 0.0298 0.0294
 5   0.1     1     0 0.404 0.0393 0.0383
 6   0.1     1     0 0.505 0.0484 0.0466
 7   0.1     1     0 0.606 0.0570 0.0541
 8   0.1     1     0 0.707 0.0650 0.0609
 9   0.1     1     0 0.808 0.0723 0.0669
10   0.1     1     0 0.909 0.0789 0.0721
# … with 1,790 more rows

Then I need an “id” column both in params (easy enough) and in the result of map() – do you see an elegant way of obtaining this?
The brute-force way I can see is to vcat() the results of map() and then create an id variable with the right number of repeats.


params = expand_grid(a=[0.1,0.2,0.3], b = [1,2,3], c = [0,0.5]);

all = map(eachrow(params)) do r
    mymodel(;fun = tanh, pairs(r)...)
end;

params[!, :id] = 1:nrow(params)

all_df = vcat(all...)
all_df[!, :id] = repeat(1:nrow(params), inner=100)

DataFrames.leftjoin(params, all_df, on=:id)

(feels a bit clunky…)

Don’t splat like that. In Julia, you should never splat large collections. Use reduce(vcat, all) instead.

You could use mapreduce instead of map to get a DataFrame out, combining both the map and the reduce into one call.
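Something like this (a Base-only sketch, with a toy model_run standing in for mymodel; the names are made up):

```julia
# mapreduce fuses the map and the vcat into one call. Each "model run"
# returns a vector of NamedTuple rows instead of a DataFrame here.
model_run(a) = [(a = a, x = x, y = a * x) for x in 0:1]
all_rows = mapreduce(model_run, vcat, [0.1, 0.2])  # 4 rows, in parameter order
```

One caveat: as far as I can tell, the single-pass specialization of reduce(vcat, …) applies to reduce itself, not to mapreduce, so for many large blocks the two-step reduce(vcat, map(…)) may still be preferable.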

all_df[!, :id] = repeat(1:nrow(params), inner=100)

This is a bit scary, as it relies on the length of whatever you are using being 100. Maybe have id be an input into the mymodel function? Just spitballing though.
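For instance, tagging each block by its index with enumerate makes no assumption about block length (just a sketch):

```julia
# blocks of *varying* length, so repeat(..., inner = 100) would not work
blocks = [fill(10i, i + 1) for i in 1:3]  # lengths 2, 3, 4
ids  = reduce(vcat, [fill(id, length(b)) for (id, b) in enumerate(blocks)])
flat = reduce(vcat, blocks)
# ids == [1, 1, 2, 2, 2, 3, 3, 3, 3], aligned row-for-row with flat
```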

Thanks – I did see someone warn another user that this was inefficient; reduce(vcat, all) is basically R’s do.call(rbind, all), with a better name, so I’m happy to remember that.

Edit: Actually, reduce(vcat, all) is not super intuitive to me, because I would have expected it to be very inefficient (recursively creating larger and larger intermediates until all the DataFrames are combined). From ?reduce I see that there’s a special method for vcat-like functions to bypass this problem, presumably.
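To convince myself, a quick Base-only check that the specialized path gives the same result as the splatted version:

```julia
blocks = [fill(i, 3) for i in 1:4]  # four length-3 blocks
flat = reduce(vcat, blocks)         # specialized method: single pass
flat == vcat(blocks...)             # true: same result as splatting
```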

Thanks – I suspected there’d be an equivalent to map_df (which combines the map + reduce).

Definitely a concern, I would not want to have that in my code. I could modify my_model (just like I could have it include a, b, c in the returned DataFrame), but I’m hoping there’s a natural approach that makes it unnecessary.

I’ve noticed that DataFrames’ reduce(vcat, …) method has an optional source keyword argument specifically to keep track of each block. With this, I now have the following wrapper:


function expand_grid(; kws...)
    names, vals = keys(kws), values(kws)
    return DataFrame(NamedTuple{names}(t) for t in Iterators.product(vals...))
end

function mymodel(;x = range(0, 10, length = 100), a = 1, b = 1, c = 2, fun = sin, kws...)
    DataFrame(x = x, y = a .* sin.(b .* x) .+ c, z = a .* fun.(b .* x) .+ c)
end

function pmap_df(p, f, kws...; join = true)
    tmp = map(f, eachrow(p), kws...)
    all = reduce(vcat, tmp, source = "id")
    if !join
        return all
    end
    p[!, :id] = 1:nrow(p)
    return DataFrames.leftjoin(p, all, on = :id)
end

params = expand_grid(a=[0.1,0.2,0.3], b = [1,2,3], c = [0,0.5]);

pmap_df(params, p -> mymodel(;p..., fun=tanh))

That’s pretty close to what I was after; I’d still like to merge the map+reduce steps with mapreduce, but I’m not sure how to pass extra arguments to reduce. Any idea?

I was just searching for a Julia equivalent of R’s expand.grid, and that is exactly what I needed, and in such a lovely concise way. Thanks!