Mapreduce, pass extra arguments to reduce/vcat of DataFrames

baptnz · March 3, 2022, 11:02am

I’m using a map+reduce combination to apply a function returning a DataFrame and combine all the results with vcat,

f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))  
tmp = map(f, ([1,2], [3,4], [5,6]))
all = reduce(vcat, tmp, source="id")

the allocation of temporary results in tmp isn’t ideal, as ?mapreduce suggests, but I’ve been unable to find how to pass the equivalent of source = "id" to mapreduce, to keep track of each block’s origin. Any ideas?

Henrique_Becker · March 3, 2022, 2:13pm

Uhh, Base.reduce has no source parameter, unless some package you import changes this?

baptnz · March 3, 2022, 2:27pm

It’s the method defined for DataFrames,

methods(reduce, DataFrames)

or perhaps more specifically, the source argument originates from methods(vcat, DataFrames) and is passed through reduce in this case.

pdeffebach · March 3, 2022, 3:04pm

The reason performance isn’t ideal might be the use of a Tuple rather than a Vector. Large Tuples don’t perform well… but that also might not be the issue.

I guess your problem is that there is a specialized method for reduce but not for mapreduce. One way to improve performance might be with a Generator.

You can also use an anonymous function. They work just like in R

reduce((a, b) -> ..., dfs)

but it looks like this doesn’t work

julia> ps = [[1, 2], [3, 4], [5, 6]];

julia> mapreduce(f, (a, b) -> vcat(a, b, source = "id"), ps);
ERROR: ArgumentError: column(s) id are missing from argument(s) 2

Given this, I think maybe we should add a method for mapreduce in addition to reduce.

Another solution, along the lines of what I discussed yesterday, is to do more inside an anonymous function

julia> ps = [[1, 2], [3, 4], [5, 6]];

julia> mapreduce(vcat, eachindex(ps)) do i
           df = f(ps[i])
           df.source .= i
           df
       end

bkamins · March 3, 2022, 3:04pm

You can change your f function to create an :id column and use mapreduce.

rocco_sprmnt21 · March 3, 2022, 3:21pm

Something like this:

f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10), id=string(p[1])) 


mapreduce(f, (x,y)->vcat(x,y), ([1,2], [3,4],[5,6]))

Oops… I arrived late.

Or something like this, if you really want to use library functions:

mapreduce(f, (x,y)->vcat(x,y, source=string("id",nrow(x)), cols=:union), ([1,2], [3,4],[5,6]))

baptnz · March 3, 2022, 4:22pm

I’m also a bit puzzled by this; I thought I’d managed to get it to work with

mapreduce(f, (x,y) -> vcat(x,y, source="id", cols=:intersect), ps)

but on my longer example I get some unexpected missing values that I don’t understand at all.

pdeffebach · March 3, 2022, 4:52pm

In this context, vcat just knows about the two arguments it is given. Since each vcat takes two arguments in this example, source can only take the values of 1 or 2. cols = :intersect tells vcat to just keep columns that are in both data frames… I’m surprised that works tbh. I would think it would throw an error.

Also, maybe source isn’t the right move here. Why not just add the parameters directly to the data frame? In julia you can have vectors of vectors. No need to worry about mapping ids to parameters when you can just store the parameters.
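
A minimal sketch of storing the parameters directly, assuming the f from the original post. The wrapper f_with_params and the column name p are hypothetical names chosen for illustration:

```julia
using DataFrames

# f is the function from the original post, unchanged.
f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))

ps = [[1, 2], [3, 4], [5, 6]]

# Hypothetical wrapper: run f, then attach the parameter vector to every row.
function f_with_params(p)
    df = f(p)
    df[!, :p] = fill(p, nrow(df))  # a column whose elements are vectors
    return df
end

# Plain vcat is enough here, since every block already carries its parameters.
res = mapreduce(f_with_params, vcat, ps)
```

Each row of res then holds the parameter vector that produced it, so there is no separate id-to-parameter lookup to maintain.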

baptnz · March 3, 2022, 5:12pm

Yeah, I’m also a bit confused by my trial and error (:union fails, for example). Clearly this won’t work. I think I finally understand the relation between reduce and vcat for DataFrames, which I had mistakenly read the other way around in the source code. It’s the vcat() method that calls reduce() and passes it the source argument – because reduce has all the objects, it can create the IDs, as you say, whereas vcat() called pairwise only ever sees x and y. It’s confusing because vcat() is defined as

Base.vcat(dfs::AbstractDataFrame...;
          cols::Union{Symbol, AbstractVector{Symbol},
                      AbstractVector{<:AbstractString}}=:setequal,
          source::Union{Nothing, SymbolOrString,
                        Pair{<:SymbolOrString, <:AbstractVector}}=nothing) =
    reduce(vcat, dfs; cols=cols, source=source)

suggesting (to my naive eyes) that vcat itself uses source, when in fact it just feeds it to the custom reduce.
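
For illustration, a small sketch of the two forms of source that the signature above allows, passed to reduce over a vector of data frames. The labels are made up for the example, and map still allocates the intermediate vector, so this does not remove the temporary from the original post:

```julia
using DataFrames

# f is the function from the original post, unchanged.
f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))

ps = [[1, 2], [3, 4], [5, 6]]
dfs = map(f, ps)

# source given as a column name: blocks are numbered by their position in dfs.
by_index = reduce(vcat, dfs; source = :id)

# source given as name => labels: one label per data frame (illustrative labels).
labels = ["a", "b", "c"]
by_label = reduce(vcat, dfs; source = :id => labels)
```

Because reduce sees the whole collection at once, it can assign one identifier per block, which a pairwise vcat inside mapreduce cannot do.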

baptnz · March 3, 2022, 5:16pm

I liked the conciseness of it (no need to create an anonymous function to add the “id” (or all the parameters for that call directly, alternatively)), but that’s purely an aesthetic preference, which I wouldn’t even think about if I hadn’t been using purrr::pmap_df() for years (plyr::mdply() before that).

rocco_sprmnt21 · March 3, 2022, 5:30pm

This way is perhaps an acceptable compromise:

f(p,id) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10), id=id) 

mapreduce(t->f(t[2],t[1]), (x,y)->vcat(x,y), enumerate(([1,2], [3,4],[5,6],[7,8])))

This way you don’t have to change your f(p):

F(id,p)=hcat(f(p),DataFrame(id=fill(id,nrow(f(p)))))

mapreduce(t->F(t...), (x,y)->vcat(x,y), enumerate(([1,2], [3,4],[5,6])))
