Mapreduce, pass extra arguments to reduce/vcat of DataFrames

I’m using a map+reduce combination to apply a function returning a DataFrame and combine all the results with vcat,

f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))  
tmp = map(f, ([1,2], [3,4], [5,6]))
all = reduce(vcat, tmp, source="id")

the allocation of temporary results in tmp isn’t ideal, as ?mapreduce suggests, but I’ve been unable to find how to pass the equivalent source = "id" argument to mapreduce, to keep track of each block’s origin. Any idea?

Uhh, Base.reduce has no source parameter. Does some package you import change this?

It’s the method defined for DataFrames,

methods(reduce, DataFrames)

or perhaps more specifically, the source argument originates from methods(vcat, DataFrames) and is passed through reduce in this case.

The less-than-ideal performance might be due to the use of a Tuple rather than a Vector. Large Tuples don’t perform well… but that also might not be the issue.

I guess your problem is that there is a specialized method for reduce but not for mapreduce. One way to improve performance might be with a Generator.
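For instance, a minimal sketch reusing the f and parameters from the original post. One caveat: a lazy Generator may not dispatch to the DataFrames-specific reduce method (which is defined for vectors and tuples of data frames), so a comprehension, which materializes a Vector, is the safer form, and it still avoids the large-Tuple overhead:

```julia
using DataFrames

f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))
ps = [[1, 2], [3, 4], [5, 6]]

# A comprehension yields a Vector{DataFrame} (not a large Tuple), and a
# Vector of data frames dispatches to the reduce method that accepts source:
all_df = reduce(vcat, [f(p) for p in ps]; source = "id")
```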

You can also use an anonymous function. They work just like in R:

reduce((a, b) -> ..., dfs)

but it looks like this doesn’t work

julia> ps = [[1, 2], [3, 4], [5, 6]];

julia> mapreduce(f, (a, b) -> vcat(a, b, source = "id"), ps);
ERROR: ArgumentError: column(s) id are missing from argument(s) 2

Given this, I think maybe we should add a method for mapreduce in addition to reduce.
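In the meantime, a user-side helper gives the same effect. This is only a sketch, and mapreduce_vcat is a made-up name, not a real DataFrames API; it just materializes the mapped results and forwards to the existing specialized reduce:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames): collect the mapped frames
# into a Vector, then let the specialized reduce handle source/cols.
mapreduce_vcat(f, itr; source = nothing, cols = :setequal) =
    reduce(vcat, [f(x) for x in itr]; cols = cols, source = source)

f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))
tagged = mapreduce_vcat(f, [[1, 2], [3, 4], [5, 6]]; source = "id")
```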

Another solution, along the lines of what I discussed yesterday, is to do more inside an anonymous function

julia> ps = [[1, 2], [3, 4], [5, 6]];

julia> mapreduce(vcat, eachindex(ps)) do i
           df = f(ps[i])
           df.source .= i
           df
       end

You can change your f function to create an :id column and use mapreduce.

something like this …

f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10), id=string(p[1])) 

mapreduce(f, (x,y)->vcat(x,y), ([1,2], [3,4],[5,6]))

Oops … I arrived late

or something like that, if you really want to use library functions :grinning:

mapreduce(f, (x,y)->vcat(x,y, source=string("id",nrow(x)), cols=:union), ([1,2], [3,4],[5,6]))

I’m also a bit puzzled by this; I thought I’d managed to get it to work with

mapreduce(f, (x,y) -> vcat(x,y, source="id", cols=:intersect), ps)

but on my longer example I get some unexpected missing values that I don’t understand at all.

In this context, vcat just knows about the two arguments it is given. Since each vcat takes two arguments in this example, source can only take the values 1 or 2. cols = :intersect tells vcat to keep just the columns that are in both data frames… I’m surprised that works, tbh; I would think it would throw an error.
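A quick illustration of that point, reusing the f from the original post:

```julia
using DataFrames

f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))

# Pairwise vcat only ever sees its two arguments, so the id column can
# only contain the values 1 and 2, no matter how many frames you fold:
pair = vcat(f([1, 2]), f([3, 4]); source = "id")
```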

Also, maybe source isn’t the right move here. Why not just add the parameters directly to the data frame? In Julia you can have vectors of vectors; no need to worry about mapping ids to parameters when you can just store the parameters.
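A sketch of that approach, assuming a modified constructor g (a made-up name) that stores the parameter vector itself as a column:

```julia
using DataFrames

# Illustrative only: tag each row with the parameter vector that produced
# it, so no id-to-parameter lookup is needed afterwards.
function g(p)
    df = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))
    df.p = fill(p, nrow(df))   # a column whose elements are vectors
    return df
end

ps = [[1, 2], [3, 4], [5, 6]]
all_df = reduce(vcat, [g(p) for p in ps])
```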


Yeah, I’m also a bit confused by my trial and error (:union fails, for example). Clearly this won’t work. I think I finally understand the relation between reduce and vcat for DataFrames, which I had mistakenly read the other way around in the source code. It’s the vcat() method that calls reduce() and passes it the source argument: because reduce sees all the objects, it can create the IDs, as you say, whereas vcat() on its own only gets passed x and y. It’s confusing because vcat() is defined as

Base.vcat(dfs::AbstractDataFrame...;
          cols::Union{Symbol, AbstractVector{Symbol},
                      AbstractVector{<:AbstractString}}=:setequal,
          source::Union{Nothing, SymbolOrString,
                        Pair{<:SymbolOrString, <:AbstractVector}}=nothing) =
    reduce(vcat, dfs; cols=cols, source=source)

suggesting (to my naive eyes) that vcat itself uses source, when in fact it just feeds it to the custom reduce.

I liked the conciseness of it: no need to create an anonymous function to add the “id” column (or, alternatively, to add all the parameters for that call directly). But that’s purely an aesthetic preference, which I wouldn’t even think about if I hadn’t been using purrr::pmap_df() for years (plyr::mdply() before that).

In this way it is perhaps an acceptable compromise:

f(p,id) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10), id=id) 

mapreduce(t->f(t[2],t[1]), (x,y)->vcat(x,y), enumerate(([1,2], [3,4],[5,6],[7,8])))

This way you don’t have to change your f(p):


F(id, p) = insertcols!(f(p), :id => id)   # wraps the original one-argument f

mapreduce(t->F(t...), (x,y)->vcat(x,y), enumerate(([1,2], [3,4],[5,6])))