I’m using a map + reduce combination to apply a function that returns a DataFrame and then combine all the results with vcat:
using DataFrames
f(p) = DataFrame(x = collect(1:10), y = p[1]*rand(10), z = p[2]*rand(10))
tmp = map(f, ([1,2], [3,4], [5,6]))
# named `combined` rather than `all` to avoid shadowing Base.all
combined = reduce(vcat, tmp, source="id")
Allocating the temporary results in tmp isn’t ideal, as ?mapreduce suggests, but I haven’t been able to find how to pass the equivalent source = "id" argument to mapreduce to keep track of each block’s origin. Any ideas?
In this context, vcat only knows about the two arguments it is given. Since each vcat call takes two arguments in this example, source can only take the values 1 or 2. cols = :intersect tells vcat to keep only the columns that are in both data frames… I’m surprised that works tbh. I would have expected it to throw an error.
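To make the difference concrete, here is a minimal sketch, assuming DataFrames.jl is loaded. The function g, the labels "a", "b", "c" and the deterministic y column are made up for illustration. The reduce(vcat, dfs; source = ...) method receives the whole collection at once, so it can number or label every block, which a two-argument op inside mapreduce cannot do.

```julia
using DataFrames

# Hypothetical stand-in for f, with deterministic values so results are comparable.
g(p) = DataFrame(x = 1:3, y = p[1] * ones(3))

params = ([1, 2], [3, 4], [5, 6])
dfs = [g(p) for p in params]            # Vector of three data frames

# reduce sees all three data frames, so it can assign ids 1, 2, 3.
a = reduce(vcat, dfs; source = :id)

# Splatting into vcat hands over all the data frames in one call as well.
b = vcat(dfs...; source = :id)

# A Pair supplies custom identifiers, one per data frame.
c = reduce(vcat, dfs; source = :id => ["a", "b", "c"])
```

With the Pair form, the identifier vector needs one entry per data frame.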
Also, maybe source isn’t the right move here. Why not just add the parameters directly to the data frame? In Julia you can have vectors of vectors, so there is no need to worry about mapping ids to parameters when you can just store the parameters themselves.
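A rough sketch of that idea, assuming DataFrames.jl; the function name h and the column name params are made up for illustration. Each block carries its own parameter vector in a column, so mapreduce works without any source argument.

```julia
using DataFrames

# Hypothetical variant of f that stores the parameter vector in every row.
h(p) = DataFrame(x = collect(1:10),
                 y = p[1] * rand(10),
                 z = p[2] * rand(10),
                 params = fill(p, 10))   # column of type Vector{Vector{Int}}

# Plain binary vcat is enough here, so there is no intermediate tmp collection.
res = mapreduce(h, vcat, ([1, 2], [3, 4], [5, 6]))
```

Note that fill(p, 10) repeats a reference to the same vector rather than copying it, which is fine as long as the parameters are not mutated afterwards.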
Yeah, I’m also a bit confused by my trial and error (:union fails, for example). Clearly this won’t work. I think I finally understand the relation between reduce and vcat for DataFrames, which I had mistakenly read the other way around in the source code. It’s the vcat() method that calls reduce() and passes it the source argument. Because reduce has all the objects, it can create the IDs, as you say, whereas vcat() on its own only gets passed x and y. It’s confusing because vcat() is defined as
I liked the conciseness of it (no need to create an anonymous function to add the “id” (or all the parameters for that call directly, alternatively)), but that’s purely an aesthetic preference, which I wouldn’t even think about if I hadn’t been using purrr::pmap_df() for years (plyr::mdply() before that).