How Do I Most Effectively Use Multiple Dispatch in Data Science Workflows?

Hi all,

Say I have two DataFrames, df1 and df2. They have unique values but share the same column names: Cost, Customer, and Product. As I move forward with my data processing, I sometimes want to harmonize the DataFrames by joining on Customer and Cost and then run a specific set of analyses on the result. Other times, I want to harmonize them by joining on Customer and Product and run a different set of analyses. What would be the best approach to handling these different analyses?

My thought was if I want to utilize multiple dispatch for this analysis I would:

  1. Create one function called harmonize with the following arguments: harmonize(df_1, df_2)
  2. To accomplish different analyses as stated above, I would then dispatch on harmonize with these two dispatches (Please see Edit 1 for more details on these functions):
    1. harmonize(df_1, df_2 ; analysis::Symbol = :CustomerCost)
    2. harmonize(df_1, df_2 ; analysis::Symbol = :CustomerProduct)

So far, this is working for my analysis. However, I was wondering if this is an abuse of multiple dispatch or an incorrect way of thinking of using multiple dispatch for data science. What are people’s thoughts on my proposed pipeline for analysis?

Thank you!

~ tcp :deciduous_tree:

P.S. If you want me to add any more clarity/information to this post, let me know. I somewhat struggled to articulate what I was trying to say here.

Edit 1:

The two functions would be dispatches that look like this:

function harmonize(df_1, df_2 ; analysis::Symbol = :CustomerCost)

# Code which does the analysis for a Customer and Cost join

end

and

function harmonize(df_1, df_2 ; analysis::Symbol = :CustomerProduct)

# Code which does the analysis for a Customer and Product join

end

In this example, there is no if-else logic happening.

If I understand you correctly you’re not using multiple dispatch at all, as you are just picking the analysis based on the value of a keyword argument (which doesn’t participate in dispatch)?

I.e. your function looks like

using DataFrames  # provides leftjoin

function harmonize(df_1, df_2; analysis = :CustomerCost)
    if analysis == :CustomerCost
        df = leftjoin(df_1, df_2, on = [:Customer, :Cost])
        # ... analysis on df ...
    elseif analysis == :CustomerProduct
        df = leftjoin(df_1, df_2, on = [:Customer, :Product])
        # ... analysis on df ...
    else
        error("analysis must be either CustomerCost or CustomerProduct")
    end
end

is that right? If so, I don’t see anything abusive, it’s purely a matter of taste. Depending on how similar the analyses are, you could also pass the grouping variables to the function, i.e. have

harmonize(df_1, df_2; grouper = [:Customer, :Cost])

Just because multiple dispatch is a central feature of Julia doesn’t mean it has to be used for everything :slight_smile:

Personally, I find that for this type of data analysis workflow I’m fine with writing a pretty plain script without any type annotations, but I’m interested to hear what others think!

Hey @nilshg - thanks for commenting! I added some more clarity to my original post with Edit 1. I am hoping that that gives more description about what functionality I was using. Do you have any thoughts based on my update?

Edit 1 is not going to work, since there is no multiple dispatch here, just two overlapping definitions.
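To make the "overlapping definitions" point concrete, here is a minimal sketch that can be pasted into the REPL. The name harmonize_demo and the string return values are made up for illustration; they only stand in for the two bodies from Edit 1. Keyword arguments are not part of the signature Julia dispatches on, so the second definition replaces the first instead of adding a new method.

```julia
# Hypothetical stand-in for the two definitions in Edit 1.
# Keyword arguments are not part of the dispatch signature, so both
# definitions below have the same positional signature: (Any, Any).
function harmonize_demo(a, b; analysis::Symbol = :CustomerCost)
    return "first definition, analysis = $analysis"
end

function harmonize_demo(a, b; analysis::Symbol = :CustomerProduct)
    return "second definition, analysis = $analysis"
end

# Only one method exists, because the second definition replaced the first.
n_methods = length(methods(harmonize_demo))

# Whatever value is passed for `analysis`, the surviving body is the one that runs.
r_default = harmonize_demo(1, 2)
r_cost = harmonize_demo(1, 2; analysis = :CustomerCost)
```

Here n_methods is 1 and both calls run the second body, which is why the approach appeared to work only as long as one of the two analyses was ever exercised.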

Proper multiple dispatch is something like

abstract type Harmonize end
struct CustomerProduct <: Harmonize end
struct CustomerCost <: Harmonize end

harmonize(df1, df2, ::Harmonize) = error("Unknown harmonization")
function harmonize(df1, df2, ::CustomerCost)
...
end


function harmonize(df1, df2, ::CustomerProduct)
...
end

which should be used like

harmonize(df1, df2, CustomerProduct())

But I agree with nilshg here, multiple dispatch shouldn’t be used just because it exists. Ordinary if else is more than adequate for this task.

In the case presented, I don’t think if else is even needed. Why not simply

function harmonize(df_1, df_2; analysis = :Cost)
    leftjoin(df_1, df_2, on = [:Customer, analysis])
end
harmonize(df_1, df_2; analysis = :Cost)
harmonize(df_1, df_2; analysis = :Product)

I guess there is a processing part that is not shown in the example and that differs between scenarios.

Good point

This was essentially my second suggestion above, just passing the grouping variables as a parameter to the analysis function.

I guess the overarching question is, as always, which parts of the workflow are sufficiently generic and repeatable to be factored out into their own functions and reused in different places. And if they turn out to be very generic and usable, they might even become packages that are factored out, unit tested, and pulled in with a simple using MyAnalysisTools at some point (fwiw I’ve never reached that stage :slight_smile: )

Ah, whoops! I confused the idea of overlapping definitions with Multiple Dispatch. I got a bit confused when looking at some example code thinking this was a way to do multiple dispatch.

Thank you for the example Skoffer! That aids a lot more with my understanding of what I am working on.

I agree with your rationale about using if/else blocks. However, what made me interested in trying to use multiple dispatch was the fact that some of these analyses are rather long and I was trying to better organize my code base. Is it reasonable to seek out multiple dispatch for code organization?

You hit the nail on the head @nilshg ! That is certainly the overarching question here and I suppose if I were to rephrase my earlier question about how to effectively use multiple dispatch for data science, it would be “how should multiple dispatch be used for organizing data science work?”

If we’re talking about code organization, then let’s be a little more specific: multiple methods can be used to organize a codebase so that everything isn’t in a single function.

If you are executing different methods depending on the types of your input, then multiple dispatch is the best way to write different methods for each type.

If you are executing different methods depending on the values of your input, then multiple dispatch isn’t the right tool. Here, plain old “value re-routing” from any language will do. For example:

f(x, y; selector = :A) = (selector == :A) ? f_A(x, y) : f_B(x, y)

f_A(x, y) = ...
f_B(x, y) = ...

Sometimes there is no choice but to use a bunch of cases. You can lift the cases into the type system using Val or defining your own singleton types, but I would argue that all of this is equally messy, and Val can make your code unnecessarily opaque. There aren’t many cases where if/else is inherently inferior to the alternatives.
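For reference, a rough sketch of what "lifting the cases into the type system" with Val can look like, next to the if/else version. The function names join_columns and join_columns_ifelse are hypothetical, and the bodies only return the join columns rather than doing any real analysis.

```julia
# Hypothetical example: map an analysis name to its join columns by
# lifting the Symbol into the type domain with Val.
join_columns(::Val{:CustomerCost}) = [:Customer, :Cost]
join_columns(::Val{:CustomerProduct}) = [:Customer, :Product]

# Fallback method for any other Val, so that unknown names fail clearly.
join_columns(::Val{S}) where {S} = error("Unknown analysis: $S")

# Convenience method so that callers can still pass a plain Symbol.
# Val(analysis) is built from a run-time value here, so the method is
# still selected at run time.
join_columns(analysis::Symbol) = join_columns(Val(analysis))

# The equivalent if/else version, for comparison.
function join_columns_ifelse(analysis::Symbol)
    if analysis == :CustomerCost
        return [:Customer, :Cost]
    elseif analysis == :CustomerProduct
        return [:Customer, :Product]
    else
        error("Unknown analysis: $analysis")
    end
end
```

Both versions branch on the same value; the Val version just moves the branch into the method table, which is arguably harder to read for a case like this.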

If you find yourself in this situation, then either it is intrinsic to the code you are writing, or you could refactor your code to avoid this if/else. Refactoring to be more generic like the grouper keyword above is usually the right solution. Closures are an effective way of doing this. For example, if f above is written as

function f_A(x, y)
    # code specific to A produces a result

    # generic code consumes result
end

function f_B(x, y)
    # code specific to B produces a result

    # generic code consumes result
end

Then you can write it as

function select_A(x, y)
    # code specific to A produces a result
end

function select_B(x, y)
    # code specific to B produces a result
end

function f(x, y; selector = select_A)
    result = selector(x, y)
    # generic code consumes result and produces final output
end

So basically, instead of passing in a keyword symbol that does X, Y, and Z depending on the symbol, write X, Y, and Z as separate functions. Then rewrite the original function to accept a function instead of a symbol.

Great solution and explanation! Small note though, I don’t think that the last example is an example of a closure. A closure is a function that “closes” over part of its surrounding environment and stores it. In this case, the function passed in could be a closure but doesn’t have to be.

Good point! Classic mistake where I change my mind halfway through the example and don’t update what I wrote previously.

So the addendum to my example: passing in functions as arguments doesn’t have anything to do with closures. But if you write your code to accept functions as arguments, then closures can be a way for the caller to pass in a “custom” function that closes over a bunch of variables in the caller’s scope. Often, I find that functions with lots of keyword arguments and “selector” switches can have all those customizations abstracted away into a function argument. Then closures let you write the function to pass in for that argument in a way that matches a generic interface.
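A small sketch of that last point, using made-up names (summarize, make_weighted, weight) and toy vectors instead of DataFrames. The generic function only knows that selector takes two arguments; the caller uses a closure to smuggle in an extra setting without adding another keyword argument.

```julia
# Generic function that accepts any two-argument function as `selector`.
# The names here are hypothetical and only illustrate the pattern.
function summarize(x, y; selector = (a, b) -> a .+ b)
    result = selector(x, y)
    return sum(result)   # generic code that consumes the result
end

# The caller builds a closure: `weighted` captures `weight` from the
# enclosing scope while still matching the (x, y) interface.
function make_weighted(weight)
    weighted(a, b) = weight .* a .+ b
    return weighted
end

x = [1, 2, 3]
y = [10, 20, 30]

# Uses the default selector, which is not a closure over caller state.
plain_total = summarize(x, y)

# Passes a closure that carries `weight = 2` along with it.
weighted_total = summarize(x, y; selector = make_weighted(2))
```

Note that summarize never needs a weight keyword; any customization the caller wants can live inside the function they pass in.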