Dispatch and Symbols

Apologize if this is trivial but couldn’t find a solution reading the manual:

module easy
abstract type Math end
struct Mul <: Math end
struct Div <: Math end

function op(a, b, ::Mul)
    return a*b
end
function op(a, b, ::Div)
    return a/b
end

export op
export Mul
export Div

end # module

How may I change the above module so that instead of

julia> op(2,3,Mul())
6

would be able to do

Julia> op(2, 3, :Mul)

(Mul, Div,… are actually names of DataFrame columns that were previously read)
Thank you!

1 Like
function op(a, b, ::Type{Mul})
    return a*b
end

julia> op(2, 3, Mul)
6

While the above does not dispatch on :Mul, it may be of use.

function op(a, b, ::Val{:Mul})
    return a * b
end

julia> op(2, 3, Val(:Mul))
6

if the structs are not defined

const dataframe_col_name = :Mul
const Mul  = Val(dataframe_col_name)

function op(a, b, ::Val{dataframe_col_name})
    return a * b
end

julia> op(2, 3, Mul)
6
4 Likes

Beautiful! Thank you very much!

A simpler and less repetitious approach is to use a dictionary. Think about how many places you need to change if you add another operation. I see three in your code. I like to strive for just one. A dictionary is not as efficient as dispatch, but perhaps you don’t need that. (You could of course use if statements instead of a dictionary for better performance.)

const op = Dict(
    :Mul => (a,b) -> a * b,
    :Div => (a,b) -> a / b)

Then:

julia> op[:Mul](2, 3)
6
2 Likes

A dictionary is much faster than dynamic dispatch which is what is lwhat will happen here.

But switching on an e.g. an enum with all the different opcodes will most likely be the fastest.

4 Likes

Could you give a small example of this approach?

But switching on an e.g. an enum with all the different opcodes will most likely be the fastest.

Something like

@enum OpCode begin
   MUL
   DIV
end

function evaluate_op(op::OpCode, a, b)
    if op == MUL
        return a * b
    elseif op == DIV
       return a / b
    end
    error("should not happen")
end
julia> evaluate_op(MUL, 2, 3)
6

It’s pretty easy to metaprogram the big switch if one wants a bit neater code.

5 Likes

Yep that’s the if solution I referred to. But the point I was trying to make was more to think about if you really need to eke every last nanosecond out of this evaluation? If not, I’d primarily focus on simple, DRY, readable code.

Btw, in the suggestion above, the error prevents SIMDing the evaluation. If you can guarantee the validity of the input, and the surrounding code also allows for SIMD, you can speed up the evaluation greatly by removing that check. In fact, even if you can’t guarantee the input, you can still do much better by first looping over the data to ensure that it’s valid (which will be fast, since there will be virtually zero branch mispredictions), and then loop over and evaluate the operation. Sample code:

@enum OpCode begin
    MUL
    DIV
end

# SIMD unfriendly
function evaluate_op(op::OpCode, a, b)
    if op == MUL
        return a * b
    elseif op == DIV
        return a / b
    end
    error("should not happen")
end

# SIMD friendly
function evaluate_op2(op::OpCode, a, b)
    if op == MUL
        return a * b
    else
        return a / b
    end
end

N = 100_000
a, b, v = 1:3 .|>_-> rand(N)
ops = rand([MUL, DIV], N)

Timings:

julia> @btime $v .= evaluate_op.($ops, $a, $b);
  393.465 μs (0 allocations: 0 bytes)

julia> @btime $v .= evaluate_op2.($ops, $a, $b);
  57.408 μs (0 allocations: 0 bytes)

Optionally, verifying input:

julia> @btime for op = $ops; op < MUL || op > DIV && error("bad!"); end
  39.312 μs (0 allocations: 0 bytes)
3 Likes

Btw, in the suggestion above, the error prevents SIMDing the evaluation.

  • Using only 2 opcodes might make the conclusions not apply (your version seems to have completely avoided using a lookup table at all, which seems special cased for 2).
  • I doubt this actually SIMDs even without error. Do you actually see SIMD instructions?

thanks a lot!

Yes it SIMDs. First, there’s an easy and mostly accurate way to test if an expression SIMDs or not without scrutinizing the code: run it on data with half the number of bits. Properly SIMDed code should then see a close to 2x speedup. Testing with the above code, first 64 bit floats:

julia> N = 100_000; ops = rand([MUL, DIV], N);

julia> a, b, v = 1:3 .|>_-> rand(N);

julia> @btime $v .= evaluate_op.($ops, $a, $b);
  393.465 μs (0 allocations: 0 bytes)

julia> @btime $v .= evaluate_op2.($ops, $a, $b);
  57.408 μs (0 allocations: 0 bytes)

Now 32 bit floats:

julia> a, b, v = 1:3 .|>_-> rand(Float32, N);

julia> @btime $v .= evaluate_op.($ops, $a, $b);
  403.560 μs (0 allocations: 0 bytes)

julia> @btime $v .= evaluate_op2.($ops, $a, $b);
  27.580 μs (0 allocations: 0 bytes)

So evaluate_op2 most likely SIMDs. To verify, and see how:

julia> @code_native syntax=:intel broadcast!(evaluate_op2, v, ops, a, b)
...
L1536:
	vmovups	ymm1, ymmword ptr [rdx + 4*rbx]
	vmovups	ymm2, ymmword ptr [r11 + 4*rbx]
	vpcmpeqd	ymm3, ymm0, ymmword ptr [rcx + 4*rbx]
	vdivps	ymm4, ymm1, ymm2
	vmulps	ymm1, ymm1, ymm2
	vblendvps	ymm1, ymm4, ymm1, ymm3
	vmovups	ymmword ptr [rsi + 4*rbx], ymm1
	add	rbx, 8
	cmp	rdi, rbx
	jne	L1536
...

It’s quite clever, what this code does: It loads 256 bits of values at a time from each of the a and b input vectors (on my AVX2 laptop, newer CPUs would presumably work with 512 bits). That is 4x 64 bit floats, or 8x 32 bit floats. It then both multiplies and divides all values with each other, and finally selects (blends) either the product or quotient depending on the operation for each value.

Although only four 64 bit floats are processed at a time, and although the code does additional work (always both multiplies and divides), one might wonder why it’s a whopping 7 times faster than the non-SIMD version. The answer is that the code above is completely branchless (apart from the loop condition), whereas the non-SIMD version will incur constant costly branch misses. To see that this is the case, we can use a smaller vector, so that the CPU will learn the branching behavior during the benchmark:

julia> N = 1000; ops = rand([MUL, DIV], N);

julia> a, b, v = 1:3 .|>_-> rand(N);

julia> @btime $v .= evaluate_op.($ops, $a, $b);
  1.211 μs (0 allocations: 0 bytes)

julia> @btime $v .= evaluate_op2.($ops, $a, $b);
  572.929 ns (0 allocations: 0 bytes)

Now, without branch mispredictions, the SIMD version is only ~2x faster than the SIMD version, which intuitively makes sense (processes 4 values per iteration, but does 2x the amount of work).

And no, this particular pattern won’t scale very well if more operations are added.

Sorry, long post, probably irrelevant to the OP, I got a bit carried away there.

10 Likes