What is a good design pattern for developing parallel types, one designed for safety and the other for performance?

Let’s say I plan to do something a billion times, like simulate draws from a probability distribution.

I want to ensure that my code is correct by writing a bunch of checks into the type constructor. E.g.

julia> struct VectorPD
           events::Vector{T} where T
           probabilities::Vector{Float64}
           function VectorPD(es,ps)
               if abs(sum(ps) - 1.0) >= 0.00001 
                   error("Probabilities must sum to 1")
               elseif any(ps .< 0.0)
                   error("Probabilities must be nonnegative")
               else
                   new(es,ps)
               end
           end
       end
julia> VectorPD([1,2,3],[0.0,0.5,0.5])
VectorPD([1, 2, 3], [0.0, 0.5, 0.5])

julia> VectorPD([1,2,3],[0.0,0.5,0.6])
ERROR: Probabilities must sum to 1
Stacktrace:
 [1] VectorPD(::Array{Int64,1}, ::Array{Float64,1}) at ./none:6
 [2] top-level scope at none:0

julia> VectorPD([1,2,3],[-0.1,0.5,0.6])
ERROR: Probabilities must be nonnegative
Stacktrace:
 [1] VectorPD(::Array{Int64,1}, ::Array{Float64,1}) at ./none:8
 [2] top-level scope at none:0

Great.

Now I want another similar type that doesn’t slow me down by performing the (perhaps costly, and again executed a billion times) safety check.

So I can do

julia> struct UnsafeVectorPD
           events::Vector{T} where T
           probabilities::Vector{Float64}
       end

julia> UnsafeVectorPD([1,2,3],[0.0,0.5,0.5])
UnsafeVectorPD([1, 2, 3], [0.0, 0.5, 0.5])

julia> UnsafeVectorPD([1,2,3],[0.0,0.5,0.6])
UnsafeVectorPD([1, 2, 3], [0.0, 0.5, 0.6])

julia> UnsafeVectorPD([1,2,3],[-0.1,0.5,0.6])
UnsafeVectorPD([1, 2, 3], [-0.1, 0.5, 0.6])

Now say I want to run a billion tests of my code. Say my code refers to the concept of a probability distribution a lot, in all sorts of places.

What I’d like is to be able to instruct the code at a high level to, everywhere in all the different functions that use a probability distribution, use either the safe or unsafe type, depending on what I’m trying to do—i.e., depending on whether I’m testing that the code probably isn’t completely wrong by using the safe type on a small sample, or computing the actual results on a large sample but with the unsafe type:

    run_stuff(1:1_000, VectorPD)
    run_stuff(1:1_000_000_000, UnsafeVectorPD)

Of course, since the 1B case is going to be run a lot of times, I want the methods to be fast.

And because these things are used all throughout the code, I don’t want to redefine two versions of every method that depends on a probability distribution. E.g. I could do

    function run_stuff_safe(...)
        ...
        dependency_safe(...)
        ...
    end
    function dependency_safe(...)
        ...
        dependency_of_dependency_safe(...)
        ...
    end
    ... #etc etc etc
    function final_dependency_safe(...)
        return VectorPD( ... )
    end

and an analogous chain of unsafe versions. But then I’m maintaining two parallel but essentially identical chunks of code.

I could pass the type all the way through from the top level to the bottom level, as an argument, but that seems almost as tedious. All the intermediate functions don’t need to know about which type to use; only the “bottom” one does.

And I think I don’t want to have to use a macro throughout all these dependencies. Though maybe a macro is the solution; I can’t see how.

The other thing I’ve thought of is defining some high-level global reference to the type, and switching it.

    function final_dependency(...)
        global TypeToUse
        return TypeToUse(...)
    end
    TypeToUse = VectorPD
    run_stuff(1:1_000)
    TypeToUse = UnsafeVectorPD
    run_stuff(1:1_000_000_000)

But there again that seems like a poor idea for the obvious reasons.

Any suggestions? I think I’m probably missing something obvious. (NB, in the real use case, it would not be one, but a handful of types that would come in “safe” and “unsafe” flavors and need to be seamlessly swapped in where appropriate.)


Maybe you could try to replicate the @boundscheck and @inbounds logic in Base. I’d actually be interested in learning how we could code macros like this for different kinds of checks.

For the moment perhaps you could just use @boundscheck and @inbounds, even if your checks have nothing to do with indexing.
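To illustrate: a minimal sketch of reusing the `@boundscheck`/`@inbounds` machinery for a non-indexing check. The names `VectorPD2`, `checked_pd`, and `unsafe_call` are made up for this example; the key pieces are `Base.@propagate_inbounds` (which forces inlining and propagates the caller's `@inbounds` state) and the `@boundscheck` block, which is elided when the function is inlined into a caller marked `@inbounds`.

```julia
struct VectorPD2
    events::Vector{Int}
    probabilities::Vector{Float64}
end

# Hypothetical checked constructor helper; the @boundscheck block is
# removed when this function is inlined into an @inbounds caller.
Base.@propagate_inbounds function checked_pd(es, ps)
    @boundscheck begin
        abs(sum(ps) - 1.0) < 1e-5 || error("Probabilities must sum to 1")
        all(p -> p >= 0.0, ps)    || error("Probabilities must be nonnegative")
    end
    return VectorPD2(es, ps)
end

# Wrapper that opts out of the checks at its call sites.
unsafe_call(es, ps) = @inbounds checked_pd(es, ps)

checked_pd([1, 2], [0.5, 0.5])   # checks run here
unsafe_call([1, 2], [0.5, 0.9])  # checks elided when inlined under @inbounds
```

Note the usual caveat from the manual: `@boundscheck` elision only happens when the function actually gets inlined into the `@inbounds` region, which is why `Base.@propagate_inbounds` is used here.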

Here’s a suggestion: Define a function

safe_mode() = true

Then put your checks inside if statements:

if safe_mode()
   # safety checks go here
end

When you want to disable the safety checks, re-define the function

safe_mode() = false

This has to be done at the top level, not from within a function.

The next time you call a function that calls safe_mode() that function will be re-compiled, and constant-folding will remove the safety-checks entirely.
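Putting the pieces together, here is a minimal sketch of that trick (`make_pd` is a hypothetical helper standing in for the real constructor chain):

```julia
safe_mode() = true

function make_pd(es, ps)
    if safe_mode()
        # Safety checks; constant-folded away when safe_mode() == false.
        abs(sum(ps) - 1.0) < 1e-5 || error("Probabilities must sum to 1")
        all(p -> p >= 0.0, ps)    || error("Probabilities must be nonnegative")
    end
    return (events = es, probabilities = ps)  # stand-in for VectorPD(es, ps)
end

make_pd([1, 2], [0.5, 0.5])    # checks run
# make_pd([1, 2], [0.5, 0.9]) # would error while safe_mode() == true

safe_mode() = false            # redefine at top level

make_pd([1, 2], [0.5, 0.9])   # checks compiled away; no error
```

Each top-level redefinition bumps the world age, so callers of `make_pd` are recompiled against the new `safe_mode` on their next invocation.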


This is an interesting idea; thanks.

Would you file this under the heading of “design pattern” or “hack”? It seems like a very clever use of dispatch and compilation. I think it might be suitable for what I’m trying to do, provided it’s not too clever.

if state() 
   in_state()
end

That pattern is not going anywhere, and there is nothing really unusual about it; it comes up in connection with removable @asserts.


Look into the Logging section of the stdlib docs. I write one version of the function with all the safety/debug stuff wrapped in a @debug begin … end block, which is omitted by default unless I run the code inside a with_logger() do … end statement, in which case a second (full) version is compiled and executed.


I think this is what you need to do for a clean solution — either implicitly or explicitly pass the information along the call chain to the inner function dependency. You can pass the “am I in debug mode?” implicitly by stashing it in the task local storage, or explicitly by passing it as a variable. I don’t think @debug is ideal for this (it’s meant to emit a message, not optionally run arbitrary code), though you can try it out (see discussion here Allow empty @logmsg · Issue #29672 · JuliaLang/julia · GitHub). Better than that, just use the task local storage yourself.
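For the implicit route, here is a rough sketch of stashing the flag in task-local storage (the names `debug_mode` and `make_pd_tls` are hypothetical). Unlike the function-redefinition trick, this is a dynamic check, so it costs a dictionary lookup per call, but different tasks can make different decisions:

```julia
# Reads the flag from the current task's local storage; defaults to false.
debug_mode() = get(task_local_storage(), :debug_mode, false)

function make_pd_tls(es, ps)
    if debug_mode()
        abs(sum(ps) - 1.0) < 1e-5 || error("Probabilities must sum to 1")
        all(p -> p >= 0.0, ps)    || error("Probabilities must be nonnegative")
    end
    return (events = es, probabilities = ps)  # stand-in for the real type
end

make_pd_tls([1, 2], [0.5, 0.9])  # no checks by default

task_local_storage(:debug_mode, true) do
    make_pd_tls([1, 2], [0.5, 0.5])    # checks run inside this block
    # make_pd_tls([1, 2], [0.5, 0.9]) # would error here
end
```

The do-block form `task_local_storage(f, key, value)` sets the key for the duration of `f` and restores the previous value afterwards.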

Another option which might appeal more is to monkey patch your module using the fact that you can redefine methods. This makes sense if you occasionally want to run your code in debug mode (as a global decision across all tasks), but don’t want to put up with the performance hit of looking at a global flag in an inner loop. Here’s the trick:

module A

function debug(debugmode)
    name = debugmode ? VectorPD : UnsafeVectorPD
    @eval make_vector_pd(args...) = $name(args...)
end

debug(false)  # define make_vector_pd as unsafe by default

# methods using `make_vector_pd` rather than using `VectorPD` directly

end

A.debug(true)
# compiler will now recompile any methods depending on make_vector_pd
A.debug(false)
# back to release mode

Great stuff. Very interesting.

This looks similar in its effects to @Per’s suggestion, except wrapping in a module.

I’m not sure how to reason about the differences and the implied tradeoffs. (In particular I didn’t know what task local storage was until looking it up just now, and I don’t think my cursory read through the docs is enough to reason about it.)

From your perspective, is there a reason to favor one approach (function defined at module scope + forcing redefinition of method) over @Per’s (function defined at global scope + forcing recompilation of method)?

OK, I read through some of the code and some of the PRs on this. Very interesting. It’s not clear to me yet what’s the best approach, but I will tinker.

Oops my mistake, I’ve been skim reading too much and I missed that @Per’s solution defined a function.

In that case it is functionally the same as my suggestion to use A.debug() with exactly the same tradeoffs:

  • Pro: There is no performance penalty in release mode. The decision about which branch to take is compiled away by the compiler.
  • Con: This is a global decision for the entire process. You can’t have one task using the debug mode and another using the release mode.
  • Con (probably minor): More work for the compiler.

If you do need different tasks to make different decisions about debug mode, then

  • If you want the absolute best performance, you will need to pass the information explicitly (or the “extreme” option: Cassette)
  • If you’re not making the decision in an inner loop, you can pass the information in task-local storage and make a dynamic decision.

It’s possible that I’m missing something, but why don’t you define an abstract type?
Something like:

abstract type AbstractVectorPD end

struct SafeVectorPD <: AbstractVectorPD
	events::Vector{T} where T
	probabilities::Vector{Float64}
	function SafeVectorPD(es,ps)
		if abs(sum(ps) - 1.0) >= 0.00001 
		   error("Probabilities must sum to 1")
		elseif any(ps .< 0.0)
		   error("Probabilities must be nonnegative")
		else
		   new(es,ps)
		end
	end
end

struct UnSafeVectorPD <: AbstractVectorPD
	events::Vector{T} where T
	probabilities::Vector{Float64}
end

Then you should define your functions to accept AbstractVectorPDs:

function run_any_stuff(v::AbstractVectorPD)
    dependency(...)
end

If you have a function that must behave differently, then you can restrict the types:

function do_work(v::UnSafeVectorPD)
     #unsafe but fast
end

function do_work(v::SafeVectorPD)
     #safe, but slow
end

This is good advice in general and was also my first instinct. It’s just that the OP specified that the extra debugging checks were to be deeply nested within the implementation and they would rather not pass the debug flag (or equivalent type information) explicitly down a deep call tree.

Yes, in fact I do do this, for all the safe/unsafe type pairs. So in the signature, all the functions expect the abstract type and operate on it.

The issue is that in my current implementation I also pass the “choice” of type as an extra argument so that down the chain when the type is actually instantiated, the bottom-level instantiating function knows the type.

function run_stuff(...,choice_of_concrete_type)
    x = get_pd(...,choice_of_concrete_type)
    do_some_stuff(x)
end
function do_some_stuff(x::AbstractPD)
    y = expectation(x)
    ...
end
function get_pd(...,choice_of_concrete_type)
    ...
    return choice_of_concrete_type(...)
end

So e.g. you’d call either run_stuff(...,VectorPD) or run_stuff(...,UnsafeVectorPD).

Unfortunately (as far as I can tell) the abstract wrapper doesn’t bear on this problem at all, hence why I left it out.

Exactly. :slight_smile:

These cons are entirely acceptable. I’ll give it a whirl. Thanks!