Letting a package use external user-created functions

Suppose I have a simulation package

module MySimPackage
    abstract type AbstractInteraction end
    export AbstractInteraction, run_simulation

    struct Interaction1 <: AbstractInteraction
        # stuff
    end

    function compute_interaction(i::Interaction1)
        # compute something
    end

    function run_simulation(interactions::Array{AbstractInteraction, 1})
        for interaction in interactions
            compute_interaction(interaction)
        end
    end
end

Now, suppose I want to introduce a new custom interaction CustomInteraction with a corresponding compute_interaction(i::CustomInteraction). One way I can do this is to incorporate this into my package, commit the changes, then download/update my package on the cluster where I want to run my simulations. However, this is a bit tedious, especially if I might not want to officially include this function in my package.

Is there a way I can write an external file, say CustomInteractions.jl, and export CustomInteraction and compute_interaction(i::CustomInteraction) in such a way that the package uses these custom structs and functions? I tried

using MySimPackage

export CustomInteraction
export compute_interaction
struct CustomInteraction <: AbstractInteraction
    # custom vars
end

function compute_interaction(i::CustomInteraction)
    # compute something
end

but I get the error Method error: no method matching compute_interactions(::CustomInteractions)

I apologize for the spam. I just needed to write

function MySimPackage.compute_interaction(i::CustomInteraction)

end

Is this the proper way to do it?
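For reference, here is a minimal self-contained sketch of that pattern; ToySim is a hypothetical stand-in for MySimPackage, and the field names are made up for illustration:

```julia
module ToySim  # hypothetical stand-in for MySimPackage
    abstract type AbstractInteraction end
    export AbstractInteraction, run_simulation

    struct Interaction1 <: AbstractInteraction
        stuff::Float64
    end
    compute_interaction(i::Interaction1) = i.stuff

    run_simulation(interactions) = [compute_interaction(i) for i in interactions]
end

using .ToySim

# A user-defined interaction, outside the "package":
struct CustomInteraction <: AbstractInteraction
    strength::Float64
end

# The qualified name extends the package's own generic function, so
# run_simulation (which calls ToySim.compute_interaction) finds it:
ToySim.compute_interaction(i::CustomInteraction) = 2 * i.strength

run_simulation([CustomInteraction(1.5), CustomInteraction(2.0)])  # [3.0, 4.0]
```

Without the `ToySim.` qualification, the last definition would create a *new* function `Main.compute_interaction`, which the package never calls, hence the MethodError.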


You can use a local copy of the package, or create a fork or branch in your repo, and load that in a custom environment without having to change the actual package or its installation in the main environment. I cannot give you an example now, but that is probably what you want.

Thanks for the response. That is certainly one way to do it. However, I want to avoid having the user do too much. Ideally, a user should just be able to add the package to their Julia installation and set up their own simulation (including custom interactions) in a separate run file, without having to create forks/branches or modify the package on their end.

Ah, in that case you should pass the function that computes the interactions as a parameter (optional, probably) to the simulation function. Then the user will be able to define custom functions without changing the package at all.


I am a little confused about where we are in the conversation, but it seems to me like you have not received a direct answer.

@Michael_Wang: What you wrote here looks correct. @lmiq gave you alternative solutions, but it doesn’t appear to me that you were doing anything wrong.

Your solution basically looks like how Julia’s type system was intended to be used. You defined an abstract type and functions in MySimPackage, then used them in your “user” code (the file starting with using MySimPackage). There might be a few bugs in the snippets you wrote, but the overall solution looks fine.

You are basically extending the functionality of MySimPackage by creating more subtypes of AbstractInteraction :+1:.

I suppose you could re-package CustomInteractions.jl as an additional simulation-layer package working on top of MySimPackage if you wanted. But sometimes, you just want to use it to solve your specific problem. In that case, you simply write custom code like you did in your example.

Typos

I suggest you look out for typos. You seem to add an “s” at the end of some function names but not others. Watch out for that.

Array/Vector pitfall

function run_simulation(interactions::Array{AbstractInteraction, 1})
    for interaction in interactions
        compute_interaction(interaction)
    end
end

Watch out for the above signature ::Array{AbstractInteraction, 1}. It means your function only accepts arrays whose element type is exactly AbstractInteraction, which can be constructed using:

i1 = CustomInteraction(...)
i2 = CustomInteraction(...)
interactions = AbstractInteraction[i1, i2]

I suggest using the function signature:

function run_simulation(interactions::Array{T, 1}) where T<:AbstractInteraction

That will allow you to create your interactions array using:

i1 = CustomInteraction(...)
i2 = CustomInteraction(...)
interactions = [i1, i2]

(here, interactions will be of type Array{CustomInteraction, 1})

And as a final tweak, I suggest using the Vector{T} alias for Array{T, 1} as it is more succinct:

function run_simulation(interactions::Vector{T}) where T<:AbstractInteraction
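The difference between the two signatures can be checked directly in the REPL; a minimal sketch with hypothetical types:

```julia
# Hypothetical types, just to compare the two signatures:
abstract type AbstractInteraction end
struct CustomInteraction <: AbstractInteraction end

f_strict(v::Vector{AbstractInteraction}) = length(v)              # exact element type only
f_param(v::Vector{T}) where {T<:AbstractInteraction} = length(v)  # any subtype element type

v1 = [CustomInteraction(), CustomInteraction()]  # Vector{CustomInteraction}
v2 = AbstractInteraction[CustomInteraction()]    # Vector{AbstractInteraction}

f_param(v1)    # works (T == CustomInteraction)
f_param(v2)    # works (T == AbstractInteraction)
f_strict(v2)   # works
# f_strict(v1) # MethodError: Vector{CustomInteraction} is NOT a Vector{AbstractInteraction}
```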

@MA_Laforge Thanks for your response. After I got things working I kind of forgot to check back.

Indeed, the solution proposed by leondromartinez is one way to do it. And of course useful functions should eventually be incorporated into the package itself. The solution I found is for the user (which currently is me) to define CustomInteraction in their own script and the corresponding function using function MySimPackage.compute_interaction().... This, as you say, extends the functionality of the package to take into account CustomInteraction.

In writing the post, I wasn’t too careful with typos! My bad! But in my actual code I made sure things are consistent.

On the subject of parametric types, this is something that I overlooked in my first attempt at the package. I need to overhaul the whole package to do this properly. I do have a question regarding parametric types. In your example

function run_simulation(interactions::Vector{T}) where T <: AbstractInteraction

what happens if interactions is an array of different types of interactions (e.g. CustomInteraction1 and CustomInteraction2)? In this case, is the type of interactions just Vector{AbstractInteraction} (I think this is what I found when testing this)? Is it fine to still use your example of parametric types, or is there a better way to handle this case (such that the abstract type doesn’t appear)?

Underlying the answer of MA_Laforge is the fact that arrays of ConcreteType are not subtypes of arrays of AbstractType, even when ConcreteType is a subtype of AbstractType, hence the need for the type parameter.

@sylvaticus is correct. If you do not understand what he is saying, I highly suggest reading up on these concepts. If you don’t understand this well enough, your development experience will likely be very frustrating.

Having said that, you are correct:

  • Your array of [CustomInteraction1(...), CustomInteraction2(...)] will be a Vector{AbstractInteraction}.
  • But [CustomInteraction1(...), CustomInteraction1(...)] will be a Vector{CustomInteraction1}.

Thus,

function run_simulation(interactions::Vector{T}) where T <: AbstractInteraction

will work in both cases, whereas the version you initially wrote would only work with the first array.

It is almost always bad practice to specialize your function with ::Vector{AbstractInteraction}. When you see this pattern, it usually means the author did not adequately understand Julia’s type system.

Thanks so much! I understand what @sylvaticus is saying. Indeed, the way I have it currently written may lead to errors if the user somehow passes an array that is not of type Vector{AbstractInteraction}. How should I approach the following example?

Suppose I have the function

function compute_interaction(particles_i::Vector{T}, particles_j::Vector{T}) where {T <: AbstractParticle}
    for p_i in particles_i, p_j in particles_j
        p_i.force = p_i.x - p_j.x
    end
end

If I only have one type of particle, say Particle <: AbstractParticle, then this should work fine, assuming that earlier in my code my array of particles is of concrete type Vector{Particle}. Suppose a user comes along and decides they want to define a new custom particle CustomParticle <: AbstractParticle. The same compute_interaction() function will apply, since all particles should have force and x. However, the user now makes an array of particles of both types, so that the array is of type Vector{AbstractParticle}. When this is passed to compute_interaction(), there should be a type instability, because Julia will not be able to infer the types of .force and .x: p_i and p_j are now of type AbstractParticle, at least at compile time.

Is there a proper way to deal with this? One way would be to dispatch to functions which act on specific pairs of particle types, but this seems infeasible when you start introducing many different types of particles. Another way is to explicitly write .force::Float64 and .x::Float64. However, in the future, I want to allow the possibility of using Float32 in some simulations.

Sorry if this was long. The solution is probably very simple and I’m just making things way more complicated than they should be.

Some comments, which may or may not apply to the use case here, but which reflect what is typical in particle simulations:

Generally, when one thinks of changing the type of interaction (Lennard-Jones or gravitational), one defines a new function:

lennard_jones(p1::Particle, p2::Particle) = ...
gravitational(p1::Particle, p2::Particle) = ...

Then one would pass one or the other to the function that runs the simulation:

sim = run_simulation(p, pot=lennard_jones)

function run_simulation(p::Vector{Particle}; pot=default_potential)
    ...
    u += pot(p1, p2)
    ...
end

Furthermore, it is common that each potential energy depends on different parameters; you can then use this syntax to pass the function, with the parameters bound in a closure:

lennard_jones(p1, p2, eps, sig) = ...
sim = run_simulation(p, pot = (p1, p2) -> lennard_jones(p1, p2, eps, sig))

That gives the user complete control over the implementation of the potential energy function, independently of how the call to it is implemented in the run_simulation function.
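A minimal runnable sketch of this pattern (all names and the pair loop are illustrative, not the actual package API):

```julia
struct Particle
    x::Float64
end

# Potential with extra parameters:
lennard_jones(p1, p2, eps, sig) =
    4eps * ((sig / abs(p1.x - p2.x))^12 - (sig / abs(p1.x - p2.x))^6)

# The simulation function only ever calls pot(p1, p2):
function run_simulation(p::Vector{Particle}; pot = (p1, p2) -> 0.0)
    u = 0.0
    for i in 1:length(p)-1, j in i+1:length(p)
        u += pot(p[i], p[j])
    end
    return u
end

p = [Particle(0.0), Particle(1.0)]
# The closure captures eps and sig, so run_simulation never sees them:
eps, sig = 1.0, 0.5
u = run_simulation(p; pot = (p1, p2) -> lennard_jones(p1, p2, eps, sig))
```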

If the user now wants to use the sum of the interactions, he/she can write, for example,

totpot(p1,p2,eps,sig) = lennard_jones(p1,p2,eps,sig) + gravitational(p1,p2)

and pass totpot to run_simulation, avoiding the type-instability problems in the interactions and run-time dispatch. If there are many “types” of interactions, that may be even more transparent to the user.

I think that if one goes in the direction of implementing a general simulation function that accepts lists of interactions of different “types” and particles of different “types”, one will inevitably hit type-instability problems and run-time dispatch, and that will degrade performance. The fastest way to dispatch on multiple types of particles or interactions ends up being manual splitting of the loops with if p isa T1 ... elseif ..., but even that is highly suboptimal relative to restructuring the data (we have recently discussed this quite extensively here). Thus, if you “hard-code” the splitting inside your run_simulation function (with for interaction in interactions...), you are driving the user toward a suboptimal solution to his/her problem.

If you want to give flexibility on the types of interactions and of particles, I think a reasonable strategy is to check which is less fragmented (probably the interactions). For example, if I had thousands of particles of 5 types and 2 types of interactions, I would define 5 structs, each containing all the data of the particles of that type, such as:

abstract type Atoms end

struct NeutralAtoms <: Atoms
  p :: Vector{SVector{3,Float64}} # length = number of particles of type NeutralAtoms
  eps :: Vector{Float64}
  sig :: Vector{Float64}
end
struct ChargedAtoms <: Atoms
  p :: Vector{SVector{3,Float64}} # length = number of particles of type ChargedAtoms
  eps :: Vector{Float64}
  sig :: Vector{Float64}
  charge :: Vector{Float64}
end

and the functions that compute the interactions would perform all the work for a specific pair of types of particles each:

lennard_jones(atoms1::Atoms, atoms2::Atoms) = ...         # all interactions between the two sets of particles
coulomb(atoms1::ChargedAtoms, atoms2::ChargedAtoms) = ... # all interactions between the charged sets

You still have to loop over the types of particles and types of interactions, but the costly part will be inside the functions that compute each set of interactions over all pairs of particles of sets with specific types, and those functions will be type-stable. Run-time dispatch can still occur, but only per pair of types, not per particle, and thus much less frequently.
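A minimal sketch of this layout; the field names and the pair interaction are placeholders (plain Float64 positions instead of SVectors, to keep it self-contained):

```julia
abstract type Atoms end

struct NeutralAtoms <: Atoms
    x::Vector{Float64}
end
struct ChargedAtoms <: Atoms
    x::Vector{Float64}
    charge::Vector{Float64}
end

# One call computes ALL interactions between two homogeneous sets, so the
# hot loop runs on concrete types after a single dispatch per pair of sets:
function pair_energy(a::Atoms, b::Atoms)
    u = 0.0
    for xi in a.x, xj in b.x
        u += (xi - xj)^2  # placeholder interaction
    end
    return u
end

function total_energy(sets::Vector{Atoms})
    u = 0.0
    for i in eachindex(sets), j in i:length(sets)
        u += pair_energy(sets[i], sets[j])  # run-time dispatch: once per pair of sets
    end
    return u
end

sets = Atoms[NeutralAtoms([0.0, 1.0]), ChargedAtoms([2.0], [1.0])]
total_energy(sets)  # 7.0
```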


While @lmiq covered another possible dimension you might want to consider, I will try to address your question more directly.

Firstly, if you use the parameterized declaration you wrote:

function compute_interaction(particles_i::Vector{T}, particles_j::Vector{T}) where {T <: AbstractParticle}

The resultant method will only trap calls where both vectors have the same element type, for example:

  • particles_i::Vector{CustomParticle} & particles_j::Vector{CustomParticle}
  • or particles_i::Vector{AbstractParticle} & particles_j::Vector{AbstractParticle}

That does not appear to be your intent. Instead, I believe you want to compute interactions between particles from any two vectors whose elements are <:AbstractParticle:

function compute_interaction(particles_i::Vector{T1}, particles_j::Vector{T2}) where {T1 <: AbstractParticle, T2 <: AbstractParticle}

Notice how there are two distinct parameters, T1 & T2, that are allowed to differ.

That means this method will also trap calls where the vectors have different element types, for example:

  • particles_i::Vector{CustomParticle} & particles_j::Vector{AbstractParticle}
  • and particles_i::Vector{CustomParticle} & particles_j::Vector{Particle}
  • and all other combinations.

Warning on having two vectors

Having said that, my guess is you probably don’t want to have two vectors. You probably want to loop through all combinations from a single particle vector, and skip the interaction between particle_i and itself.
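For example, a single-vector sketch (hypothetical types, placeholder interaction) that visits each unordered pair exactly once, which also skips self-interactions:

```julia
abstract type AbstractParticle end
struct Particle <: AbstractParticle
    x::Float64
end

function compute_interaction(particles::Vector{T}) where {T<:AbstractParticle}
    total = 0.0
    n = length(particles)
    for i in 1:n-1, j in i+1:n  # pairs with i < j: no (i, i), no double counting
        total += abs(particles[i].x - particles[j].x)  # placeholder interaction
    end
    return total
end

compute_interaction([Particle(0.0), Particle(1.0), Particle(3.0)])  # 6.0
```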

Clearly, when I do an overhaul, I will need to think carefully about the whole structure of the package. I’ll take a look at the discussion you linked. After thinking a little more about the question I posed, I realized that I indeed did make it more complicated than it should have been. I think I can do the following using parametric types (which MA_Laforge suggested)

function compute_interaction(particles_i::Vector{T1}, particles_j::Vector{T2}) where {T1 <: AbstractParticle, T2 <: AbstractParticle}
    ...
    compute_force(p_i, p_j)
    ...
end

where the dispatch function is simply

function compute_force(p_i::T1, p_j::T2) where {T1 <: AbstractParticle, T2 <: AbstractParticle}
    # compute stuff for p_i, p_j
end

Will this function still work if a user defines CustomParticle <: AbstractParticle outside the package?

Ah, right, thanks. I actually originally wrote it your way, with two different types, but then thought the other way might work.

I actually do need two vectors of particles (at least in the current implementation/logic of the package). I find neighbors using Cell Lists, and it turns out that in certain situations (e.g. particles of different sizes) it is more efficient to group particles so that you compute small–small and small–large interactions separately. Indeed, in the case where the two vectors are identical, there is no need for two separate vectors.

Yes, it will.
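A quick self-contained check of that claim, with a toy module standing in for the package (names are illustrative):

```julia
module ToyForces  # hypothetical stand-in for the package
    abstract type AbstractParticle end
    export AbstractParticle
    # Generic method defined inside the package...
    compute_force(p_i::AbstractParticle, p_j::AbstractParticle) = p_i.x - p_j.x
end

using .ToyForces

# ...dispatches fine on a subtype defined entirely outside it:
struct CustomParticle <: AbstractParticle
    x::Float64
end

ToyForces.compute_force(CustomParticle(3.0), CustomParticle(1.0))  # 2.0
```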

Yet, just know that if you do not specify the input types at all, everything will work as well:

function compute_interaction(particles_i, particles_j)

The only reason to annotate types here is if you explicitly want to limit the flexibility of the input to the function. There is no performance benefit.

The performance will be worse if you input vectors of mixed types, but that is probably not a good reason to disallow them altogether.

Edit: if the types have the same fields (for instance, if one has one size and the other a different size), maybe you do not need more than one type of particle, just one type with different field values. That will make things easier (I was imagining circles and squares, for example).


Just to be clear: function compute_interaction(particles_i, particles_j) will perform fine as long as I don’t operate on the vectors particles_i and particles_j within this function, and instead dispatch on the types of each element in the for loop? Or is there a performance degradation simply from looping over abstractly typed vectors?

It will perform well as long as the types of the particles within each vector are the same, which is the case you are enforcing with the parametric types. If the input has that structure, the method will be specialized to it anyway.

That happens when p1 is always of type T1 and p2 always of type T2; it does not depend on the annotations, but on the contents of the input vectors.

I was thinking of something like that. Although I really like the idea of organizing my simulation with different types of particles, which may share fields, instead of cramming all the properties into a single type. This may certainly lead to some performance degradation, but I’m willing to sacrifice some performance for this kind of organization.


Parametric types do allow the possibility that the vector holds different types, in which case it will be of type Vector{AbstractParticle}. In that case, suppose I write

function compute_interaction(particles_i::Vector{T1}, particles_j::Vector{T2}) where {T1 <: AbstractParticle, T2 <: AbstractParticle}
    for p_i in particles_i, p_j in particles_j
        compute(p_i, p_j)
    end
end

function compute(p_i::T1, p_j::T2) where {T1 <: AbstractParticle, T2 <: AbstractParticle}
    ...
end

If particles_i and particles_j are of type Vector{AbstractParticle}, will this code be less performant than if the vectors were of a single concrete type? The compute(p_i, p_j) call should dispatch to the appropriate method for the two types, should it not?

Edit: I was looking through the discussion you linked. I guess there is some performance degradation if I have multiple types, even if I use parametric types?
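One way to see this concretely (with hypothetical types) is to check whether the vector's element type is concrete; compute(p_i, p_j) will still dispatch to the right method at run time, but when the element type is abstract that dispatch happens per element:

```julia
abstract type AbstractParticle end
struct P1 <: AbstractParticle; x::Float64; end
struct P2 <: AbstractParticle; x::Float64; end

mixed = [P1(1.0), P2(2.0)]
eltype(mixed)                  # AbstractParticle
isconcretetype(eltype(mixed))  # false: element accesses are not inferable
same = [P1(1.0), P1(2.0)]
isconcretetype(eltype(same))   # true: the loop body compiles type-stably
```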