Developing annotation standards for SciML to support reproducibility of published work

Does a project like this already exist outside of Julia?

The current level of integration among Julia packages for machine learning, mechanistic modeling, and data seems to provide a unique opportunity to develop a markup standard, together with a Julia-based reference implementation (see (2) below), that can aid the reproducibility of scholarly works. I mean reproducibility in the scientific, language-agnostic sense, e.g. this standard, which unfortunately neglects parameterization techniques and anything related to machine learning.

Even in domains like biology, where standards like the one above existed back when models were essentially parameterized by hand by experts, the level of adoption has been really low. Back then a paper could be annotated comprehensively without the issue we currently face, which is how to annotate the learning of model parameters, or even of model structure, in addition to annotating the data and the model itself. Some reasons may include:

  1. The high cost (in mental effort and hours) that an author faces to curate a single publication using an existing standard, like the one above.
  2. In general, standards organizations don’t publish reference implementations to automate (1), probably because they are made up of people who program in many different languages.
  3. Annotation of even a primarily mathematical entity like a parameterized ODE model may somehow be easier from a domain-specific perspective, which hinders development of interdisciplinary markup standards. Based on a very brief exploration, I could find only one repository of such a curation effort, and it is entirely focused on biology (despite the availability of namespaces to support identifiable parameter units).
  4. An interdisciplinary repository would require some group of people aware of the ontologies of their respective domains to work together to ensure consistency.

I have no idea how to address (3-4), but my intuition is that (1-2) could be essentially solved by a team working in a single language via a docstring-like approach. Implementing the model for a paper (learning structural and parameter unknowns, simulating, etc.) and curating the annotation for that paper could then take place at the same time, and the result could even be pushed automatically to a repository like the one above via something resembling package registration.
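As a rough illustration of the docstring-like idea, here is a minimal sketch in plain Python. It is not Julia and not the API of any existing standard or package: `annotated`, `REGISTRY`, `export_registry`, and all of the metadata keys are hypothetical names made up for this example. The point is only that the annotation sits next to the model code and can be exported mechanically.

```python
import json

# Hypothetical in-memory stand-in for a shared model repository.
REGISTRY = []


def annotated(**metadata):
    """Attach annotation metadata to a model function and register it."""
    def wrap(fn):
        # The equation text is taken from the docstring, so the code and
        # its annotation are written (and updated) in the same place.
        fn.annotation = dict(metadata, equation=(fn.__doc__ or "").strip())
        REGISTRY.append(fn)
        return fn
    return wrap


@annotated(
    model="logistic growth",
    state={"x": {"units": "cells"}},
    parameters={
        "r": {"units": "1/day", "fitted": True},   # learned from data
        "K": {"units": "cells", "fitted": False},  # set by hand
    },
)
def logistic(x, r, K):
    """dx/dt = r*x*(1 - x/K)"""
    return r * x * (1 - x / K)


def export_registry():
    """Serialize every registered annotation, e.g. for upload to a repository."""
    return json.dumps([fn.annotation for fn in REGISTRY], indent=2, sort_keys=True)


print(export_registry())
```

The `fitted` flag is one guess at how "this parameter was learned" could be recorded alongside units; a real standard would need much more than this (data provenance, the fitting method, and so on).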

We are building ModelingToolkit towards exactly this. It’s an implementation of a symbolic modeling language with parsers for BioNetGen and CellML (SBML is coming very soon, along with integration with Modia so that Modelica models can be used as well). It extends these systems to other model types and allows for automatic combination of models and compiler transformations on the model form. The way the open compiler system works essentially requires it to have a spec: we follow an LLVM style where valid transformations are functions from a valid ODESystem to a valid ODESystem, so “valid ODESystem” needs to be well-defined for all of the tools to compose (https://mtk.sciml.ai/dev/tutorials/higher_order/ is an example of such a pass). This is still somewhat early in the process, but it is something that we are standardizing, and we have libraries being written in many domains:

  • Power systems
  • Quantitative systems pharmacology
  • Systems biology
  • Pharmacokinetics/pharmacodynamics
  • Systems neuroscience
  • HVAC and building simulations
  • Electrical circuits are coming soon.
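To make the “valid ODESystem to valid ODESystem” idea above concrete, here is a small sketch of an order-lowering pass in plain Python. It does not use ModelingToolkit, and the `Equation` and `ODESystem` classes, `is_valid`, `lower_order`, and the `x_d1` naming convention are all assumptions invented for this example. The pass rewrites each equation of order n into n first-order equations and checks validity on the way in and on the way out, which is what lets passes be chained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Dict[str, float]


@dataclass(frozen=True)
class Equation:
    var: str    # name of the unknown, e.g. "x"
    order: int  # order of the derivative on the left-hand side
    rhs: Callable[[float, State], float]  # right-hand side f(t, u)


@dataclass(frozen=True)
class ODESystem:
    equations: Tuple[Equation, ...]


def is_valid(system: ODESystem) -> bool:
    """Toy validity rule: unique unknowns, derivative order at least 1."""
    names = [eq.var for eq in system.equations]
    return len(names) == len(set(names)) and all(eq.order >= 1 for eq in system.equations)


def lower_order(system: ODESystem) -> ODESystem:
    """Pass: rewrite every order-n equation as n first-order equations.

    Convention (an assumption of this sketch): the k-th derivative of
    `x` is stored in the state under the name `x_dk`.
    """
    assert is_valid(system)
    lowered = []
    for eq in system.equations:
        names = [eq.var] + [f"{eq.var}_d{k}" for k in range(1, eq.order)]
        # Chain: d/dt x = x_d1, d/dt x_d1 = x_d2, ...
        for low, high in zip(names, names[1:]):
            lowered.append(Equation(low, 1, lambda t, u, high=high: u[high]))
        # The highest stored derivative evolves by the original right-hand side.
        lowered.append(Equation(names[-1], 1, eq.rhs))
    result = ODESystem(tuple(lowered))
    assert is_valid(result)
    return result


# Harmonic oscillator x'' = -x as a single second-order equation.
oscillator = ODESystem((Equation("x", 2, lambda t, u: -u["x"]),))
first_order = lower_order(oscillator)
print([(eq.var, eq.order) for eq in first_order.equations])
```

Because the output satisfies the same `is_valid` check as the input, another pass can consume it without knowing which passes ran before.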

Recent developments also include nonlinear optimization and nonlinear (stochastic) optimal control in this representation, with all of the free performance improvements and parallelism coming from its compiler.

JuliaComputing is building acceleration tools on top of this stack as premium accelerator passes.

Pumas.ai's Pumas.jl is built on this modeling system. And lastly, we can apply this symbolic system to many pure Julia codes.

To finalize all of this, we need ways to represent full machine learning models inside of this system. That is really just covered by the ability to register arbitrary Julia functions as nodes in the computational graph, together with support for array variables (instead of just the arrays of variables we have now, so struct of arrays instead of array of structs in the symbolic sense), which is something @shashi is working on now.

The final product is both a pure Julia Computer Algebra System (CAS) and a modeling system. The reason is that the two go together: making it easy for users to transform their models into numerically better forms (for example, log transforming variables) means compiler passes on ODE representations, and the easiest way to write such passes is to have everything built on a good, robust, fast, and expressive symbolic system.
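As an example of why a log transform can be “numerically better”, here is a small numeric sketch in plain Python. It is not a symbolic pass and not ModelingToolkit code, and `log_transform` and `euler` are hypothetical helpers written for this illustration. For dx/dt = f(t, x) with x > 0, substituting u = log(x) gives du/dt = f(t, exp(u)) / exp(u), and x = exp(u) then stays positive no matter what the integrator does.

```python
import math


def log_transform(f):
    """Given dx/dt = f(t, x) with x > 0, return g such that du/dt = g(t, u), u = log(x)."""
    def g(t, u):
        x = math.exp(u)
        return f(t, x) / x  # chain rule: du/dt = (dx/dt) / x
    return g


def euler(f, y0, t0, h, steps):
    """Plain explicit Euler, used here only to expose the difference."""
    t, y = t0, y0
    for _ in range(steps):
        y = y + h * f(t, y)
        t = t + h
    return y


k = 3.0


def decay(t, x):
    """Exponential decay dx/dt = -k*x; the exact solution is always positive."""
    return -k * x


h = 0.5  # deliberately too large a step for the original variable
x_direct = euler(decay, 1.0, 0.0, h, 1)
u_final = euler(log_transform(decay), math.log(1.0), 0.0, h, 1)
x_via_log = math.exp(u_final)

print("direct Euler step: ", x_direct)   # negative, which is unphysical
print("log-variable step: ", x_via_log)  # positive by construction
print("exact solution:    ", math.exp(-k * h))
```

For this particular equation the transformed right-hand side is the constant -k, so Euler in the log variable reproduces the exact solution; in general the transform only guarantees positivity, not exactness.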

It may sound like a lot, but there’s buy-in from many different academic groups and companies, and we’ve already demonstrated enough progress that I am confident in saying this will be a reality in the next 2 years. A lot of it is already available at https://mtk.sciml.ai/dev/

Finally, to answer the “standards” or “specification” question: I think it would be easiest to just say that the Julia symbolic script is the specification, since JSON/XML/etc. always tend to have limitations and issues. That said, if we find that another representation of models makes it easier to represent and save subcomponents (especially for use with GUI tools), then we will definitely create one. Right now I think getting the full system off the ground is the main priority.

This is fascinating, thanks