Building a Patient Level Prediction Package within Julia

Hey folks, :wave:

I have been considering developing a new package within JuliaHealth that focuses on Patient Level Prediction (PLP). In brief, PLP means defining a patient population that you are interested in (the target population) and a second population (the outcome population) that evolves from the first. A canonical example: you create a target population of patients diagnosed with hypertension and you want to predict who goes on to have an ischemic stroke as a possible outcome.
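To make the target/outcome idea concrete, here is a minimal sketch of how a target cohort could be labeled with an outcome. It is written in Python purely for illustration and is not taken from any existing package: the `Event` record, the `label_outcomes` function, the concept names, and the 365-day time-at-risk window are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    """One clinical event for one patient (hypothetical, simplified record)."""
    patient_id: int
    concept: str       # illustrative label, e.g. "hypertension"
    event_date: date


def label_outcomes(events, target_concept, outcome_concept, risk_window_days=365):
    """Return {patient_id: had_outcome} for every patient in the target cohort.

    A patient enters the target cohort at their first target-concept event
    (the index date). The label is True if an outcome-concept event occurs
    after the index date and within the assumed time-at-risk window.
    """
    # Index date = earliest target-concept event per patient.
    index_dates = {}
    for e in events:
        if e.concept == target_concept:
            previous = index_dates.get(e.patient_id)
            if previous is None or e.event_date < previous:
                index_dates[e.patient_id] = e.event_date

    labels = {pid: False for pid in index_dates}
    window = timedelta(days=risk_window_days)
    for e in events:
        if e.concept == outcome_concept and e.patient_id in index_dates:
            start = index_dates[e.patient_id]
            if start < e.event_date <= start + window:
                labels[e.patient_id] = True
    return labels


# Made-up example data.
events = [
    Event(1, "hypertension", date(2020, 1, 10)),
    Event(1, "ischemic_stroke", date(2020, 6, 1)),
    Event(2, "hypertension", date(2020, 3, 5)),
    Event(3, "ischemic_stroke", date(2020, 2, 1)),  # never enters the target cohort
    Event(4, "hypertension", date(2019, 1, 1)),
    Event(4, "ischemic_stroke", date(2021, 1, 1)),  # outside the 365-day window
]
labels = label_outcomes(events, "hypertension", "ischemic_stroke")
print(labels)
```

Only patients 1, 2, and 4 end up in the target cohort here, and only patient 1 has the outcome inside the window. A real implementation would of course work against the database rather than an in-memory list.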

The data that I work with is retrospective as well as standardized (e.g. patient claims, electronic health records, etc.). What I am wondering is: what would be the best predictive framework within Julia to build such a package? I have examined:

  • MLJ – I love how “off the shelf” and easy it seems to use this package; documentation is great too!
  • ModelingToolkit – I have heard great things about this package but haven’t tinkered too much with it yet; seems a bit broader than MLJ but am unclear
  • Omega – the causal modeling here looks great but I am not sure how flexible it is to answer further questions beyond just causal ones

But I am trying to figure out which would be best for the use cases I laid out. Also, happy to hear any feedback on whether I am missing anything in particular with my line of thinking as is.

Any thoughts? Thanks!

~ tcp :deciduous_tree:


Have you looked at the standard flexible PPLs, e.g. Turing.jl, Soss.jl, and Gen.jl? I’d have thought a good strategy would be playing around with a Turing model, then porting the model to (the more lightweight) AdvancedHMC.


I’ve seldom heard of these packages and terms! What is a PPL?

Would you mind sharing more on the strategy you are imagining? I have no idea about Soss, Turing, Gen, and AdvancedHMC so would be quite curious to hear what you were imagining. I have not had the opportunity to see or use those packages before so I would certainly value your opinion.

Is it correct to assume that you’re looking to cover the kind of functionality that OHDSI’s patient-level prediction R package provides?

I think MLJ is probably the correct choice here. What seems missing to me right now is off-the-shelf integration with some sort of survival modeling. There is a scikit-learn extension package, scikit-survival, and since it seems scikit-learn itself was easily wrapped into MLJ, I imagine it would be straightforward to wrap this package too. Alternatively, for a pure Julia implementation, it looks like the maintainer of the SurvivalAnalysis.jl package might consider integrating with MLJ.

PPL = probabilistic programming language (e.g., Stan, Pyro, Turing.jl, and Omega), useful for very flexible Bayesian inference computations.

My guess is that the audience of your PLP package would not necessarily be interested in detailed Bayesian modeling of the domain, but I could be wrong!


Sorry! Yes PPL is as described above. I mentioned using Turing etc. (and assumed you knew about them) since you mentioned Omega.jl which is a less well known PPL.
Also, this kind of Bayesian modelling is nice for making predictions while incorporating the uncertainty we have about our parameter estimates, and for learning how much the data from one group should tell us about another (and using that info in parameter estimates and predictions), e.g. if you have data clustered by countries, hospitals, etc. See: Statistical Rethinking 2022 Lecture 13 - Multi-Multilevel Models (YouTube).
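As a rough illustration of "how much one group should tell us about another", here is a small sketch of partial pooling across hospitals. It is a deliberately simplified stand-in (a beta-binomial posterior mean with a fixed, assumed prior strength), not a full multilevel model of the kind a PPL would fit, and the hospital counts and the `pooled_rates` function are made up for the example. It is in Python only for illustration.

```python
def pooled_rates(groups, prior_strength=20.0):
    """Shrink each group's observed event rate toward the overall rate.

    groups: {name: (events, patients)}
    prior_strength: assumed number of "pseudo-patients" carried by the prior;
        a real multilevel model would estimate this from the data.
    Returns (overall_rate, {name: shrunken_rate}).
    """
    total_events = sum(k for k, _ in groups.values())
    total_patients = sum(n for _, n in groups.values())
    overall = total_events / total_patients

    # Beta(a, b) prior centred on the overall rate.
    a = overall * prior_strength
    b = (1.0 - overall) * prior_strength

    # Posterior mean of a beta-binomial model for each group.
    rates = {name: (k + a) / (n + a + b) for name, (k, n) in groups.items()}
    return overall, rates


# Made-up counts: a tiny hospital with a noisy rate and a large one.
groups = {"hospital_a": (2, 4), "hospital_b": (100, 1000)}
overall, rates = pooled_rates(groups)
for name, rate in rates.items():
    events_seen, patients_seen = groups[name]
    print(name, "raw:", events_seen / patients_seen, "pooled:", round(rate, 3))
```

The small hospital's raw rate of 0.5 is pulled strongly toward the overall rate because it has only four patients, while the large hospital's estimate barely moves.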

But I don’t know anything about your field, so (as stated above) this may not be the done thing or an attractive approach for your target audience. Also, if you are working with big data, (the more popular) Bayesian methods don’t scale as well, though there are very clever people in the Julia community who can advise on methods for larger data sets. Since you mention MLJ: for smaller data, if you want a more machine learning-y approach, Bayesian neural networks are an option. See: A Bayesian neural network for toxicity prediction (bioRxiv).

I think the first step, before jumping to which tools to use, is to figure out exactly what data/labels you have available, and what task you want to do/what you want to learn. Those can be really tricky questions and might require a fair bit of investigation and data-diving to see what is in your dataset, what the data quality looks like, and what is even possible to learn from the data, or what predictive tasks you can try to accomplish with the labels you have.

Once you have a handle on that, I would suggest the second step is to try to figure out what techniques can help you perform whatever task you want (tons of machine learning options, probabilistic programming, deep learning, etc.).

Likely the next step is to then choose a tool (such as one of the ones you linked) to use to perform that technique, and to give it a try. You may need to revisit earlier steps and decisions depending on what you find and how it goes.


Hey @awasserman,

Wow! Thank you for the very thorough answer!

Yes, you are spot on. Drs. Reps and Schuemie in OHDSI have done great work on developing that ecosystem and I quite enjoy it. However, using some of this tooling, a problem that I am running into is that it is dreadfully slow (too slow for my teams) on large databases and doesn’t compose very well with other R packages. Was hoping to build something in Julia that could solve those two problems within the context of the Julia community!

Yea, MLJ was what I was leaning towards as well. Plus, @RaphaelS1’s work on SurvivalAnalysis.jl is awesome. Integration would be outstanding and I’d love to see it happen.

I’ll have to investigate this further! I still need to read more papers and chat with some folks to get a better handle on all of this!

Thanks for the thoughts and explanations Asher!

Thanks for all the thoughts here – this is super helpful for me in getting a handle on the Julia space. I am newish to figuring out which method libraries to reach for at different data scales. Love the paper too – do you have any more papers that you would recommend?

Firmly agree. I left it out of my initial post, but I am already at that point of familiarity with the data.

This makes sense too. My thought was that this is how it would work practically but wasn’t sure – thanks for shedding light here.

Seconding what Eric said, if the issue is that there are too many different modelling options with disparate interfaces (I know that people have thrown pretty much everything under the sun at EHR data, for example), perhaps the value lies in the non-modelling parts of the workflow? Since the data is standardized with (hopefully) a regular structure, more opinionated interfaces for working with it could be interesting. I imagine you’d want to abstract out some of the logic of defining and managing patient populations too. Doubly so if ad-hoc implementations of it tend to be tedious to write or error-prone.

Keeping the prediction interface as high level as possible would save from having to commit to any particular library or method for creating models.
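For what it's worth, a high-level interface along these lines might look something like the sketch below. Everything in it is hypothetical (the `RiskModel` protocol, the `BaseRateModel` placeholder, and `run_prediction` are names invented for illustration, in Python only because it is compact); the point is just that the workflow code depends on two methods rather than on any particular modelling library.

```python
from typing import List, Protocol, Sequence


class RiskModel(Protocol):
    """Anything that can be fit on labelled patients and return risks."""

    def fit(self, features: Sequence[Sequence[float]], labels: Sequence[int]) -> None:
        ...

    def predict_risk(self, features: Sequence[Sequence[float]]) -> List[float]:
        ...


class BaseRateModel:
    """Trivial placeholder model: predicts the training outcome rate for everyone."""

    def __init__(self) -> None:
        self.rate = 0.0

    def fit(self, features, labels) -> None:
        self.rate = sum(labels) / len(labels)

    def predict_risk(self, features) -> List[float]:
        return [self.rate for _ in features]


def run_prediction(model: RiskModel, train_x, train_y, test_x) -> List[float]:
    """Workflow code that only relies on the RiskModel interface."""
    model.fit(train_x, train_y)
    return model.predict_risk(test_x)


# Made-up feature rows and outcome labels.
risks = run_prediction(
    BaseRateModel(),
    train_x=[[0.0], [1.0], [1.0], [0.0]],
    train_y=[0, 1, 1, 0],
    test_x=[[1.0], [0.0]],
)
print(risks)
```

Swapping in a wrapper around some other library would then only require implementing the same two methods, leaving the cohort and feature code untouched.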


I assume that you’re working with data in an existing OMOP CDM instance (or some similarly standardized data model)?

I think the main question I have is what you would want to do for both cohort building and feature extraction. Both of those tasks have well-trod implementations in the form of OHDSI’s CohortBuilder and FeatureExtractor packages.

I’ve been mulling over the idea of writing my own implementation (Julia-based, of course) of the FeatureExtractor package in particular, since I think that there’s a lot that could be done in a Julia implementation that would improve the ergonomics of the library and perhaps even add some new functionality.

This is accurate, but to be clear, “Big Data” usually means much bigger than you’d think :smile:

Hi there. I am a co-creator and the lead developer of MLJ. Our team knows about SurvivalAnalysis.jl and has had some interactions with its author. We are definitely interested in integration with MLJ, but this is not there yet.

I am happy to provide detailed guidance for implementing MLJ’s interface (which will need some extensions to accommodate survival-analysis types of prediction), but I am not sure we have the resources to push this along by ourselves just now. We were planning on making this a GSoC project. @Sebastian_Vollmer may want to comment on this further.


Hi folks, sorry for the extended hiatus – I have been ruminating on much of the discussion here.

You are correct that I am working with an OMOP CDM instance! In terms of cohort building, I have been developing some JuliaHealth ecosystem tools to enable this. For cohorts, I have developed OMOPCDMCohortCreator.jl; to create connections to any database, I am working on the prototype package DBConnector.jl. I have not yet touched feature extraction, though. It would be great if you did build your own implementation, as HADES does not compose as I would like; I am going through great pains here within JuliaHealth to ensure composition. And that speaks to your point about ergonomics.

@ablaom, great to cross paths! I wholeheartedly love your work with MLJ and am excited about LearnAPI.jl too. Survival analysis is certainly interesting, but a bit beyond what I am imagining doing at the moment. However, thanks for pinging Sebastian – we’ve been having some conversations in the background and I think we are finding a successful path forward here. Thanks!