JuliaCon 2020 Birds of a Feather

The “Probabilistic Programming in Julia” BoF proposal was accepted! Here’s the text:

Julia’s PPL community is strong and growing. The various groups already have a strong professional relationship, but a BoF would help to formalize this somewhat, as well as being a good introduction for newcomers to the community.

Possible topics (just a starting point really):

  • Standardizing output data structures for samplers
  • Making distributions extensible, and AD- and GPU-friendly
  • Interop across PPLs
  • Connecting with non-PPL libraries (e.g. Flux)
  • What existing capabilities (in Python, etc.) is Julia PPL missing?
  • What’s next for Julia PPL?
  • What’s next for PPL in general?

As it says, this was intended as a starting point, and I think it’s fine if we want to adjust the scope of the discussion. Also, this is my first time leading one of these, so I’d love any thoughts on strategies for making it as productive as possible. Finally, since JuliaCon is now virtual and tickets are free, I’d guess we might have some people joining us from the broader PPL community.

19 Likes

Related to these ideas, here is another possible suggestion for discussion: implementing MLJ interfaces for the different Julia PPLs.

I know that some work has been done on this, e.g. SossMLJ.jl. It would be great to have interfaces like this for all of the PPLs in Julia, and to get those interfaces to a “ready for use by the average user” state.

@tlienart and @ablaom may be interested in this.

2 Likes

Some relevant discussion here:

3 Likes

So how will this be organized? A Zoom call or something similar?

1 Like

Some thoughts/questions:

  • I don’t know how many people we’ll end up with, but with the nature of the conference we should assume it will be larger than the last JuliaCon. In particular, there’s a significant PPL community outside Julia that may be interested.
  • Definitely a video conference call of some sort. Preferences? Differences in capabilities in Zoom vs Google? Others to consider?
  • We’ll need a way to collaboratively document as we go. Ideally this would have
    • Markdown support (code + LaTeX)
    • Good stability
    • A way to control who can edit, who can comment
  • We should work with organizers of the conference and of other BoF sessions, to share ideas and experience.

As for material, one possibility would be to set some goals here for what we’d like Julia PPL to look like in five years. Then the BoF could focus on next steps for getting us there.

3 Likes

Happy to participate.

4 Likes

We can go with whatever the JuliaCon organizers recommend.

It might be interesting to have a discussion about how compiler tools might help PP?

Personally, I’ve been experimenting with IRTools, Cassette, Mjolnir, etc., and there’s a rich body of prior work behind this experimentation in IRTracker (Turing) and Gen’s static modeling language.

With Julia 1.6 and beyond around the corner (including new work on AbstractInterpreter, which should stabilize Cassette somewhat?), it might make for an interesting discussion. What inference optimizations, if any, can be performed with flexible access to the IR, or with IR metaprogramming?

I personally am really curious about this, but I operate in a bit of a bubble outside of the main PP groups here, so it would be cool to open the discussion up a bit.

3 Likes

Just an update on this…

There’s a Discord server set up for discussions. If you’re registered, you should have received an email in the last few days with a link to it.

When you registered you should have gotten an email from EventBrite with a confirmation number. You’ll need that to log in. Mine was from early May so I had forgotten about it altogether - you might need to dig a bit.

2 Likes

Let’s plan a monthly call to talk about PPL development. Who’s in?

10 Likes

I’m in!

Let me know how I can help out.

2 Likes

I’m also in. Why don’t we shoot for the end of August?

2 Likes

So one thing that I hope comes out of those regular PPL meetings is a unification of the random variable tracing data structures in Turing, Gen, Soss, ProbabilityModels, and other Julia-based PPLs, and separating these data structures into a common package that we can all depend on. We can keep the flexibility of having multiple different structures for different models which Gen and Soss have and apply the performance tricks of Turing’s VarInfo where applicable, but it would help if we at least have a single abstract type with a well-defined interface that we all agree upon and use in our different PPLs. Ideally the concrete implementations can also be re-used in all the PPLs where possible.

This separation of data structures into a separate package should also allow more community-wide contributions to the data structures side of PPLs in general, irrespective of sampling, e.g. improving cache efficiency, GPU-compatibility, etc. The unified interface that we agree upon will also probably enable us to better understand the similarities and differences between the various PPLs in Julia and will probably make interop between them easier as well.
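To make the proposal a bit more concrete, here is a minimal sketch of what a shared abstract trace type with a small agreed-upon interface could look like. It is written in Python purely as a language-neutral illustration. Every name in it (`AbstractTrace`, `DictTrace`, `get_value`, `set_value`, `addresses`, `logp`, `summarize`) is hypothetical and is not the API of Turing, Gen, Soss, or ProbabilityModels.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, Tuple

# Hypothetical address type, e.g. ("mu",) or ("obs", "3").
Address = Tuple[str, ...]


class AbstractTrace(ABC):
    """Hypothetical shared interface; not the API of any existing PPL."""

    @abstractmethod
    def get_value(self, address: Address) -> Any:
        """Return the value recorded at `address`."""

    @abstractmethod
    def set_value(self, address: Address, value: Any, logp: float) -> None:
        """Record a value and its log-density contribution at `address`."""

    @abstractmethod
    def addresses(self) -> Iterator[Address]:
        """Iterate over all recorded addresses."""

    @abstractmethod
    def logp(self) -> float:
        """Return the total log density of the recorded choices."""


class DictTrace(AbstractTrace):
    """Simplest concrete trace: one flat dictionary, fully dynamic."""

    def __init__(self) -> None:
        self._entries: Dict[Address, Tuple[Any, float]] = {}

    def get_value(self, address):
        return self._entries[address][0]

    def set_value(self, address, value, logp):
        self._entries[address] = (value, logp)

    def addresses(self):
        return iter(self._entries)

    def logp(self):
        return sum(lp for _, lp in self._entries.values())


def summarize(trace: AbstractTrace) -> Dict[Address, Any]:
    """Code written against the interface works with any concrete trace."""
    return {addr: trace.get_value(addr) for addr in trace.addresses()}
```

The point of the sketch is only that functions like `summarize` depend on the abstract type, so a faster or more specialized concrete trace could be swapped in without touching them.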

CC: @Marco_Cusumano-Towne @alex-lew @Elrod @yebai @Kai_Xu @trappmartin @devmotion

1 Like

Hi, I’m sorry for missing the BoF. Is there a summary or recording of the discussion somewhere?

ProbabilityModels has been “just a few months away” for many months now, so it’d be unwise to forecast when the libraries will be ready (they’ve also all been broken since last November). I’ll start registering them once they’re working again and have basic loop and broadcasting support.

Its handling of data structures is one of the keys to achieving its performance targets.
I would welcome others using and contributing to the same libraries.
I think this is an area with a lot of potential for exploration, but things may be too experimental/immature for others to wish to adopt them.

Implementation details and data layouts specified in a ProbabilityModels model would be undefined. This would be true not only for temporaries, but also for the sampled parameters and the data used for a fitted model (which may be transformed as a one-time cost). Possibilities include permuting arbitrary dimensions, or switching from a traditional memory layout like column major to tile-major or an interleaved format (i.e., split a non-leading axis in two, and permute one of the halves to the front). This would all be based on cost models, without any guarantee of a similar layout between versions or architectures. The cost modeling would be implemented by (a) defining costs for a large number of “built-in” functions like matrix multiplication, a few decompositions, and a host of log-pdfs and their gradient functions, and (b) extending LoopVectorization to estimate the costs of the various data layouts for any loops.
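As a rough illustration of the interleaved layout (a sketch in Python, not code from ProbabilityModels or PaddedMatrices), the index arithmetic below splits the column axis of a column-major matrix into blocks of `width` columns and moves the within-block position to the front, so that it becomes the fastest-varying index. The function names and the assumption that the number of columns is a multiple of `width` are mine.

```python
def column_major_offset(i, j, n_rows):
    """Flat offset of element (i, j) in a traditional column-major layout."""
    return i + n_rows * j


def interleaved_offset(i, j, n_rows, width):
    """Flat offset of element (i, j) after splitting the column axis.

    The column index j is split into (block, lane) with lane < width, and
    lane is permuted to the front. Assumes the number of columns is a
    multiple of `width` (an assumption of this sketch).
    """
    block, lane = divmod(j, width)
    return lane + width * (i + n_rows * block)


def pack_interleaved(rows, width):
    """Copy a matrix given as a list of rows into the interleaved layout."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [None] * (n_rows * n_cols)
    for i in range(n_rows):
        for j in range(n_cols):
            flat[interleaved_offset(i, j, n_rows, width)] = rows[i][j]
    return flat
```

With `width` matched to the SIMD vector width, the same-row entries of `width` adjacent columns end up contiguous in memory, which is the property such a layout is after.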

Most of the data-layout changes will come later; for now they’re just something for me to be mindful of, to make sure nothing gets in the way of implementing them. The one thing I have implemented to that end so far is the ordering of the constrained parameters with respect to the unconstrained parameter vector used by HMC: it is chosen to maximize the number of aligned accesses when reading from and writing to the unconstrained parameter and gradient vectors.

Internal data structures are all handled with PaddedMatrices.PtrArray, which allows passing around a “stack” pointer on which to allocate memory used within the model. Using PtrArrays also makes it easy to “allocate” memory for adjoint data structures with a pointer to the correct position in the gradient vector.

It’d be good to make the parts others are interested in using modular enough that they can be used without taking on any unwanted dependencies.
But I’m not sure how much of the data structures or the general approach others would even want to touch.

Is there a description of these tricks somewhere?

@mohamed82008 A few comments:

  1. The shape of the trace can be specialized according to the model/DSL in trace-based systems. Generically, this boils down to high-performance dictionary operations in Gen’s dynamic DSL (whose trace is based on a trie, which is implemented using dictionaries). I’m sure there are interesting optimizations to explore there, but that’s a “black box” non-specialized representation which is sufficiently general to allow for most (all?) of Julia in the DSL.

  2. The secret sauce behind Gen’s static DSL is a set of generated functions which create a specialized trace data type and specialized implementations of the GFI methods, when you can express your model in the static language. I’m guessing that performance is close to maximum for this subset of Gen - with respect to trace data types.
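For readers who haven’t looked at this kind of structure, here is a minimal sketch of a trie of random choices built from plain dictionaries, in the spirit of item 1. It is written in Python for illustration only; the class `ChoiceTrie` and its methods are hypothetical names and do not reproduce Gen’s actual implementation or the GFI.

```python
class ChoiceTrie:
    """Hypothetical dictionary-backed trie mapping hierarchical addresses to values."""

    def __init__(self):
        self.leaves = {}    # last address component -> recorded value
        self.children = {}  # address component -> nested ChoiceTrie

    def set(self, address, value):
        """Record `value` at `address`, a non-empty tuple such as ("sub", "y")."""
        *prefix, last = address
        node = self
        for key in prefix:
            # Create intermediate nodes on demand.
            node = node.children.setdefault(key, ChoiceTrie())
        node.leaves[last] = value

    def get(self, address):
        """Look up the value at `address`; raises KeyError if it is absent."""
        *prefix, last = address
        node = self
        for key in prefix:
            node = node.children[key]
        return node.leaves[last]

    def items(self, prefix=()):
        """Yield (address, value) pairs for every recorded choice."""
        for key, value in self.leaves.items():
            yield prefix + (key,), value
        for key, child in self.children.items():
            yield from child.items(prefix + (key,))
```

Every access goes through one dictionary lookup per address component, which is what makes this representation fully general but non-specialized. A static approach like the one in item 2 would instead generate a trace type with a fixed field per address.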

I am not as familiar with Turing’s internals as I should be, but one thing which sort of pains me is that both Turing and Gen require an abstract model interface. In Turing, this allows for inheritance and extension from AbstractMCMC, which is the key link to samplers and inference. In Gen, this provides a similar interface to inference via the GFI. These interfaces are logically the right thing to do if you have to set up calls and state via macro expansion for your DSL, or if you have to link the expanded form of your model to a sampling engine. The main communication issue is that Gen requires the GFI, and all algorithms are written using the GFI, whereas Turing requires AbstractMCMC, and all algorithms extend the interfaces defined therein.

Basically - I know why this is useful…but I have a strong (and possibly misguided/wrong) opinion that there is no need for such an interface, and that these interfaces can be built directly into the compiler. I can’t prove that this will work for AbstractMCMC yet, but I can prove that it will work for the GFI. It’s actually easy to see why you can do away with the abstract interface in Gen’s case, because the point of macro expansion there is to emulate context-oriented programming. I don’t think the tools for this style of library (i.e. the Zygote or Cassette style) had been developed when Marco started working on Gen, nor are they considered stable even now, so they are not a great foundation on which to build a high-performance library. However, a good chunk of the future of Julia relies on stable and performant ways to parametrize the compiler pipeline - a sort of reconfigurable JIT compiler.

In these monthly meetings, I’d like to keep these compiler interfaces in mind. It’s well known that AD performance is likely to improve drastically given access to these pipeline tools. Beyond that improvement, which benefits all libraries, I predict the same for probabilistic programming. Furthermore, my suspicion is that integration across systems will crucially rely on some of these new power tools.

@cscherrer was recording something, as I recall, but probably not everything.

Those are the kinds of optimizations that I would love to see used across all of Julia’s PPLs. It’s great to have different PPLs with different design choices in the design space, but if some methods or data structures are provably better than others, then it would be unwise to limit them to a single PPL if we can have them as a common dependency.

Compared to your work in ProbabilityModels, PaddedMatrices, LoopVectorization, etc., it’s not much of a trick :sweat_smile: The idea is to specialize the types of containers in the trace of dynamic probabilistic programs as type information about the variables becomes available, to allow for efficient codegen even for dynamic models. This type specialization, together with a considerable number of generated functions, achieves significantly better performance than generically parameterized container types.

I have actually read that part of Gen, and yes, I like the idea of using tries here, but they don’t come without challenges, e.g. AD now becomes harder compared to the vectorized approach that Turing and, I think, ProbabilityModels also use. There are trade-offs. Understanding these trade-offs and experimenting with different implementations for each class of models is the goal of my proposal above to separate them out. It also allows us to benefit from the immense experience of people like @Elrod to optimize the underlying data structures where possible. For example, one possible optimization would be to analyze the access pattern of values in the trace data structure and then permute and align the values in memory to minimize the number of cache misses.

Just using “generated functions” doesn’t always give the best performance as packages like PaddedMatrices show. PaddedMatrices beats StaticArrays even for small array operations while using normal Julia arrays under the hood. Both use generated functions in different ways. So I would keep an open mind here and try different data structures and approaches.

If I recall correctly, Gen’s GFI and AbstractMCMC are not analogous. I think they are more complementary. The GFI is a lower-level interface that gives more control when writing inference algorithms, so it is intended for (power) users. AbstractMCMC just automates the high-level sampling logic and things like chain parallelism. Ideally, the AbstractMCMC.step function can be written using a set of GFI function calls. These GFI function calls can be different for different samplers.

Any such interface can in theory not exist by just inlining the relevant implementations in a gigantic function that the compiler emits. But the interfaces provide some nice well-tested abstractions that allow us to more easily reason about the code and write more complex logic on top. AbstractMCMC was born from the observation that there was a lot of repeated code in Turing in the implementation of different inference algorithms because each sample method was implementing the same stepping logic in a slightly different way. Abstracting the logic and separating it out in a package led to AbstractMCMC.
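The kind of abstraction described here can be sketched as follows: one generic driver owns the sampling loop, and each inference algorithm only supplies a single transition. This is a simplified Python illustration with hypothetical names (`sample`, `initial_state`, `step`, `RandomWalkMetropolis`); it is not the actual AbstractMCMC API, which is richer (for example, it also handles chain parallelism).

```python
import math
import random


def sample(logdensity, sampler, n_samples, rng):
    """Generic driver: owns the loop and delegates one transition to the sampler."""
    state = sampler.initial_state(logdensity, rng)
    chain = []
    for _ in range(n_samples):
        state = sampler.step(logdensity, state, rng)
        chain.append(state)
    return chain


class RandomWalkMetropolis:
    """One concrete sampler; it only has to define how a single step works."""

    def __init__(self, scale):
        self.scale = scale  # standard deviation of the Gaussian proposal

    def initial_state(self, logdensity, rng):
        return 0.0  # fixed starting point, an assumption of this sketch

    def step(self, logdensity, x, rng):
        proposal = x + rng.gauss(0.0, self.scale)
        log_accept = logdensity(proposal) - logdensity(x)
        # Accept with probability min(1, exp(log_accept)).
        if rng.random() < math.exp(min(0.0, log_accept)):
            return proposal
        return x


def standard_normal_logdensity(x):
    """Unnormalized log density of a standard normal, used as a toy target."""
    return -0.5 * x * x
```

For example, `sample(standard_normal_logdensity, RandomWalkMetropolis(1.0), 1000, random.Random(0))` returns a list of 1000 draws. A second sampler would reuse `sample` unchanged and only provide its own `step`, which is where the repeated code goes away.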

I am excited for the possibilities that these developments will open up in the PPL space. Fast AD will put Turing ahead of Stan on almost all of the benchmarks, since AD is our performance bottleneck in HMC. But I think the main challenge moving forward is to find sufficiently important and self-contained problems, common to all PPLs in Julia, that can be solved using this new compiler technology. I think some examples right now are: macro-free Bayesian inference (Poirot and Jaynes), exploiting independence structures in models (IRTracker), and faster codegen for Zygote. I would actually love to hear more from you, @McCoy, about the design of Jaynes and what challenges you faced when developing it or trying to integrate it with different PPLs. This could be one of the PPL meetups. I would also love it if @Elrod gave us an overview of the optimizations used in ProbabilityModels in another talk.

2 Likes

Thank you for this detailed response. I’ll digest.

@mohamed82008 in the meantime, the design of Jaynes is easy to understand:

  1. Take each GFI method from Gen, and translate it into a dynamo.

  2. Special implementations of the methods (i.e. for the equivalent of Gen’s combinators) are activated by special calls. The fallback interception call is rand.

That’s all there is to it. I did change a few other design decisions along the way (e.g. learnable parameters are kept separate from models, whereas in Gen these are kept as state on the GenerativeFunction).

There are a number of usability benefits with this approach, although there is an argument to be made that it makes it easier for the user to write programs which are invalid measures over choice maps. In designing the interface to Gen, my hope was to show that this approach is completely compatible with Gen as it is now!

And the answer appears to be yes. You can use all the nice stuff that you use in Gen and Jaynes together. In this sense, Jaynes is a sort of more permissive dynamic DSL, without the kinks with multiple dispatch. But you can use the specialized representations and all of Gen’s inference library, and it should just work.

That’s my thought too, maybe we could meet on the last ____ of every month. Any strong preferences? Is this big enough we should have a poll or something?

We’ll also need to think about time zones, hopefully find something that’s not the middle of the night for any of us. That’s potentially tough to manage. But then, people do have funny sleep schedules. So maybe we need “my waking hours in UTC” for each of us, or something like that?

@ckneale I think it’s safe to say any of us would be happy to have contributions of any kind, from example notebooks to docs to new inference methods. It’s also entirely possible for anyone with a good math and computing background to get deep into this stuff. I, for one, have never had a formal class in Bayesian analysis or compilers. Hopefully that’s not too obvious :wink:

I agree it would be great to have a general-purpose abstract type for tracing. @alex-lew helped me connect Soss to the trace type needed for Gen, but I have no idea how this relates to Turing’s representation. It would be good to get a better understanding of the VarInfo stuff and see if there’s a best-of-both-worlds, or if some tradeoffs are unavoidable.

My big question here is, is there an opportunity for an IR between the model syntax (as input by the user) and this lower-level representation? Soss can generate whatever code you want and focuses on high-level things like first-class models with transformations between them. It seems really natural to connect with ProbabilityModels.

This is really interesting. I agree there’s been really great progress in this area (e.g., @MikeInnes’s JuliaCon talk). For Soss, it would be really interesting to explore the possibilities of syntax-based codegen handing off to lower-level IR transforms.

Yeah I just had my phone transcribing what it could hear, so it’s just me. Lots about Jen and Jane’s touring sauce :wink: . I’ll see about cleaning it to get a summary, but it’s missing a lot.

Completely agree here. Soss uses generated functions, but it’s not yet very smart about what it generates, so I’d guess Turing is still faster. OTOH, customized code that uses more information will always give the opportunity for more speed. The question is just whether the speed gain is worth the time it takes to do that analysis and to generate and compile the code.

Agreed! @Elrod I’d be very interested in a deep dive into the approach if/when you have the time.

2 Likes

I’m on the West Coast of the US, so I’m happy to wake up nice and early (after 6am PST) to give a bigger call-in window to people from other time zones. Most other times are fine.

1 Like