Swift for TensorFlow rationale

https://github.com/tensorflow/swift/blob/master/docs/WhySwiftForTensorFlow.md

This is interesting and is reminiscent of Julia’s approach in Flux.

What seems divergent is the lack of multiple dispatch (+1 for Julia) and the static memory analysis/guarantees (+1 for Swift).

Any thoughts?

Tagging @MikeInnes in case he doesn’t see this post, given that it’s in the off-topic section.

4 Likes

The topic of programming languages for machine learning was extensively discussed here: On Machine Learning and Programming Languages

1 Like

I’m excited to see what the Swift folks come up with here. We’re obviously targeting a similar set of problems, and there are more than enough differences in philosophy and design choices to keep things interesting.

Since our blog post last December, several “ML as programming”-style frameworks have arrived on the scene; Swift/TF being one, but also Myia and Fluid. Where the original frameworks were built by practising ML researchers filling their own needs (Theano, Torch, Chainer), and the second-wave ones were largely industrialised clones (TensorFlow, PyTorch), this third generation is increasingly being built by compiler and languages people, and it looks very, very different to what came before.

One thing that’s notably missing from the landscape is a truly new language for ML. The attempts in Swift and Julia are beginning to highlight the semantic and engineering challenges and tradeoffs involved, and I expect we’ll see much more on that front over the next few years. Exciting times!

15 Likes

It’s also interesting that they had considered Julia, and had many kind things to say about it. :grinning:
Swift (and Rust) are two other languages of interest to me.

1 Like

In what ways do they seem different? Are user-friendliness or expressiveness among them?

Perhaps you’ve answered this in the blog post?

1 Like

Isn’t Julia that new language? Or do you foresee the need for something else?

1 Like

Yeah, exactly. When I saw the presentation Swift for TensorFlow - TFiwS (TensorFlow Dev Summit 2018) - YouTube, it reminded me of that blog post.

1 Like

I’m also interested in Swift, both for general programming and for data science/ML. So far the data science/ML side outside of iOS/macOS (which has Core ML, Accelerate, etc.) has been very weak, apart from some hobby projects, but TF for Swift could change all that. It’s exciting to see a modern, statically typed, AOT-compiled native language in the data science/ML space. I’ve been wanting this for years. I’ll keep a close eye on it.

The strong interop with Python they are building also seems exciting (especially for data science/ML).

2 Likes

Absolutely. The main difference right now is providing an intuitive programming model while also being able to take advantage of optimisations and new hardware accelerators easily. There’s been less in the way of new PL features that support ML so far, but projects like Myia have a good opportunity to start exploring that area.

As existing languages go, Julia and Swift are by far the best suited, but they’re ultimately still general-purpose languages that weren’t designed with ML in mind. They inherently bring engineering challenges and expressiveness issues that something more specialised might not have.

An engineering example: the Swift docs give a good idea of the challenges involved in extracting TensorFlow-compatible graphs from a program. It sounds like it should be pretty easy to turn m = Chain(Dense(10, 5, relu), ...) into a graph, for example, until you realise that a model might do m.layers[3] = Dense(...) halfway through the forward pass.
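To make that concrete, here’s a contrived sketch (my own, not taken from the Swift docs) of the kind of self-mutating model that defeats ahead-of-time graph extraction:

    using Flux

    # A model that rewrites one of its own layers halfway through the
    # forward pass; any graph extracted before running it would be wrong.
    struct Switcheroo
        layers::Vector{Any}
    end

    function (m::Switcheroo)(x)
        h = m.layers[1](x)
        m.layers[2] = Dense(5, 5, relu)   # hypothetical mutation mid-forward-pass
        return m.layers[2](h)
    end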

While these things are solvable, mutable data structures cause a lot of issues here, as well as with AD and other optimisations, and they are not even necessary for the way people code against data frames and GPUs. A new language could easily have a functional data model and simplify things hugely.

For an expressiveness issue, consider my ideal definition of the Dense (FullyConnected) layer:

Dense(in, out) = x -> W * x .+ b
  where W = randn(out, in), b = randn(out)

The Flux docs actually introduce layering this way, but in real layers we have to define a struct and a bunch of boilerplate. To actually make it work we need to be able to treat closures as data structures (to move them to the GPU, for example) and perhaps have nicer ways to name and refer to closures based on where they came from (i.e. not (::#9)). These really seem like general language-level features that just happen not to be supported anywhere.
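For contrast, here’s a rough sketch of what the struct-plus-boilerplate version looks like (the general shape, not Flux’s exact source):

    # A struct to hold the parameters, a constructor, a call overload, and
    # then extra declarations so the framework can find the parameters
    # (to move them to the GPU, collect gradients, and so on).
    struct Dense{S,T}
        W::S
        b::T
    end

    Dense(in::Integer, out::Integer) = Dense(randn(out, in), randn(out))

    (d::Dense)(x) = d.W * x .+ d.b

    # plus something like Flux.@treelike Dense (or @functor in newer Flux)
    # to register W and b as trainable parameters

All of that is ceremony that the closure version expresses implicitly.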

ML has a bunch of cases like this, where certain patterns seem unusual or even downright backwards from the traditional software engineering standpoint, but turn out to be valid use cases that just aren’t prioritised by mainstream languages. Novel abstractions and language features could help hugely here, which is part of what makes the field exciting for PL people.

13 Likes


Is this something you are working on lately? Could Julia handle this via metaprogramming, i.e. by embedding an ML-specific DSL using macros?

3 Likes

I’m not sure… it seems like a type-system matter, per that use of the where clause.

@MikeInnes, thank you for the explanation. I’m also curious whether anything is planned for these ML-specific facilities.

1 Like

I’ve only heard of Swift for TensorFlow now, in 2019, because of a talk last month at the TF Dev Summit: Swift for TensorFlow: The Next-Generation Machine Learning Framework (TF Dev Summit '19) - YouTube

I’m still trying to understand: what is “Swift for TensorFlow” anyway, and why is it not just a library? Were there limitations in the Swift language that required major changes to its compiler so that things worked better? And would those be limitations that Julia has overcome from the start, with maybe a more sophisticated type system or better support for meta-programming?

3 Likes

Julia has a flexible compilation system that allows things like Zygote to hook in and implement source-level AD*, which means Julia can have really good AD packages as third-party libraries.

Apparently Swift does not, and it is required (or at least preferred)
to build the AD right into the language itself.

Julia made intentional design choices to facilitate the building of things like Cassette and Zygote, whereas Swift did not, until now. And now, rather than exposing hooks so anyone can plug in their own tricks, they are doing it once, and hopefully in a way that works for all users. Which is not necessarily a bad thing – Julia has literally half a dozen reverse-mode AD libraries, which can be bewildering.
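For example, using Zygote as an ordinary package looks something like this (a minimal sketch):

    using Zygote   # a regular third-party package, installed like any other

    f(x) = 3x^2 + 2x + 1

    # Zygote derives the gradient by transforming the IR of f through the
    # compiler hooks mentioned above -- no changes to the language required.
    gradient(f, 5)   # (32,)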

(*The same machinery that was added to hook in here also enables all the other cool non-AD Cassette stuff.)

5 Likes

Isn’t this the same thing that also enabled CUDAnative? It sounds like a pretty important difference between Julia and Swift as platforms for technological development. Can we say this is all because Julia recognizes the importance of meta-programming and staged programming, while Swift isn’t much different from e.g. C++ or Java, and seems to be forcing innovators to “retrofit” the language to get what they need?

2 Likes

The short answer is that ML needs some compiler support – for AD at a minimum, but also ideally things like array optimisations – and in basically all non-Julia languages this means forking the compiler.

Julia’s secret weapons are generated functions, which were originally designed as a kind of type-level macro, but turn out to be general enough – when combined with some other reflection tools – to hook into the compiler in essentially arbitrary ways. CUDAnative was the first to exploit this for compiling to GPUs; Cassette was the second, providing a “contextual dispatch” mechanism motivated by both CUDA support and AD, and later optional IR passes. Zygote was the first semantic transform built on that mechanism, followed by Hydra.jl.
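To give a flavour of the mechanism (a toy example, not how Cassette or Zygote actually use it): a generated function’s body runs on the argument types and returns the code that gets compiled for that signature.

    # Toy generated function: the body only sees the type of t, and the
    # expression it returns becomes the compiled method body for that type.
    @generated function tuple_sum(t::Tuple)
        N = length(t.parameters)        # number of elements, known from the type
        N == 0 && return :(0)
        ex = :(t[1])
        for i in 2:N
            ex = :($ex + t[$i])         # unrolled sum t[1] + t[2] + ... + t[N]
        end
        return ex
    end

    tuple_sum((1, 2.0, 3))   # 6.0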

Generated functions would probably look quite different if this had all been carefully planned out. Really, it’s some mix of accident and Julia’s Lisp-like philosophy of being as flexible and powerful as possible without a specific goal in mind. For a Lisp-minded language designer, the goal isn’t to ingeniously account for every case, but the opposite – to let people do things that are completely surprising.

That said, the core team is very much involved here, and the close collaboration between ML and language design folks is one of the great strengths we have. We don’t want to special-case AD in the compiler (just because we don’t have to) but AD is motivating better compiler support for closures and functional patterns, more powerful compiler plugins that can better exploit type information, and more general array optimisations. I’ll bet that there are a lot more surprising things that people can do with that.

26 Likes

Every AD system has its pros and cons. I am not sure relying on one is a good idea.

6 Likes

Thanks, @MikeInnes. I am preparing a small talk about Julia, and I’m trying to better understand these details, and in particular how Julia contrasts with other initiatives such as the ones mentioned in that December blog post by yourself and others. I often read in articles about CUDAnative and Zygote or Cassette that these libraries somehow “hacked” the Julia compiler. I understand that there may have been new features implemented in Julia because of these projects, but is it really fair to use the “h-word” if it is something mostly enabled by existing Julia features, namely generated functions? Also, is it really fair to call it an accident when Julia has explicitly followed, like you say, a Lisp-minded philosophy from the start?

The reason I think this matters is that if Swift for TensorFlow requires a lot of changes to the “core” of the language while Julia is inherently “hackable”, this sounds like a relevant Julia feature that is perhaps not stressed enough in the literature. What does the future hold for these two projects? Did Swift require just a single “hack”, never to be needed again? Or is this whole thing actually another case of the two-language problem?

If Julia is naturally a platform for exploring ML applications based on meta-programming, while other languages often require you to “go outside” the language, then shouldn’t we try to be clearer about who is hacking the host language and who is just exploiting the amazing possibilities offered by a language that is no stranger to meta-programming? Or is it the case that the alternatives are themselves going to become like Julia in this regard?

7 Likes

Some context would help, but I assume that these articles use the term in the sense of

“To interact with a computer in a playful and exploratory rather than goal-directed way.”

IMO the key here is that Julia’s designers and contributors usually abhor special-casing something when a general way can be provided. Many of these then find usage outside the originally envisioned purpose, which may then feed back to the language design to make it more generic.

I would not read much into this philosophically; it is just basic good design. Also, it is somewhat different from the original Lisp tradition of allowing reflection and introspection everywhere without worrying about the cost. Attention to whether something can be made fast (eventually, maybe not in the first pass) is built into Julia’s DNA.

8 Likes

In the case of Cassette specifically, it was designed for AD originally, but is general enough that it can also be used for mock testing and to implement an instrumenting debugger. Supporting Cassette required adding hooks into the compiler, and those hooks in turn allowed Zygote, which takes a somewhat different approach to AD but uses the same hooks.
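As a tiny illustration of the contextual-dispatch style (a toy sketch, not how the AD or debugger tools are actually built):

    using Cassette

    # Define a context and print every function call made while running code
    # under it -- a toy instrumenting pass in the spirit of the mock-testing
    # and debugger use cases above.
    Cassette.@context PrintCtx

    Cassette.prehook(::PrintCtx, f, args...) = println(f, " ", args)

    Cassette.overdub(PrintCtx(), x -> x^2 + 1, 3)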

Building a single approach to AD directly into a programming language feels wrong. You’re almost certainly not going to get it right the first time, and if it’s built into the language itself, then you’re stuck with the baggage of your early attempts. Swift has been pretty quick about releasing new major versions, so maybe making breaking language changes is less of a concern for them.

11 Likes