Purely functional modules

It seems to me that, in the data science field, some types of modules can be written without modifying any shared state.

What if there were a type of module, called fmodule, that not only cannot modify shared state, but also doesn’t allow its own namespace to be modified in any way, and that, outside of the standard library, can only use other fmodules?

Users of such fmodules could import their functions and types into their own namespace and extend them, but the extensions would become new definitions in the user’s namespace. The imported functions are just redirects to the imported fmodule’s code and state, while any extended functionality lives in the importing module.
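
A rough sketch of the idea in today’s Julia, with ordinary modules standing in for the hypothetical fmodule (all names here are invented for illustration):

```julia
module FM                      # imagine: fmodule FM, namespace frozen after load
    area(r) = pi * r^2
end

module Importer
    import Main.FM
    area(r) = FM.area(r)       # redirect into FM's code
    area(r, h) = r * h         # extension: a brand-new method owned by Importer
end

# FM's own method table is untouched; the extension lives only in Importer
println(Importer.area(1.0))    # delegates to FM.area
println(Importer.area(2.0, 3.0))
```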

I’m thinking such fmodules may help solve several problems:

  • coexistence of different fmodule versions in the same runtime
  • avoiding invalidations
  • caching of generated code

This is based on just my entry-level experience with Julia; I’m interested in hearing more thoughts about this.

Bumping this thread. Is there any support for developing in this kind of direction? It seems that improving performance should be a big long-term goal.

I’m not a computer scientist, so I may not be the person to answer here, but I have a feeling this would be very difficult to implement and the benefit would be minor.

For instance, you mention the stdlib, but there is no guarantee those functions are pure either.
And one issue I’ve seen before is that anything that modifies the method table is technically not pure either, as the method table itself is persistent shared state.
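
For example, even an innocent-looking method definition mutates the global method table, which is observable shared state (a sketch):

```julia
# defining a method is a side effect: it mutates the global method table
f(x::Int) = 1
n_before = length(methods(f))

f(x::String) = 2     # looks declarative, but it just mutated global state

n_after = length(methods(f))
println((n_before, n_after))   # (1, 2)
```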

There is already quite sophisticated machinery in the compiler that recognizes pure functions and optimizes on them, so this distinction would only be a version of the old @pure macro. IIRC, @pure was deprecated outside the compiler because programmers were just way worse than the compiler at recognizing truly pure functions, and mistaken annotations caused super strange bugs.
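
For illustration, the compiler already inlines and constant-folds small pure calls without any annotation; this sketch inspects the typed IR (the exact IR is version-dependent):

```julia
pure_part() = 2 + 3
caller(x) = pure_part() + x

# after optimization, the typed IR of `caller` contains the folded constant 5
# and no call to `pure_part`
ci = first(code_typed(caller, (Int,))).first
println(ci.code)
println(caller(1))   # 6
```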


Does defining new functions with fallbacks to the original functions address your desires?

In this example, I define a module with a couple of functions and constants, and then define a new module that extends the functionality of the original module in some very invasive ways; nevertheless, the methods and constants in the original module remain unchanged, thus avoiding any invalidations. Using namespacing, we can access methods both from the first module and from the second. The aliasing has zero runtime overhead due to inlining.

```julia
using Test, SnoopCompileCore

module M1
    export square, cube, THREE, FOUR
    square(x) = x^2
    cube(x) = x^3
    const THREE = 3
    const FOUR = 4
end

square_twice(x) = @noinline M1.square(M1.square(x))

# No invalidations
@test isempty(@snoopr @eval module M2
    using Main.M1
    for name in names(M1)
        if getproperty(M1, name) isa Function
            @eval $name(args...; kw...) = M1.$name(args...; kw...)
        end
    end

    square(x) = x^2 + .01
    square(x, y) = x^2, y^2
    const FOUR = "4"
end)

# This, on the other hand, does cause invalidations
@test !isempty(@snoopr M1.square(x) = x^2)

@test M1.THREE === 3
@test M2.THREE === 3
@test M1.FOUR === 4
@test M2.FOUR === "4"
@test M1.square(2) === 4
@test_throws MethodError M1.square(2, 4)
@test M2.square(2) === 4.01
@test M2.square(2, 4) === (4, 16)
```

Neat code. Yes, this is what I have in mind.

  • M1 would be declared as an fmodule.
  • M2 could be a regular module or, optionally, an fmodule.
  • extend M1:square would have the same effect as your name-importing loop.
  • M1.square(x) = x^2 would be disallowed by the compiler.
  • M1’s methods would only be generated once; the binaries could be safely saved on the local machine and reloaded.
  • Cross-method optimizations would still be available. Different levels of intermediate representation would live with M1; final binaries would have the option to live with M1’s clients.

I wasn’t sufficiently careful with my use of “purely functional”. fmodules can still modify internal state; they’re functional with respect to the loading environment.

M1 defines a potentially infinite set of methods, because x could be any type, so this can’t happen in full; but once methods are generated, they would only need to be invalidated when updating M1, any of M1’s dependencies, or Julia itself.
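
Concretely: a generic definition is a single method, while compiled specializations are created lazily, one per concrete signature actually called (a sketch):

```julia
sq(x) = x^2        # one method definition, valid for any x

sq(2)              # compiles a specialization for Int
sq(2.0)            # ...another for Float64
sq(2f0)            # ...another for Float32

println(length(methods(sq)))   # 1: still a single method definition
```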

The main advantage I see here is prohibiting invalidations (rather than simply discouraging them).

You’re right that M1 cannot generate all possible methods. Let me try to work out some examples…

One of the advantages of Julia is having duck-typed methods, but it also means we have to open up a module’s functions for extension by other modules. In the example below, to use M2 it’s necessary to extend the method table of M1. This conflicts with the idea of “locked”, or “load-time functional”, modules.

However, it should still be possible to disallow invalidations at compile time and gain the benefits of code caching. In my example I use the concept of owning a function because it lives in your namespace, or owning a method of another namespace’s function by owning the first sequence of argument types. For any method definition, only one module is legally able to define it.

This invariant should remain true even if you have different versions of the same module. The namespace can be aliased to the UUID of the specific module version.
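
The ownership rule sketched above could look something like this in code (all names are invented; “shared” is approximated here as “owned by Core or Base”):

```julia
# a shared type neither confers nor denies permission
shared(T) = parentmodule(T) in (Core, Base)

# hypothetical check: module m may define a method of f with these argument
# types if it owns f itself, or owns the first non-shared argument type
function may_define(m::Module, f, argtypes)
    parentmodule(f) === m && return true
    for T in argtypes
        shared(T) && continue
        return parentmodule(T) === m
    end
    return false
end

module ModA; mult(x, y) = x * y; end
module ModB; struct T end; end

println(may_define(ModA, ModA.mult, (Int, Int)))    # true: ModA owns mult
println(may_define(ModB, ModA.mult, (ModB.T, Int))) # true: ModB owns the first non-shared type
println(may_define(ModB, ModA.mult, (Int, Int)))    # false: owns neither function nor types
```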

```julia
using Test, SnoopCompileCore, InteractiveUtils

module M1 #fmodule
	mult(x, y) = x*y
end

module M2 #fmodule
	import Main.M1
	square(x) = M1.mult(x, x)
end

struct T
	val::Int
end

@testset "tests" begin
	#this creates the M1.mult(x::Int, y::Int) method instance
	@test M2.square(4) == 16

	#Main owns the type T, so no invalidation is possible
	@test isempty(@snoopr M1.mult(x::T, y::T) = T(x.val * y.val))

	#Main owns the function `mult`, so no invalidation is possible
	@test isempty(@snoopr mult(x::Int, y::Int) = x - y)

	@test M2.square(4) == 16

	@test M2.square(T(4)) == T(16)

	#tries to define a method for a non-owned function with non-owned types, causes invalidation
	@test !isempty(@snoopr M1.mult(x::Int, y::Int) = x - y)

	#broke the code, because we own neither the function nor the argument types
	@test M2.square(4) == 0
end

module M3
	import Main.M1, Main.M2
	struct T end

	#owns the first sequence of arguments, allowed
	M1.mult(x::M3.T, y::Main.T) = nothing

	#doesn't own the first sequence of arguments, potentially invalidating, not allowed
	M1.mult(x::Main.T, y::M3.T) = nothing
end
```

You should be able to implement some of this via metaprogramming or perhaps reflection. I highly encourage you to do so.
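
As a small reflection-based starting point (a sketch; `pirated` is an invented helper): every Method records the module that defined it, so definitions on unowned functions can at least be detected after the fact:

```julia
module Lib
    g(x) = 1
end

module Ext
    import Main.Lib
    Lib.g(x::String) = 2   # Ext owns neither the function nor the argument type
end

# methods defined outside a function's home module
pirated(f) = [m for m in methods(f) if m.module !== parentmodule(f)]
println(length(pirated(Lib.g)))   # 1
```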

Adding methods to a function (even methods whose types you own) can invalidate existing methods that call that function.

```julia
module M1 #fmodule
    g(x) = 7
    f(x) = g(x)
end

module M2
    struct T end

    using Test, SnoopCompileCore
    Main.M1.f(T())   # compile an instance of f that calls the generic g
    @test !isempty(@snoopr Main.M1.g(::T) = 8)
end
```

It seems that, due to Julia’s dynamic nature, a module can always invalidate its own allowed definitions.

```julia
using Test, SnoopCompileCore

module M1 #fmodule
    g(x) = 7
end

struct T end

#creates the MethodInstance Main.M1.g(::T) from the generic g(x) = 7
println(@snoopr Main.M1.g(T()))

#invalidates the previous instance
println(@snoopr Main.M1.g(::T) = 8)

#redefines the method, but with no invalidation because the new method has no callers yet
println(@snoopr Main.M1.g(::T) = 9)

#creates the MethodInstance Main.M1.g(::T) returning 9
println(@snoopr Main.M1.g(T()))

#invalidates the previous instance
println(@snoopr Main.M1.g(::T) = 9)
```

You can get invalidations in deeper layers, so syntax analysis alone may not be enough.
However, the ownership of method instances doesn’t change; only one module can define any given method instance. In your example, the M1.g(::T) = 7 instance is owned by M2.

Perhaps redefining a method you own should be a runtime error; I can’t see any real-world use for doing that. If M2 were interested in M1’s value of g(x), it would call it using a shared type, i.e. a primitive type or a type declared as “shared” by the owning module.

That flow chart is not strict enough to prevent method invalidations because the first non-shared type here (T) is owned by Main, so the redefinition is allowed.

I think that a simpler flow chart might work better:

Method definition → Owns function name? Yes → allowed; No → not allowed.

I.e. completely disallow defining new methods on existing unowned functions. Using the aliasing trick, we can still get the effect of extending methods but they would all be new methods rather than invalidated old methods.

OK, I made this more concrete example to show why it’s necessary to define some methods outside of the original module. The clients of a module have to populate its method table with concrete implementations; the functions themselves are just abstract interfaces. The problem is when you end up pirating types that others thought they were defining all on their own.

```julia
using SnoopCompileCore

# In this sketch, @shared is just a marker: it defines the type normally and only
# documents the intent that the type is "shared" (usable in anyone's signatures)
macro shared(s) esc(s) end
export @shared

module Computations
	using Main
	export Num, Device, add

	"""
	Must implement the following method:
		run(a::Num, b::Num, dev::Device)::Num
	"""
	abstract type Device end
	# Computations must own and define the function 'run'
	function run end
	# declaring Num as shared allows the Device argument, which determines method ownership, to come after the Num args
	@shared struct Num val::Float32 end

	add(va::Vector{Num}, vb::Vector{Num}) = [a.val + b.val for (a, b) in zip(va, vb)]
	add(va::Vector{Num}, vb::Vector{Num}, dev::Device) =
		[run(a, b, dev) for (a, b) in zip(va, vb)]
end

module User1
	using Main.Computations
	function run()
		#this is a normal use situation, which doesn't require extending Computations' method table
		@assert isempty(Main.@snoopr ret = add(Num.([1, 2]), Num.([3, 4])))
		ret
	end
end

module User2
	using Main.Computations
	struct GPU <: Device end
	# User2 must be able to extend Computations.run to specialize for its Device type
	Computations.run(a::Num, b::Num, dev::GPU) = a.val + b.val - 0.000001

	function run()
		# this doesn't cause invalidation, because User2 alone may define Computations.run with a GPU argument
		@assert isempty(Main.@snoopr ret = add(Num.([1, 2]), Num.([3, 4]), GPU()))
		ret
	end
end

module User3
	using Main.Computations
	struct Abacus <: Device end
	# User3 must be able to extend Computations.run to specialize for its Device type
	Computations.run(a::Num, b::Num, dev::Abacus) = a.val + b.val + 0.000001

	function run()
		# this doesn't cause invalidation, because User3 alone may define Computations.run with an Abacus argument
		@assert isempty(Main.@snoopr ret = add(Num.([1, 2]), Num.([3, 4]), Abacus()))
		ret
	end
end

# at this point, method instances with GPU and Abacus arguments have already been created statically
@show methods(Computations.run)

@show User1.run()
@show User2.run()
@show User3.run();
```

Does there also need to be some way of preventing User0 from defining run(::Num, ::Num, ::Any)?

Yes. Since all the arguments are shared types, only Computations can define Computations.run(::Num, ::Num, ::Any). You can think of a shared type as any type that neither confers nor denies permission; shared types are simply ignored for the purpose of identifying which module can define what.

Right. Your flow chart covers that. What about parametric types? Who owns M1.T{M2.T}? I think that this ownership system may end up bearing a striking resemblance to a possible method ambiguity resolution system: Automatically resolve most method ambiguities · Issue #47325 · JuliaLang/julia · GitHub.

This generally doesn’t work. The essential problem is that Julia’s type system and method-selection procedure are complicated enough that it would be hard to find criteria that both conform to current practice in the community and are easy and simple enough to use (if that is possible at all). Any attempt to implement such a feature amounts to solving the problems of incremental and separate compilation and of method ambiguity, neither of which is easy (unless you have found a clever way to solve Julia’s TTFX problem).

Note: invalidation and code caching don’t affect Julia’s semantics. An extremely dynamic, inefficient, but nevertheless correct implementation of Julia would be one that always looks up method tables, and it would be unaffected by these issues. Only when you try to optimize this process do you encounter all these problems. For example, if the method tables from two modules are coupled together, loading those modules requires a slow integrity check of the method tables, which leads to TTFX problems. Just like in other optimization problems, the more constraints you have, the better the result you can get.

Technically you could do this by first introducing a restricted static type system. For example, since type parameters are hard to handle, fmodules could support only a C-like type system; that is to say, interfaces would allow only concrete types. You could then gradually add more type features from other languages.
Unfortunately, there are two problems here:

  1. Introducing generics is difficult. You can’t manipulate a generic element, because you don’t know what kinds of operations are defined on it. So either you follow the practice of Rust/Java and use typeclasses/interfaces to perform type specialization, or static checking is not required, in which case no guarantee is provided and the latency problem remains unsolved…
  2. Secondly, no matter how you solve the first problem, an fmodule could only depend on other fmodules, because a static type system must be closed to check its validity, and thus it cannot refer to open (generic) functions. Given that Base/stdlib can never be fmodules due to their complexity, regular programming would become extremely hard. Now you see that such an attempt would create a limited but also isolated part of the system in the language: dynamic code could refer to static code, but not vice versa.

Another problem is that if you want to reduce invalidation, you shouldn’t support multiple versions of the same library. Invalidation ensures that eventually every module sees a consistent global method table (that’s why some outdated code caches are abolished), so that subsequent method compilation can reuse previously cached code. Multiple versions mean that every module has its own view of the method table, which means you cannot simply reuse code caches from other modules, since different extensions can lead to different results.


Perhaps make the owner of the first type parameter the owner of the parametric type? Then M2 would own M1.T{M2.T}.
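
That rule could be sketched like this (names invented): ownership of a parametric type recurses into its first type parameter:

```julia
# hypothetical ownership rule for parametric types: follow the first type
# parameter, so the owner of P1.S{P2.Q} is P2
function ptype_owner(T::Type)
    T isa DataType || return parentmodule(T)
    tparams = [p for p in T.parameters if p isa Type]
    isempty(tparams) ? parentmodule(T) : ptype_owner(first(tparams))
end

module P1
    struct S{X} end
end

module P2
    struct Q end
end

println(ptype_owner(P1.S{P2.Q}))   # P2
println(ptype_owner(P1.S{Int}))    # Core, i.e. a shared type
```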

Thanks for the interesting reply.

A feature of fmodules could be to disallow circular dependencies altogether. I’ve been forced to do that already in my own modules, because circular dependencies cause problems in my tools and environment. The result was actually a much more well-thought-out organization, in my case.

It’s working backwards: finding a set of language constraints under which there is no invalidation, so that code caching can drastically lower startup time. Supporting multiple versions is intended to reduce the effort needed to maintain libraries. Every dependency is against a specific name-version pair (instantiated from specifications according to the actual setup on the system). Version bloat can be kept down through analysis tools.

Maybe Julia was originally optimized for scientific-computing benchmarks, but the future user base will likely come from general data science use. It should aim to become as general-purpose as possible.