Making an immutable array type?

I’ve been writing an immutable collections and STM library for Julia based on similar awesome libraries in scala and clojure (work in progress, but repo is here). As I’ve been writing it, I’ve noticed that some of the operations have a performance comparible with both Julia’s native types and clojure’s persistent types (e.g., for the persistent dictionary type PDict: get, in, haskey are very fast). However, other operations, despite being implemented very similarly to clojure’s, have many orders of magnitude lower performance in Julia: for example PDict is ~200-300x slower than Julia’s mutable Dict for insert and about 100x slower than clojure’s persistent map. To be clear, it shoud be slower than Julia’s mutable Dict but not this much slower.

In asking about performance in another thread it was suggested that most of the problem is due to my use of abstract types because some of the immutable types (especially the persistent vector type PVec, on the shoulders of which much of the library is built), encapsulate length in the type. This works approximately like this:

abstract type PVec{T} <: AbstractArray{T, 1} end
abstract type PSmallVec{T} <: PVec{T} end

# Length 0 vectors:
struct PVec0{T} <: PSmallVec{T} end
Base.length(::PVec0) = 0
Base.getindex(u::PVec0, k) = throw(BoundsError(u, [k]))

# Length 1 vectors:
struct PVec1{T} <: PSmallVec{T}
    val::T
end
Base.length(::PVec1) = 1
Base.getindex(u::PVec1{T}, k::Integer) where {T} = begin
    (k == 1) || throw(BoundsError(u, [k]))
    return u.val
end

# ...
# PVec1 through PVec32 are actually generated by macro; they
# store data in NTuples and have a more complete set of methods.
# ...

# Vectors longer than 32 elements:
struct PBigVec{T} <: PVec{T}
    n::Integer  # vector length
    # by using a vector of vectors, we ensure that update operations are
    # O(log n) rather than O(n)
    data::PVec{PSmallVec{T}}
end
Base.length(u::PBigVec) = u.n
Base.getindex(u::PBigVec{T}, k::Integer) where {T} = begin
    if k < 1 || k > n
        throw(BoundsError(u, [k]))
    else
        return u.data[((k - 1) >> 5) + 1][((k - 1) & 0x1f) + 1]
    end
end

I would like to be able to avoid encapsulating the length in the type, but I don’t see a good way to do this; as far as I can tell:

  • using either fields of an immutable struct to store the vector elements or using Tuples or NTuples both require that the number of elements is known at compile time and that any vector of variable length be filtered through an abstract class (like in my implementation);
  • using an Array would break immutability.

In theory, I could use an Array to store the elements and just keep the field “private” such that users can’t (trivially) mutate the objects. This might be okay but for two things. First, I don’t know to what extent this would sacrifice compiler optimizations for immutable types (is there documentation about this somewhere?). Second, it breaks the === operator and the objectid function. (Third, I suppose, is purity of the implementation, but I’m only passingly interested in that.)

My questions:

  • Is there an alternative way to implement this I haven’t noticed that gets around the problem?
  • If I use a type built on Array instead of the PSmallVec types, what compiler optimization am I sacrificing? Are there other known side-effects of using a mutable field that will cause problems in an immutable collections library?
  • Assuming there isn’t an immutable array type I’m unaware of, is this something that could be added to Julia? (to be clear, I’m not asking for someone to go do this, I’m asking if a pull request adding this to the core would be well-received, or if there are reasons it’s not compatible with Julia’s design?)
  • Any tips on how to tell, from the output of Profile and ProfileView, that Julia is slow because of the abstract type system?

If there is no need to encode the length in the type, I would just wrap an array and not define setindex!.

Check out

Yep, this is the approach I am suggesting in the last paragraph (beginning with “In theory”) of my question. But can you answer any of my questions about how this affects optimizations or how to deal with the === and objectid issues?

Would something like this be reasonable:

struct PBigVec{T, SV<:PVec} <: PVec{T}
    n::Integer  # vector length
    # by using a vector of vectors, we ensure that update operations are
    # O(log n) rather than O(n)
    data::SV
end

That will at least get rid of the type instabilities I think.

Would something like this be reasonable:

Hmm, interesting approach! But not really: consider what happens when you append 1 element to a PBigVec{T, PVec32{T}}. You would need to append an entire PVec32 object with 31 empty elements or reallocate everything using a smaller PVec type. So I’d need an efficient way of storing empties of arbitrary types at the least. I suppose the situation might be helped, though, with something like this:

struct PBigVec{T} <: PVec{T}
    n::Integer
    data32::PVec{PVec32{T}}
    dataRest::PSmallVec{T}
end

where appends add onto the dataRest field then append a PVec32 to the data32 field once a full length-32 vector was filled.

However, it’s not clear to me that this will help at all–the type instability is still there just reduced. (Or is it? Are there documents that explain this kind of type instability problem?)

Also, not that I don’t appreciate the code suggestions (I do!), but, just to be clear, I’m not looking for a quick fix so much as an understanding of what’s going on under the hood. I’m trying to make a deep investment in Julia, not to get a library out of the way quickly. This library is a good starting point for me to learn about the internals. Documentation, discussion of the internals, or pointers to places of interest in the core c code would be more useful to me than an edit to my library.

I think what you basically want is this: https://github.com/JuliaLang/julia/pull/31630

2 Likes

That solution will not really do sinc the issue is really the abstract type of PVec, the compiler cannot be sure what code to generate every time a function on the PBigVec is used (if it uses the data32 field for example) since it could be literally anything that has PVec as it’s supertype, even things that potential users define as being a subtype of PVec. That’s really the only thing going on here afaict. I’m not entirely clear why you want to use PVec at all and not just a Vector{PVec32{T}}? Alternatively you could make PVec{T} a struct with a data field itself and an AbstractPVec as the abstract type and use that.

EDIT: Sorry I didn’t fully read your last part of the question. I don’t think the immutable thing will lead to any speedup at least not at the current state of the julia compiler. I tested this a little when I was making a SparseIntSet for DataStructures.jl. Ofcourse, to be 100% sure it is not a bad idea to run some tests yourself, this behavior changes for quite specific usecases I found, even from Julia 1.0-1.3.

1 Like

Ahh, yes, this is spot on. I haven’t read the whole thread yet, but it’s pretty clearly getting at the same issue in the original post. Thanks!

Okay, that’s basically what I suspected. The reason not to use Vector{PVec32{T}} is explained in my question:

  • it breaks === and objectid for the PBigVec type
  • it isn’t clear to me what other optimizations related to immutability I would be sacrificing by using a mutable type in an immutable type, and I would like to understand this better

I can believe (re:your edit) that the optimizations related to the pureness of the immutability of the type are small (after all, clojure’s immutable types use mutable objects internally, but they are private thus have better guarantees on actual immutability).

I do think that the === and objectid problem is non-trivial, though. When storing objects in an immutable dictionary, for example, you want a hash value and equality operator that is safe to use with mutable types (i.e., the hash and equality are the same after a mutable object changes) but that also let’s you hash identical immutable vectors into the same slot. If I understand the === and objectid operators, my original PVec implementation can assure this, but adding a mutable field will break this. Alternately, one has to write an entirely new hash/equality system (which in fairness I may have to do for persistent sets anyway).

I am not sure I understand what those issues are.

Suppose you have two sets:

s1 = Set{Any}()
s2 = Base.IdSet{Any}()

When storing mutable objects, you want to use s2 under most circumstances because the following is bad:

v = [1,2,3]
push!(s1, v)
v[1] = 10

Depending on the details of the hashing schema, v is no longer findable in s1; I’m not sure if this can lead to crashes, but it certainly can lead to unexpected behavior. As a fix, you can use an IdSet which uses objectid and === instead of hash and isequal.

Immutable vectors shouldn’t have this problem, however. If two immutable vectors are isequal they are and always will be equivalent. If you put an immutable vector in a Set, everything is fine (two different immutable vectors should hash to the same value and be isequal), but your set is not safe against mutable objects. If you put your objects in an IdSet, the mutable objects will be fine, but immutables will only hash correctly if we can rely on objectid and === to consider them equal when they actually are equal. Putting a mutable type as a field of an immutable structure breaks that.

If it’s not obvious why this matters, consider the need to have a set that stores both immutable objects and mutable objects in a way that is safe for mutables but also respects equality of immutables. I run across this all the time in clojure and expect that it will be a common problem in Julia as well.

I am not sure this is a problem in practice for Julia.

I would either not mix objects of a different type in a container (because that is a significant performance drop most of the time), or if I must, then not modify them.

Instead of anticipating problems from Julia based on your experience from Clojure, I would recommend that you just use Julia as is. You will still encounter problems and quirks of the language, but they may be totally different ones.

To be clear, (and as I’ve said before) I’m making a deep dive into Julia, not trying to write a one-off solution to some simple problem. I’m trying to write a coherent STM system, which would solve problems of multi-threading that are not currently solved “in practice for Julia”. Such systems are typically based on immutability, and would leverage the many good parts of Julia’s type system. I.e., this is the kind of STM system that has been immensely successful in languages like clojure, scala, haskell, and others. If you want to have a discussion about how this would augment the language or how immutability folds into this project, great! If there are reasons to believe that there’s a better solution to the multi-threading problem, such that this effort should be abandoned, I’m all ears! But please don’t dismiss my attempt to make the language better because it is not how you use the language currently.

Also, to be clear, I have used Julia extensively, and am not just arriving at the language with a desire to turn it into clojure. If I were learning it just now, this would hardly be my first project, and the code in my repository (hopefully!) makes my experience-level relatively clear. Julia isn’t clojure and it won’t ever be, but it has been designed in a clever-enough way that it can benefit from many of clojure’s best features, unlike many currently popular languages (e.g., python, Matlab, ruby…).

I am not dismissing anything, I just don’t see what problem you are trying to solve here (in the narrow sense). Sorry, I did not look at the repo.

Depending on the context, something like #31630 linked above may be your best option, or a dictionary type that uses different hashing for mutable and immutable objects.

Probably you are aware of this, but the solution currently getting most of the attention is

Yep! In fact, it was a lack of good multithreading options in python/matlab/mathematica that drove me to Julia in the first place (also the lack of good numerical support in clojure/scala/haskell/rust).

I’m a big fan of the multithreading approach so far, but it is very minimal and has been designed with a very structured (procedural) paradigm in mind. In particular, it doesn’t currently have a good solution to the problem of many threads needing to update the same data simultaneously. Mutexes and Conditions alone are a horrible way to deal with this as soon as there are two or more pieces of data. The “clojure” and “scala” links in my original post explain the advantages of the kind of STM system I’m working on, but TL;DR: a good solution to the very general multi-threaded synchronization problem is to make all/most sensitive data immutable (so that there is no danger in sharing it across threads) then to create a special “transactionally mutable” pointer/ref type that can be set only inside of transactionally synchronized blocks (and can only hold immutable data); synchronized transaction blocks ensure that all reads and changes to the refs inside them occur atomically, as if the system were single-threaded. This makes it much easier to write multi-threaded programs because you don’t need to know precisely how a multi-threaded function will be called, who else is accessing your data, or how the problem will be divided up across threads when you write the computation. It’s especially useful when you don’t know at the time of @spawn what data the thread you’re spawning will need to edit (like if the thread needs to read some shared state to determine what it’s going to do).

One other thing: I don’t buy the implication that correct/good Julia usage never mixes types in dictionaries/sets. The advent of data formats/structures like JSON and YAML as a common carrier of meta-data in scientific projects is a good counter-example. People are going to use nested dictionaries and vectors to store data, especially meta-data, and there isn’t a way to declare precise types for these without resorting to Any or a custom abstract type hierarchy. Whether they are efficient or not, they should work in an intuitive way. A powerful feature of persistent dictionaries/sets/vectors is that they can be safely used as keys in complex meta-data. If you’re skeptical that dictionaries would be useful as keys in the Julia ecosystem, take a look at the Brain Imaging Data Structure (BIDS) format spec, widely used across neuroimaging. Filenames in BIDS are effectively dictionaries with keys like subject-id, session, acquisition, task (“sub-xyz_ses-mri01_acq-normal_task-memory.nii.gz”). Each of the filenames is itself a key in the dictionary of the directory with the associated value being the contents of the file. A reasonable representation of this might be something like Dict{Dict{String,String}, <FileData>}. So I think that worrying about how the containers in a general library can support this kind of behavior (via objectid and === or otherwise) is a worthwhile concern, even if it is mostly to protect non-expert users.

2 Likes

You may have run across this already but there was an existing attempt at these “clojure”-esque data structures in jula called FunctionalCollections. It doesn’t seem to be in active developement though.

I would love to see clojure’s parallelism model ported to Julia, I wish you good fortune in your efforts!

2 Likes

Yes! I have a fork of this one. It’s not maintained anymore unfortunately (as far as I can tell anyway), so I opted to go with a design that was more in line with Julia’s mutable structures and less a translation of clojure (at least to my eye). Thanks!

It’s great to see that you are trying to bring a good support of persistent data structures in Julia. It would be great to have this in Julia for “fearless parallelism.” Generic threading support in Julia is pretty new so I think exploring new directions like this is highly beneficial.

Regarding the implementation, I think trying to make field types concrete is a bad approach because, IIUC, the main point of persistent data structures is structural sharing which requires some kind of indirection. So, it is better to accept type instability up to some degree. For this, I think Union field might be a good approach. Something like

const ConcreteSmallPVec{T} = Union{PVec0{T}, ... PVec32{T}}

struct PBigVec{T} <: PVec{T}
    n::Integer
    data::NTuple{32, Union{ConcreteSmallPVec{T}, PBigVec{T}}}
end

Maybe this still tries to inline everything into the struct. If that’s the case you need to use RefValue{PBigVec{T}} etc.

But I think you may still have a large dispatch penalty in a function like getindex because maybe the union is rather big for the union-splitting to kick in. To solve this, I think it would be helpful to “manually inline” dispatches, maybe using a macro equivalent to a helper function like this

function manual_dispatch(f, ::Type{T}, u) where T
    if u isa PVec0{T}
        return f(u)
    elseif u isa PVec1{T}
        return f(u)
    elseif ...
        ...
    elseif u isa PBigVec{T}
        return f(u)
    else
        return f(u)
    end
end

(I’m not sure if it can be done by this function as-is. I have tried a macro version of a similar thing and it worked well.)

This code seems to be completely useless but it helps the compiler since it can determine the concrete type of u inside each branches. This way, you can avoid dynamic dispatch. It may even inline f for small PSmallVecs.

Then, you can use it (or rather an equivalent macro) like this

Base.getindex(u::PBigVec{T}, k::Integer) where {T} = begin
    if k < 1 || k > n
        throw(BoundsError(u, [k]))
    else
        return manual_dispatch(T, u[((k - 1) >> 5) + 1]) do u
            u[((k - 1) & 0x1f) + 1]
        end
    end
end
1 Like

But this was never implied. Idiomatic usage just isolates type instability in places where it can cause little or no harm, see

https://docs.julialang.org/en/v1/manual/performance-tips/#kernel-functions-1

I agree that immutable containers make reasoning about multithreading much easier, and I think that Julia is getting there in the long run, but this may require more fundamental changes to the language than a package can implement. YMMV.

This makes sense as an implementation and is useful to know–it looks like I’m going to need to eventually test a number of possible implementations to see what works best, though. I had assumed that the implied union-type of an abstract base with a limited number of child classes would be a better clue for the compiler than a large union, but I guess there isn’t currently a way to declare that an abstract type like PSmallVec can’t be extended outside of the module, so the abstract type requires some flexibility by the compiler that isn’t necessary in the case of the union.

For now, given that immutable arrays are in the works for 2.0 or a version coming soon, I may go ahead and write these using ordinary arrays. This will break === and objectid until immutable arrays are out, but the more I think about it the more I suspect I will need to use an intermediate kind of equality to support mixing immutables and mutables anyway (actually, I already mocked up a version of this here). This will all depend on how strong of a line Julia eventually draws between immutable and mutable types, though.

1 Like