Small changes on large arrays of custom types: mutable or immutable? shallow or deepcopy?

(I previously asked this in the New to Julia category, but I think it might rather belong here.)

There are a few related questions about mutability in the following situation to which I cannot find a clear answer.

Consider a custom type

struct Single # or make mutable?
   a::Int # assumed Int, since foo! below uses it to index rawdata
   # ... and more
end

that itself appears in:

struct DataCollection
   structure::Vector{Single}
   rawdata::Vector{Array} # not same length as structure
end

We now expect minor modifications such as:

function foo!(dc::DataCollection, w::Int)
   # change dc.structure[w].a
   # change dc.rawdata[dc.structure[w].a]
end

Further, we expect applications such as:

dc1 = DataCollection() # make some random DataCollection
foo!(dc1,1) # ... and more operations
dc2 = copy(dc1) # in some way, possibly deepcopy
foo!(dc1,1) # ... but do not change dc2 in any way
foo!(dc2,1) # ... but do not change dc1 in any way

What I have so far considered are:

  1. make Single immutable, and let foo! do something like dc.structure[w] = Single(new_a,old_b,...)
    → the copy operation then only needs to be a shallow copy
  2. make Single mutable, and let foo! do something like dc.structure[w].a = new_a
    → the copy then needs to do a deepcopy of .structure
  3. make Single mutable, but still do something like dc.structure[w] = Single(new_a,old_b,...)
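
For concreteness, here is a minimal sketch of how approaches (1) and (2) behave under copy and deepcopy. The two-field types SingleI and SingleM, the field b, and the helpers set_a_replace! and set_a_mutate! are hypothetical stand-ins for illustration, not the actual Single.

```julia
# Hypothetical two-field stand-ins for Single; names and field types are assumptions.
struct SingleI            # immutable variant, as in approach (1)
    a::Int
    b::Float64
end

mutable struct SingleM    # mutable variant, as in approach (2)
    a::Int
    b::Float64
end

# Approach (1): replace the whole element with a freshly constructed instance.
function set_a_replace!(v::Vector{SingleI}, w::Int, new_a::Int)
    v[w] = SingleI(new_a, v[w].b)
    return v
end

# Approach (2): mutate one field of the existing element in place.
function set_a_mutate!(v::Vector{SingleM}, w::Int, new_a::Int)
    v[w].a = new_a
    return v
end

vi = [SingleI(1, 1.0), SingleI(2, 2.0)]
vi_copy = copy(vi)          # shallow copy of the vector
set_a_replace!(vi, 1, 10)   # vi_copy[1] is not affected

vm = [SingleM(1, 1.0), SingleM(2, 2.0)]
vm_shallow = copy(vm)       # new vector, but it refers to the same SingleM objects
vm_deep = deepcopy(vm)      # new vector with new SingleM objects
set_a_mutate!(vm, 1, 10)    # visible through vm_shallow, not through vm_deep
```

With replacement, the shallow copy stays independent because assigning to vi[1] only changes that vector's slot. With in-place mutation, only the deepcopy is isolated from the change.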

Question (a): When Single is immutable, is there even a difference between = copy(dc.structure) and = deepcopy(dc.structure)? In a short test, at least with many fields in Single, shallow copying a Vector of mutable Singles is indeed significantly faster. This might make (3) much better than (1). The documentation somewhat states that mutable types are hardcopied, though I wonder a bit why, and in any case some more magic apparently happens that I unfortunately do not fully understand.

Question (b): Assuming shallow copies of immutable types are actually cheap, approach (2) still avoids possibly very frequent reinitializations of Single objects. It is also more flexible when dealing with Single in yet other functions, as one can do things like for s in dc.structure; s.a = 1; end. Since approach (2) has some advantages, is there a usual way to keep track of these deepcopies and keep them to a minimum?

Question (c): Are there strict counterexamples showing how any of (1), (2) or (3) can cause other serious drawbacks or pitfalls (meaning something beyond generic "this is bad style" arguments)?

Question (d): Only approach (2) would in principle also allow making .structure a Tuple, in the sense that changing part of it does not necessitate copying the whole Tuple. But is this even beneficial?

Question (e): If .structure instead needs to be a Dict because some bar! may remove an entry, how does delete!(dc.structure, w) depend on the mutability of Single?

This is a fairly complicated example and question with many parts, so excuse me for just commenting on one part that I’m interested in.

AFAIK, the point of mutable types is allowing multiple references to share an instance and its change. If I don’t need a change to be shared, reassigning a reference is sufficient, and the instance can be either mutable or immutable. I lean toward isbits types, which are both immutable and contain no references, for performance benefits; mutables and references usually require allocations of data scattered in memory.

Thing is, I do often make a type mutable if I want small changes like changing a field, even if I don’t need the full instance to be shared by multiple references. That’s because there’s a convenient setfield! and dot syntax for it. There is a Setfield.jl for doing the same thing with an immutable instance, but I didn’t ever figure out how to make it work for an immutable element of an array.
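
For what it's worth, without any package, a field of an immutable array element can be "changed" by rebuilding the element and storing it back into the array. Below is a minimal sketch in plain Base Julia, not how Setfield.jl does it. It assumes the type still has its default positional constructor; Pt and with_field are hypothetical names, and this generic version is not tuned for performance.

```julia
# Hypothetical immutable type used only for illustration.
struct Pt
    a::Int
    b::Int
    c::Int
end

# Hypothetical helper: build a new instance of T equal to x except for one field.
# Assumes T can be constructed positionally from all of its fields.
function with_field(x::T, name::Symbol, value) where {T}
    vals = map(f -> f === name ? value : getfield(x, f), fieldnames(T))
    return T(vals...)
end

v = [Pt(1, 2, 3), Pt(4, 5, 6)]
v[2] = with_field(v[2], :b, 50)   # store the rebuilt element back into the array
```

The array itself is mutable, so only the slot v[2] is overwritten; the old Pt instance is never modified.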

I also don’t know if it is possible for the compiler to implement reassignment of an immutable instance as an edit of a single field’s data, or if it has to make a whole instance from scratch like how it’s written. That would depend a lot on the implementation of immutable instances, and I don’t know how to read LLVM.


Unless Single has many fields, option 1 is probably the best.

Let us say you keep DataCollection immutable. Immutable means you cannot change any of its fields, but there’s a catch:

Vector is a mutable container, allocated on the heap: What is immutable in DataCollection is the pointer to where on the heap the content of the Vector is stored. The catch is you can overwrite all or part of the data in that Vector, on the heap.

A shallow copy of a DataCollection duplicates this pointer. Overwriting dc1.structure[w] = Single(...) will affect dc1 and dc2.

Are you OK with that? Then copy. Not OK? Then deepcopy, in which case the vector on the heap is copied to a distinct vector on the heap.
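
A minimal sketch of that aliasing, assuming a two-field Single (the field b is made up for illustration). Base does not define copy for user-defined structs, so the hypothetical helper shallow stands in for a field-wise shallow copy.

```julia
# Assumed definitions; the field b is an illustration only.
struct Single
    a::Int
    b::Int
end

struct DataCollection
    structure::Vector{Single}
    rawdata::Vector{Array}
end

# Hypothetical field-wise shallow copy: the new instance shares both vectors.
shallow(dc::DataCollection) = DataCollection(dc.structure, dc.rawdata)

dc1 = DataCollection([Single(1, 2)], Array[[0.0, 0.0]])
dc2 = shallow(dc1)     # same underlying vectors as dc1
dc3 = deepcopy(dc1)    # distinct vectors on the heap

dc1.structure[1] = Single(99, 2)

shared = dc2.structure === dc1.structure   # true: one vector, two references
seen_by_dc2 = dc2.structure[1].a           # 99, the overwrite is visible
seen_by_dc3 = dc3.structure[1].a           # 1, the deep copy is unaffected
```

A middle ground, if Single is immutable and holds no references, would be to copy only the vectors that will be overwritten, for example DataCollection(copy(dc.structure), dc.rawdata), at the cost of still sharing rawdata.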