Deepcopy dangers and drawbacks for nested data structures

Ever since reading this issue, “Should `copy` be renamed `shallowcopy`?” (Issue #42796 · JuliaLang/julia · GitHub), I have avoided using deepcopy. Now I have a situation in which defining custom copy methods for my types is cumbersome: it requires methods for many nested types and causes a lot of code repetition. deepcopy, on the other hand, just works.

Can I fall into any trap by using deepcopy, in particular in the context of creating thread-local copies of data structures that must not be shared?


Tangential to your question lmiq, but just wanted to throw this in, if someone could explain for me.

I use deepcopy too and was not aware of this. Is the only drawback speed or was there more?

I don’t think I fully understood the GitHub issue.

Kind regards

Because deepcopy recurses blindly through everything it can find, I think it can’t be trusted to have any specified meaning at all, rather than posing a risk of some particular danger. That’s a somewhat ideological argument but I think it’s right.

For example, if v is a vector with x.a.b.c.d.e === v and y.a.b.c.d.e === v, then calling deepcopy(x) and deepcopy(y) separately breaks that connection, even if the semantics of copy(x) and copy(y) would require that the connection be preserved.
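A minimal sketch of that point, assuming a hypothetical one-field `Holder` type as a stand-in for the deeply nested `x.a.b.c.d.e` path:

```julia
# `Holder` is a hypothetical stand-in for the nested types discussed above.
struct Holder
    data::Vector{Int}
end

v = [1, 2, 3]
x = Holder(v)
y = Holder(v)

# Two independent calls: each one builds its own copy of `v`.
xc = deepcopy(x)
yc = deepcopy(y)

shared_before = x.data === y.data     # true: both wrap the same vector
shared_after  = xc.data === yc.data   # false: the copies no longer share it
```

The contents of `xc.data` and `yc.data` are still equal; only the identity relationship between them is lost.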

Imho implementing copy is a good idea. To make it more convenient to define structs, one can use a macro that defines ==, hash, copy, etc, specifying custom behavior for particular structs as needed.

It’s hard to say without a MWE, but I think it’s almost always a better idea to use custom constructors to create new objects rather than rely on deepcopy (even if writing those constructors is a little tricky).


It looks like this is no longer true if it ever was (I don’t know one way or the other).

Edit: I misread what you said, sorry about that. I don’t know how someone could expect anything other than the result you describe, however: copies are copies. It could be avoided with xcopy, ycopy = deepcopy((x, y)) if the connection needed to be preserved.
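A small sketch of the tuple approach, assuming a hypothetical `Wrapper` type; both objects go through one `deepcopy` call, so the shared vector is copied only once:

```julia
# `Wrapper` is a hypothetical type used only for illustration.
struct Wrapper
    data::Vector{Int}
end

shared = [1, 2, 3]
x = Wrapper(shared)
y = Wrapper(shared)

# One call over a tuple: the internal bookkeeping sees `shared` twice
# and reuses the first copy for the second occurrence.
xcopy, ycopy = deepcopy((x, y))

still_shared = xcopy.data === ycopy.data   # true: connection preserved among the copies
detached     = xcopy.data !== x.data       # true: the copies are independent of the originals
```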

Reading the tea leaves, the docstring says “Calling deepcopy on an object should generally have the same effect as serializing and then deserializing it.”.
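To make the docstring's comparison concrete, here is a rough sketch of a serialize/deserialize round trip using the `Serialization` standard library. The `roundtrip` helper is a hypothetical name, and the sketch only illustrates the analogy, not how deepcopy is implemented:

```julia
using Serialization

# Hypothetical helper: write `x` to an in-memory buffer and read it back.
function roundtrip(x)
    io = IOBuffer()
    serialize(io, x)
    seekstart(io)
    return deserialize(io)
end

original = Dict(:a => [1, 2, 3], :b => [4, 5])

via_serialization = roundtrip(original)
via_deepcopy      = deepcopy(original)

# Both results are equal in value to the original but share no arrays with it.
same_values = via_serialization == original && via_deepcopy == original
independent = via_serialization[:a] !== original[:a] &&
              via_deepcopy[:a] !== original[:a]
```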

That suggests to me that, unless unusual data structures are involved (and the docstring has some advice there as well), deepcopy can be trusted, for some value of “trusted”. An expensive operation sometimes, sure, but that’s the nature of the beast.
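The docstring's advice for unusual data structures is to specialize `deepcopy_internal(x::T, dict::IdDict)`. A hedged sketch, assuming a hypothetical `Cache` type whose `handle` field should be shared rather than duplicated (an `IOBuffer` is used here purely as a stand-in for some external resource):

```julia
# `Cache` is hypothetical; `handle` stands in for a resource that must not be duplicated.
mutable struct Cache
    handle::IOBuffer
    data::Vector{Int}
end

function Base.deepcopy_internal(c::Cache, dict::IdDict)
    # Reuse the copy if this object was already visited (keeps cycles and sharing intact).
    haskey(dict, c) && return dict[c]
    # Share the handle, but recurse into the plain data with the same `dict`.
    new = Cache(c.handle, Base.deepcopy_internal(c.data, dict))
    dict[c] = new
    return new
end

c = Cache(IOBuffer(), [1, 2, 3])
d = deepcopy(c)

handle_shared = d.handle === c.handle   # true: the resource is not duplicated
data_copied   = d.data !== c.data       # true: the vector is an independent copy
```

Whether sharing the handle is the right choice depends on the type; the point is only that the author of the type can encode that decision.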


Indeed. The specific problem is that to construct the object I have to read a file that has to be accessed sequentially. Thus, I cannot (could not) construct in parallel.

But, that said, I was already implementing an alternative where I’ll read the data from the file before the threaded loop and create an intermediate structure that will allow me to use constructors in the parallel part. I think in my case that will be easier than implementing all copy methods that would be needed.
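The approach described here might look roughly like the following sketch. All names (`RawRecord`, `Model`, `process`) are hypothetical, and the generated `raw` vector stands in for data read sequentially from the file:

```julia
using Base.Threads: @threads

# Hypothetical intermediate structure holding what was read from the file.
struct RawRecord
    values::Vector{Float64}
end

# Hypothetical working object; each task builds its own instance.
mutable struct Model
    values::Vector{Float64}
    scratch::Vector{Float64}
end

# Constructor used in the parallel part: copies the data, so nothing mutable is shared.
Model(r::RawRecord) = Model(copy(r.values), zeros(length(r.values)))

function process(raw::Vector{RawRecord})
    results = Vector{Float64}(undef, length(raw))
    @threads for i in eachindex(raw)
        m = Model(raw[i])              # private to this iteration
        m.scratch .= 2 .* m.values     # mutate only the private object
        results[i] = sum(m.scratch)    # each iteration writes its own slot
    end
    return results
end

# Stand-in for the sequential file read that happens before the threaded loop.
raw = [RawRecord(collect(1.0:4.0) .* i) for i in 1:8]
results = process(raw)
```

Because every `Model` is constructed inside the loop body, no copy method or deepcopy call is needed at all.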

I can implement a custom copy function that iterates over the fields, but I don’t know if I trust my implementation of something semi-automatic like that more than what I trust deepcopy.
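For reference, a semi-automatic copy of this kind could be sketched as below. `fieldcopy` and `copyfield` are hypothetical names, and the sketch assumes the type has the default positional constructor (no restrictive inner constructors):

```julia
# Hypothetical per-field policy: share by default, give arrays their own buffer.
copyfield(x) = x
copyfield(x::AbstractArray) = copy(x)

# Hypothetical generic copy: rebuild `x` from its fields, one level deep.
# Assumes `T` accepts all of its fields positionally.
function fieldcopy(x::T) where {T}
    return T((copyfield(getfield(x, f)) for f in fieldnames(T))...)
end

struct Particle
    pos::Vector{Float64}
    id::Int
end

p = Particle([1.0, 2.0], 7)
q = fieldcopy(p)

same_values    = q.pos == p.pos && q.id == p.id   # true
private_buffer = q.pos !== p.pos                  # true: the vector was copied
```

Nested struct fields would need their own `copyfield` methods (for example forwarding to `fieldcopy`), which is where the per-type decisions reappear.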

Like I said, you don’t have to type out Base.copy(::T) for every T manually, you can use or create a macro that defines the constructor/==/hash/copy behavior that you want, and just apply it to every struct definition where you want that behavior.

In principle, for a given type T and field a, implementing copy(x::T) requires specifying whether x.a === copy(x).a should hold. That is not true for every type and field. Only the author of T can decide whether it should be, so copy must be defined by that person.
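As an illustration of that per-field decision, consider a hypothetical `Workspace` type where one field is a read-only table that may be shared and the other is a scratch buffer that must be private:

```julia
# Hypothetical type: the author knows `table` is never mutated, while `buffer` is.
struct Workspace
    table::Vector{Float64}    # read-only lookup data, safe to share
    buffer::Vector{Float64}   # scratch space, must not be shared
end

# The author encodes the intended semantics: share `table`, copy `buffer`.
Base.copy(w::Workspace) = Workspace(w.table, copy(w.buffer))

w  = Workspace([0.1, 0.2, 0.3], zeros(3))
w2 = copy(w)

table_shared   = w2.table === w.table     # true by design
buffer_private = w2.buffer !== w.buffer   # true by design
```

deepcopy would duplicate both fields, which is harmless here but wastes memory if the table is large.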


deepcopy is fundamentally and conceptually unsafe: you can’t copy a data structure without knowing what it means. That holds especially for things containing pointers or external references.

To give a Base example:

```julia
julia> r=open("/tmp/foo", "w+")
IOStream(<file /tmp/foo>)

julia> println(r, "foo")

julia> r2=deepcopy(r)
IOStream(<file /tmp/foo>)

julia> close(r)

julia> println(r2, "foo")

julia> GC.gc()
malloc(): unsorted double linked list corrupted

[77602] signal (6.-6): Aborted
in expression starting at REPL[29]:1
```

You can look for other examples by searching JuliaLang on GitHub for finalizer and cross-referencing the results with

```julia
julia> methods(Base.deepcopy_internal)
# 12 methods for generic function "deepcopy_internal" from Base:
```

That being said, deepcopy is sometimes useful and pragmatic. Just know that you can only use it on stuff where you know what’s inside.


That’s roughly what I was getting at with “for some value of ‘trusted’”, sure. But it isn’t as though the behavior of deepcopy is especially difficult to reason about. If someone is using deepcopy on an IOStream, that probably indicates that they don’t understand what an IOStream is, which yeah, fine, handle with care and all.

Right, it’s neither difficult to understand nor unsafe to apply to most things. Those would be my definitions of “conceptually unsafe” and “fundamentally unsafe”, respectively, but YMMV.

In this case, the topic was started by someone who has learned some ‘lore’ about deepcopy which is, perhaps, rooted in overthinking things.

So, this:

Not really, no. You can’t expect good things to happen if you’re copying something which holds state for a stateful process, especially involving syscalls, so y’know. An IOStream, don’t copy those. But you know the contents of your types, if they’re plain-old-data, deepcopy will work fine. If you’re unsure, you can look at the implementation and find out what will happen for things like Refs. If you’re really unsure, then yeah don’t use it.
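A quick sketch of the plain-old-data case, using hypothetical `Inner` and `Outer` types that contain only vectors, a dictionary, and a `Ref`:

```julia
# Hypothetical nested types holding only ordinary Julia data.
struct Inner
    xs::Vector{Int}
end

struct Outer
    inner::Inner
    tags::Dict{Symbol,Vector{Int}}
    counter::Base.RefValue{Int}
end

o = Outer(Inner([1, 2]), Dict(:a => [3]), Ref(0))
c = deepcopy(o)

# Mutate every mutable part of the copy.
push!(c.inner.xs, 99)
push!(c.tags[:a], 4)
c.counter[] = 5

# The original is unaffected, including the value behind the Ref.
original_intact = o.inner.xs == [1, 2] && o.tags[:a] == [3] && o.counter[] == 0
```

For types like these, a deepcopy per thread or task gives each one a fully independent object.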
