Convention for read-only array getters

We’ve recently wondered in DataFrames what convention should be followed when a getter function returns an array which is supposed to be read-only. In the case at hand, the names function returns the names of the columns as a Vector{Symbol}. We currently make a copy to ensure the user cannot corrupt this internal field (which needs to be kept consistent with other fields), but this is wasteful. In CategoricalArrays, the levels function has taken the opposite approach, returning a vector which should not be modified by the user. Each of these approaches has drawbacks.

A possible solution would be to define a ReadOnlyArray or ImmutableArray wrapper type that implements the read-only part of the AbstractArray interface, to be used in situations where the user shouldn’t be able to modify the array without first making a copy. Since this is a common pattern, that type could live in a lightweight package (or even in the stdlib).
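To make the proposal concrete, here is a minimal sketch of what such a wrapper might look like. The type name follows the post, but the field name, the forwarded methods, and the variable names are assumptions for illustration, not an existing package API.

```julia
# Hypothetical sketch of a read-only wrapper; not an existing package API.
struct ReadOnlyArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
end

# Read-only part of the AbstractArray interface, forwarded to the parent.
Base.size(r::ReadOnlyArray) = size(r.parent)
Base.getindex(r::ReadOnlyArray, i::Int...) = r.parent[i...]
Base.IndexStyle(::Type{<:ReadOnlyArray{T,N,A}}) where {T,N,A} = IndexStyle(A)

# setindex! is deliberately not defined, so writes through the wrapper throw.

names_internal = [:a, :b, :c]     # stands in for the internal field
ro = ReadOnlyArray(names_internal)

ro[2]                              # :b
write_failed = try
    ro[1] = :x
    false
catch
    true
end                                # write_failed == true

# size is forwarded on every call, so the wrapper follows the parent.
push!(names_internal, :d)
length(ro)                         # 4
```

No copy is made when the wrapper is constructed, which is the point of the proposal. The cost is that the wrapper is a live window onto the internal array rather than a snapshot.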

How do people feel about this? Should we just trust users and directly return the internal Vector? Or is it worth using a safety read-only wrapper instead? It would be nice to have a standard solution to this across the ecosystem.

Cc: @bkamins

5 Likes

Having a wrapper type just for arrays seems like the better solution to me. Guarantees about user-facing interfaces are always better, because they minimize the possible error states that would otherwise have to be explained and dealt with every time this comes up.

The question is though, are your current users modifying the returned array and relying on that behaviour, regardless of whether they should or not? If so, you might be locked into your current solution.


Derped

Slightly OT: It feels like this would be useful for all kinds of different things though, not only for arrays or fields of structs. Having a wrapper type guaranteeing that a specific binding will not be modified through code using it also seems like a beneficial thing to have for various optimizations (though this is just my feeling here :slight_smile:). Expanding on the idea of a wrapper type, what would be needed to make general usage of it work? I guess getindex is allowed and gets passed through to the “Reference”, and setindex! could just error at runtime in a first implementation? The problem, of course, is that array operations are not the only ones that modify, so in general this should probably be handled differently. Maybe that special type and accesses through it are not allowed to be on the left side of assignments?

Looks like I’ve derped, that’s what const is for :sweat_smile:

EDIT: Thinking on this a bit more, it feels like this is a natural extension of const to fields of structs, but recursively instead of just the binding.

2 Likes

Why not use for example SVector{Symbol} with

Because its elements are still mutable.

Well, no, if the elements themselves are mutable, then any kind of immutable array wrapper will make no difference whatsoever, so that issue is essentially orthogonal. An SVector is still probably not a good choice, since its type depends on the number of elements, which is not necessarily something that can be inferred by the compiler.

1 Like

@nalimilan: thank you for posting this.

Actually, reading the discussion, the issue is a bit delicate (probably you knew that :slight_smile:), as the wrapper will resize and change elements if the parent changes (so essentially it would be a read-only, resizing view). I think this is OK (and actually useful), but it is important to keep in mind. For this reason SVector is not a good fit, because it has a fixed size.

The simplest approach would be to reuse the SubArray implementation, but disallow the setindex! operation and allow : as indices.

2 Likes

That’s a better explanation of what the goal is, I thought you were only trying to get static snapshots.

This would be good to have as the default behavior for view when using : indices (the resizing part, not the immutable part).

In the next release, views in the DataFrames.jl package (SubDataFrame and DataFrameRow) will already resize dynamically when : is used for columns, as this is a very common situation and is needed. For a normal view this is not that useful, since view(x, :) when x is a vector is probably not very common, and you cannot resize arrays with more than one dimension. But I agree that intuitively, even for a normal view, passing : should resize.
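For reference, this is how a normal view behaves today. It is a small Base-only sketch, and the variable names are illustrative.

```julia
x = [1, 2, 3]
v = view(x, :)      # the Colon is resolved against x's axes at this point

x[1] = 10
v[1]                # 10: element changes in the parent are visible

push!(x, 4)
length(x)           # 4
length(v)           # 3: the view keeps the length it was created with
```

So a plain view(x, :) tracks element changes but not resizing, which is why it cannot be used as-is for a read-only window onto a vector that may grow.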

The idea of not going for a static snapshot is that creating such a read-only view is much cheaper than creating a static snapshot. If a static snapshot could be created cheaply, that would also be OK, but it is bound to be much more expensive than a thin wrapper that simply hides the setindex! function.

You might also override the getproperty function, so that the contained object can’t be accessed directly (which is bypassing the getindex and setindex! methods for the wrapper). Then use getfield for internal access.
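A rough sketch of that suggestion follows, assuming a hypothetical SealedVector type with a data field and an internal _data helper (none of these names come from an existing package).

```julia
# Hypothetical wrapper whose field cannot be reached with dot syntax.
struct SealedVector{T,A<:AbstractVector{T}} <: AbstractVector{T}
    data::A
end

# Internal accessor: getfield bypasses the getproperty overload below.
_data(s::SealedVector) = getfield(s, :data)

Base.size(s::SealedVector) = size(_data(s))
Base.getindex(s::SealedVector, i::Int) = _data(s)[i]
Base.IndexStyle(::Type{<:SealedVector}) = IndexLinear()

# Block `s.data`, which would otherwise hand out the mutable parent.
Base.getproperty(::SealedVector, name::Symbol) =
    error("field $name is private; the wrapped array is read-only")
Base.propertynames(::SealedVector) = ()

s = SealedVector([:a, :b])
s[1]                               # :a
dot_access_failed = try
    s.data
    false
catch
    true
end                                # dot_access_failed == true
```

This only guards against accidental access. Anyone can still call getfield(s, :data) directly, so it is a convention rather than a hard guarantee.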

Yeah, something like ReadOnlyArray would be great. I’ve been wanting an equivalent of flags["WRITEABLE"] = False in NumPy.

Since it is a short collection of symbols, I would just return them as a Tuple.
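As a quick illustration of the Tuple option, here is a Base-only sketch with illustrative variable names. Note that a tuple’s length is part of its type, which is similar to the concern raised about SVector earlier in the thread, and that the tuple is a snapshot rather than a view.

```julia
names_vec = [:a, :b, :c]          # stands in for the internal Vector{Symbol}
names_tup = Tuple(names_vec)      # copies the names into an immutable tuple

typeof(names_tup)                 # Tuple{Symbol, Symbol, Symbol}, i.e. NTuple{3, Symbol}

push!(names_vec, :d)
length(names_tup)                 # 3: the tuple is a snapshot, not a view
```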

The datasets I work with in Stata have 20,000 variables pretty consistently

Point taken, but I don’t think DataFrames is ideal for this scenario in any case.

Well, most of that is bad data management that emerged because Stata makes relational data tough, since you have to store everything on disk. If we were using even R it would probably be better. Julia will be way better.

1 Like

I would think the opposite. If the number of columns is very large, other frameworks I know will have a significant precompile overhead on any operation that changes column types. DataFrame is not typed, so adding, removing, renaming, reordering, or changing the type of a column is as fast for 10 columns as for 20,000.

What alternative data structures do you have in mind for a huge number of columns (I am asking, because I might be unaware of some superior solution here)?

In case there is some structure, I would just have e.g. arrays and similar as elements of columns. R cannot do this, so people use column names like var.42.119.

The only thing that keeps it organized in Stata is meta-data like notes and variable labels. Which reminds me I need to renew my efforts and rebase my PR to get things rolling again.

2 Likes

ReadOnlyArray seems like a useful utility in its own right :+1:

Are there any actual downsides of using a wrapper like this in DataFrames?

1 Like

The only downside would be that someone might use this ReadOnlyArray forgetting that it can still change if the parent is mutated. But I think that is actually rare, and easy to avoid and explain.

Ok, I don’t have a problem with the parent being mutated. It does indicate that ReadOnlyArray is a much better name for this thing than ImmutableArray.

1 Like