Convention for read-only array getters

nalimilan · January 7, 2019, 3:56pm

We’ve recently wondered in DataFrames what convention should be followed when a getter function returns an array which is supposed to be read-only. In the case at hand, the names function returns the names of the columns as a Vector{Symbol}. We currently make a copy to ensure the user cannot corrupt this internal field (which needs to be kept consistent with other fields), but this is wasteful. In CategoricalArrays, the levels function has taken the opposite approach, returning a vector which should not be modified by the user. Each of these approaches has drawbacks.
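The trade-off between the two approaches can be sketched as follows. This is an illustration only: `Table`, `names_copy`, and `names_shared` are hypothetical names, not the DataFrames.jl or CategoricalArrays implementation.

```julia
# Hypothetical stand-in for a table type with two fields that must agree.
struct Table
    colnames::Vector{Symbol}
    lookup::Dict{Symbol,Int}   # must stay consistent with colnames
end

# Convenience constructor that builds the lookup from the names.
Table(names::Vector{Symbol}) =
    Table(copy(names), Dict(n => i for (i, n) in enumerate(names)))

names_copy(t::Table) = copy(t.colnames)   # safe, but allocates on every call
names_shared(t::Table) = t.colnames       # no allocation, but exposes internal state

t = Table([:a, :b])

names_copy(t)[1] = :z     # only the copy changes; t is untouched
names_shared(t)[1] = :z   # t.colnames now starts with :z, while lookup still maps :a
```

After the last line the two fields disagree, which is the kind of corruption the copy guards against.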

A possible solution would be to define a ReadOnlyArray or ImmutableArray wrapper type, which would implement all the read-only AbstractArray interface, to be used in situations where the user shouldn’t be able to modify the array without first making a copy. Since this is a common pattern, that type would live in a lightweight package (or even in the stdlib).
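A minimal sketch of what such a wrapper could look like, assuming it only needs to forward `size` and `getindex`. The type below is hypothetical and written for illustration; it is not an existing stdlib type.

```julia
# Hypothetical read-only wrapper: forwards reads to the parent and defines no setindex!.
struct ReadOnlyArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
    # Inner constructor so that T and N are taken from the wrapped array.
    ReadOnlyArray(parent::AbstractArray{T,N}) where {T,N} =
        new{T,N,typeof(parent)}(parent)
end

Base.size(r::ReadOnlyArray) = size(r.parent)
Base.getindex(r::ReadOnlyArray, i::Int...) = r.parent[i...]

cols = [:a, :b, :c]
ro = ReadOnlyArray(cols)

ro[2]                 # reads are forwarded to the parent
# ro[1] = :x          # would throw, since no setindex! method is defined
editable = copy(ro)   # the generic copy returns a plain Array, safe to modify
editable[1] = :x      # does not affect cols
```

Because the wrapper subtypes `AbstractArray`, generic read-only code (iteration, `length`, `in`, printing) works through the fallbacks, and the user has to call `copy` before modifying anything.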

How do people feel about this? Should we just trust users and directly return the internal Vector? Or is it worth using a safety read-only wrapper instead? It would be nice to have a standard solution to this across the ecosystem.

Cc: @bkamins

Sukera · January 7, 2019, 4:38pm

Having a wrapper type just for arrays seems like the better solution to me: guarantees about user-facing interfaces are always better, because they minimize the possible error states that would otherwise have to be explained and dealt with every time this comes up.

The question is though, are your current users modifying the returned array and relying on that behaviour, regardless of whether they should or not? If so, you might be locked into your current solution.

Derped

Slightly OT: It feels like this would be useful for all kinds of different things, not only for arrays or fields of structs. Having a wrapper type guaranteeing that a specific binding will not be modified by the code using it also seems like a beneficial thing to have for various optimizations (though this is just my feeling here). Expanding on the idea of a wrapper type, what would be needed to make general usage of it work? I guess getindex is allowed and gets passed through to the “Reference”, and setindex! could just error at runtime in a first implementation? The problem, of course, is that array operations are not the only ones that modify, so in general this should probably be handled differently. Maybe that special type, and accesses through it, are not allowed to be on the left side of assignments?

Looks like I’ve derped, that’s what const is for

EDIT: Thinking on this a bit more, it feels like this is a natural extension of const to fields of structs, but recursively instead of just the binding.

chakravala · January 7, 2019, 4:52pm

Why not use for example SVector{Symbol} with

Azamat · January 7, 2019, 4:56pm

Because its elements are still mutable.

rdeits · January 7, 2019, 5:15pm

Well, no, if the elements themselves are mutable, then any kind of immutable array wrapper will make no difference whatsoever, so that issue is essentially orthogonal. An SVector is still probably not a good choice, since its type depends on the number of elements, which is not necessarily something that can be inferred by the compiler.

bkamins · January 7, 2019, 6:06pm

@nalimilan: thank you for posting this.

Actually, reading the discussion, the issue is a bit delicate (probably you knew that), as the wrapper will resize and change elements if the parent changes (so essentially it would be a read-only and resizing view). I think it is OK (and actually useful), but important to keep this in mind. For this reason SVector is not good, because it has a fixed size.

The simplest approach would be to reuse SubArray implementation but disallow setindex! operation and allow : as indices.

chakravala · January 7, 2019, 9:54pm

That’s a better explanation of what the goal is, I thought you were only trying to get static snapshots.

This would be good to have as the default behavior for view when using : indices (the resizing part, not the immutable part).

bkamins · January 7, 2019, 10:08pm

In the next release, views in the DataFrames.jl package (SubDataFrame and DataFrameRow) will already resize dynamically when : is passed for the columns, as this is a very common situation and is needed. For a normal view this is not that useful, since view(x, :) with x a vector is probably not very common, and arrays with more than one dimension cannot be resized. But I agree that, intuitively, passing : should resize even for a normal view.

The idea of not going for a static snapshot is that creating such a read-only view will be much cheaper than creating a static snapshot (if it were possible to cheaply create a static snapshot it would be also OK, but it must be much more expensive than writing a thin wrapper that simply hides a setindex! function).
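The difference between a normal `view(x, :)` and a thin forwarding wrapper can be sketched like this. `LiveVector` is a hypothetical name used only for the illustration, and the sketch assumes a plain `Vector` parent.

```julia
# Hypothetical thin wrapper: asks the parent for its size on every call.
struct LiveVector{T} <: AbstractVector{T}
    parent::Vector{T}
end

Base.size(v::LiveVector) = size(v.parent)
Base.getindex(v::LiveVector, i::Int) = v.parent[i]

x = [1, 2, 3]
fixed = view(x, :)    # SubArray: the index range is captured at construction
live = LiveVector(x)  # wrapper: holds only a reference to x

push!(x, 4)
x[1] = 10

length(fixed)  # still 3, the view does not follow the resize
length(live)   # 4, the wrapper follows the parent
fixed[1]       # 10, element changes are visible through both
live[1]        # 10
```

Creating either object copies no elements, which is what makes this cheaper than a static snapshot.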

chakravala · January 7, 2019, 10:15pm

You might also override the getproperty function, so that the contained object can’t be accessed directly (which is bypassing the getindex and setindex! methods for the wrapper). Then use getfield for internal access.
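A rough sketch of that idea, with a hypothetical `Guarded` vector type: the dot syntax is blocked, and the methods of the wrapper itself reach the field through `getfield`.

```julia
# Hypothetical wrapper whose field cannot be reached with dot syntax.
struct Guarded{T} <: AbstractVector{T}
    parent::Vector{T}
end

# Internal access goes through getfield, which bypasses getproperty.
Base.size(g::Guarded) = size(getfield(g, :parent))
Base.getindex(g::Guarded, i::Int) = getfield(g, :parent)[i]

# Any g.name access now raises an error instead of returning the field.
Base.getproperty(::Guarded, name::Symbol) =
    error("field `$name` is internal; index the wrapper instead")

g = Guarded([:a, :b])
g[1]          # :a, forwarded through getindex
# g.parent    # would throw because of the getproperty override
```

This only discourages accidental access: `getfield(g, :parent)` remains available to anyone who calls it explicitly.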

tkf · January 8, 2019, 1:20am

Yeah, something like ReadOnlyArray would be great. I’ve been wanting an equivalent of flags["WRITEABLE"] = False in NumPy.

Tamas_Papp · January 8, 2019, 8:39am

Since it is a short collection of symbols, I would just return them as a Tuple.

pdeffebach · January 8, 2019, 12:36pm

The datasets I work with in Stata have 20,000 variables pretty consistently

Tamas_Papp · January 8, 2019, 12:41pm

Point taken, but I don’t think DataFrames is ideal for this scenario in any case.

pdeffebach · January 8, 2019, 1:01pm

Well, most of that is bad data management that emerged because Stata makes relational data tough: you have to store everything on disk. Even with R it would probably be better. Julia will be way better.

bkamins · January 8, 2019, 1:34pm

I would think the opposite. If the number of columns is very large, other frameworks I know will have a significant precompile overhead on any operation that changes column types. A DataFrame is not typed, so adding, removing, retyping, renaming, or reordering a column is as fast for 10 columns as for 20,000.

What alternative data structures do you have in mind for a huge number of columns (I am asking, because I might be unaware of some superior solution here)?

Tamas_Papp · January 8, 2019, 1:38pm

In case there is some structure, I would just have e.g. arrays and similar as elements of columns. R cannot do this, so people use column names like var.42.119.

pdeffebach · January 8, 2019, 1:58pm

The only thing that keeps it organized in Stata is meta-data like notes and variable labels. Which reminds me I need to renew my efforts and rebase my PR to get things rolling again.

c42f · January 8, 2019, 11:41pm

ReadOnlyArray seems like a useful utility in its own right

Are there any actual downsides of using a wrapper like this in DataFrames?

bkamins · January 9, 2019, 6:52am

The only downside would be that someone might use this ReadOnlyArray and forget that it can still change if the parent is mutated. But I think this is actually rare, and easy to avoid and explain.

c42f · January 9, 2019, 6:54am

Ok, I don’t have a problem with the parent being mutated. It does indicate that ReadOnlyArray is a much better name for this thing than ImmutableArray.
