Writing for generality

I re-discovered this old thread, which I started well more than a year ago, when I was a complete newbie:

I wanted to write a findall-like function that partitions the indices into two sets according to a criterion:

function partition(vals, thresh)
  idx_l = Vector{keytype(vals)}()
  idx_h = Vector{keytype(vals)}()
  for i in eachindex(vals)
    (vals[i] < thresh) ? push!(idx_l, i) : push!(idx_h, i)
  end
  (idx_l, idx_h)
end

Note the keytype(vals) bit. People argued, why dwell on the type of the index? Why not just use Int[]. Fair enough.

But, I’ve recently realized that the above function works for a Dict, as is! If you used Int, it wouldn’t work for a Dict.

For this particular case, the point became moot because an index-free solution was found:

function partition(v, thresh)
  idx_l = findall(<(thresh), v)
  idx_h = setdiff(eachindex(v), idx_l)
  (idx_l, idx_h)
end

which also works for a Dict. (I’m not comfortable with this solution, though, because the types of idx_l and idx_h are different.)

In general, you should assume minimum things on your variables as long as your code remains simple. For example, you shouldn’t assume that your index is one or another integer type unless doing so is necessary or it significantly simplifies your code.

Another example is Float64. I usually write Float64 without thinking much. But, sometimes I ask myself, why am I writing Float64 when any real number will do? Should I be writing Real instead?

But, this is hard to learn for a newcomer. There are no tutorials or textbooks to tell you something like the above. The community doesn’t seem to have consensus on the “best practice”.

2 Likes

It works for much more than just Vector and Dict - it works for anything that has keytype, getindex and eachindex!

In a sense, the “type” of vals is a type that has methods for each of the above. Writing for generality means asking the question “What is truly required of this function?”.

It’s not actually index free, because setdiff still operates on the indices of v, just like your single loop would have done. The only difference between the two is that your first version only loops over v (and its indices) once, while your second version necessarily loops twice (in the worst case of all elements in v being less than your treshold).

I think we don’t have a “best practice” for this because we can’t (acurately) describe what a function actually requires in a type. We can’t currently annotate v with a type describing the exact methods required/expected by partition and expect it to behave/work for types that don’t subtype that (abstract) type.

To me, generic programming is thinking about behaviours and abstract properties you can assume to hold. I.e., in your example the generic function partition should be such that

all(getindex.(Ref(v), il) .< thresh) && all(getindex.(Ref(v), ih) .>= thresh) && union(Set(getindex.(Ref(v), il)), getindex.(Ref(v), ih)) == Set(values(v))

whenever il, ih = partition(v, thresh).
As long as these properties hold, I don’t care about the actual concrete types representing v or il and ih. All that matters is that the generic function can be used in the intended way, i.e., as specified by its properties. Another context to thing about is linear algebra, i.e., what do you need to know about the types in order to use a matrix inverse inv(A)? Instead, think about its properties …