Functions with static table-like inputs and outputs

I often write functions that transform one array of structures (sorted by one of its fields) into another. Calling them usually means first converting my data into yet another array of structures just to match the argument type, and I would like to minimize that boilerplate.

In a previous version I passed each column as a separate argument, but then nothing guaranteed that the columns had the same length. Here is the template for such a function that I eventually came up with.

using StructArrays

"""
Find intervals of consecutive target symbols in a time-sorted vector of `(time, sym)` rows.
"""
function find_intervals(
    inp::AbstractVector{@NamedTuple{time::Int, sym::Symbol}}; 
    target_syms::Vector{Symbol} = [:A, :B], 
    break_syms::Vector{Symbol} = [:C, :D],
)
    # or StructVector{@NamedTuple{...}}[]
    out = Vector{@NamedTuple{tbeg::Int, tend::Int, count::Int, type::Symbol}}()

    count2sym = count->count < 2 ? :short : :long
    is_series = false
    tbeg = 1
    tend = 1
    count = 0
    for x in inp
        if x.sym in target_syms
            if !is_series
                tbeg = x.time
                is_series = true
                count = 0
            end
            count += 1
        elseif x.sym in break_syms
            if is_series
                push!(out, (; tbeg, tend, count, type = count2sym(count)))
                is_series = false
            end
        else
            # other symbols don't break series and are not counted
        end
        tend = x.time
    end
    if is_series
        push!(out, (; tbeg, tend, count, type = count2sym(count)))
        is_series = false
    end

    return out
end

times = [10,20,30,40,50,60,70,80]
syms = [:A,:B,:C,:D,:A,:B,:C,:D]

# Problem 1: my data comes either as a column table or as a row table, and both should work:
cols = (time = times, sym = syms)
rows = [(time = t, sym = s) for (t, s) in zip(times, syms)]
# rows - can pass directly:
out = find_intervals(rows, target_syms=[:A, :C], break_syms=[:B])
# cols - wrap into a struct vector:
sv = StructVector(cols)
out = find_intervals(sv, target_syms=[:A, :C], break_syms=[:B])

# Problem 2: the column names and the number of columns don't match the function signature:
cols_ = (T = times, T2 = 2 .* times, S = syms)
rows_ = [(T = t, T2 = 2t, S = s) for (t, s) in zip(times, syms)]
# rows - have to copy into new rows with renamed fields (-)
rows = map(rows_) do r
    selected = NamedTuple{(:T, :S)}(r)
    NamedTuple{(:time, :sym)}(values(selected))
end
out = find_intervals(rows, target_syms=[:A, :C], break_syms=[:B])
# cols - should select and rename columns, wrap into another struct vector (+no copy)
cols = StructVector((time = cols_.T, sym = cols_.S))
out = find_intervals(cols, target_syms=[:A, :C], break_syms=[:B])

What confuses me is that the field renaming seems redundant here. Putting a NamedTuple type in the signature fixes both the names and the order of the fields, whereas with ordinary positional arguments the caller's names don't matter: the values are simply bound to the function's local argument names. Is there a way to declare the local column names inside the function and check only the column types?

Indeed, StructArrays are great for tables, among other use cases – much better than juggling separate arrays.

Why do you want to put column names into the function signature at all? This not only constrains the field order, but also prevents adding new columns later or using structs other than named tuples.

For mismatched names, there are two solutions:

  • rename them similar to what you do, but simpler:
(time=r.T, sym=r.S)
# or
NamedTuple{(:time, :sym)}(values(r[(:T, :S)]))
  • pass small accessor functions defining each element:
using Accessors

function f(ftime, tbl)
    ...
    for r in tbl
        ftime(r)  # instead of r.time
        ...
        push!(result, @set ftime(r) = newtime)  # can set ftime(r) if using Accessors
    end
    ...
end

f(x->x.time, tbl)
f(x->x.T, tbl)
f(@optic(_.T), tbl)

The latter is more flexible: you can do things like f(x->x.T - 2000, tbl) if some transformation is needed.
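
To make the accessor idea concrete, here is a minimal sketch of the interval search from the first post rewritten to take accessor functions. The name find_intervals_by, its argument order, and the reduced output row (without the type field) are assumptions made for illustration, not an established API.

```julia
# Hypothetical accessor-based variant: the caller says how to read the
# time and the symbol from a row, so field names never have to match.
function find_intervals_by(ftime, fsym, tbl;
                           target_syms = [:A, :B], break_syms = [:C, :D])
    out = Vector{@NamedTuple{tbeg::Int, tend::Int, count::Int}}()
    is_series = false
    tbeg = tend = count = 0
    for r in tbl
        s = fsym(r)
        if s in target_syms
            if !is_series
                # a new series starts at this row
                tbeg, count, is_series = ftime(r), 0, true
            end
            count += 1
        elseif s in break_syms && is_series
            # tend still holds the time of the previous row
            push!(out, (; tbeg, tend, count))
            is_series = false
        end
        tend = ftime(r)
    end
    is_series && push!(out, (; tbeg, tend, count))
    return out
end

rows_ = [(T = t, T2 = 2t, S = s)
         for (t, s) in zip(10:10:80, [:A, :B, :C, :D, :A, :B, :C, :D])]

# No renaming or copying of rows_ is needed:
res = find_intervals_by(r -> r.T, r -> r.S, rows_;
                        target_syms = [:A, :C], break_syms = [:B])
# An accessor can also transform the value on the fly:
res2 = find_intervals_by(r -> r.T2 ÷ 2, r -> r.S, rows_;
                         target_syms = [:A, :C], break_syms = [:B])
```

The same call should also work on a StructVector or any other iterable of rows, since the function only touches rows through the two accessors.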

Minor: note that you can use similar(inp, neweltype) for generality.
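
A small sketch of what that could look like, using only Base. The OutRow alias is a name introduced here for illustration. Note that the two-argument similar(inp, neweltype) allocates an uninitialized array of the same length as inp, so for a push!-style loop the three-argument form with length 0 is the one to use.

```julia
rows = [(time = 10, sym = :A), (time = 20, sym = :B)]

# Hypothetical alias for the output row type
OutRow = @NamedTuple{tbeg::Int, tend::Int}

# Empty output container whose type is derived from the input container
out = similar(rows, OutRow, 0)
push!(out, (tbeg = 10, tend = 20))
```

For a plain Vector input this gives a Vector{OutRow}. With a StructVector input the same call is expected to hand back a StructVector, though that depends on the StructArrays version and is worth checking.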

I don’t, but the names are already in the signature because of the named tuples.

I mean, why do you constrain the signature to namedtuples at all? Just use inp::AbstractVector — this will work for any array of structs with corresponding names, and for many table types without any changes.
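
As a rough illustration of that point, here is a sketch with a loosely typed argument. The Event struct and the count_targets function are hypothetical names used only for this example.

```julia
# Hypothetical row type with an extra field the function never looks at
struct Event
    time::Int
    sym::Symbol
    note::String
end

# Only the container is constrained; rows just need a `sym` property
count_targets(inp::AbstractVector; target_syms = [:A, :B]) =
    count(x -> x.sym in target_syms, inp)

nt_rows = [(time = 10, sym = :A), (time = 20, sym = :C)]
st_rows = [Event(10, :A, "x"), Event(20, :B, "y")]

n_nt = count_targets(nt_rows)   # rows are named tuples
n_st = count_targets(st_rows)   # rows are structs with an extra column
```

Both calls go through the same method, and adding a column later does not require touching the signature.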

How do I know then whether I am calling it with an incompatible argument? Which fields must be present in that AbstractArray, what are their types, and so on? I don’t want to mismatch fields on the day I have forgotten about this function and get runtime errors.

Well, then this is possible as you do it, with the issues you encountered. That’s why this approach of overly restrictive signatures is not common in Julia.

The question to ask is whether you need these types for dispatch, i.e. for selecting between methods of a function.
If yes, then putting them into signature is fine, the right solution.
If not, you can just put a check at the beginning of the function, something like

function find_intervals(inp)
    @assert (:time, :sym) ⊆ fieldnames(eltype(inp))
end

You get the error reported when the function is run anyway, so there’s little difference in user experience.
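
If the field types matter as well as the names, the check can be extended along these lines. This is only a sketch: check_columns and its required keyword are hypothetical names, and it assumes the element type of the input is a concrete struct or NamedTuple type. It throws an ArgumentError rather than using @assert, since the Julia documentation notes that assertions may be disabled at some optimization levels.

```julia
# Hypothetical helper: verify that the element type has the required
# fields and that each field type is a subtype of the expected one.
function check_columns(inp; required = (time = Integer, sym = Symbol))
    T = eltype(inp)
    for (name, S) in pairs(required)
        hasfield(T, name) ||
            throw(ArgumentError("eltype $T has no field :$name"))
        fieldtype(T, name) <: S ||
            throw(ArgumentError("field :$name is $(fieldtype(T, name)), expected <: $S"))
    end
    return nothing
end

check_columns([(time = 1, sym = :A)])             # passes silently
check_columns([(time = 1, sym = :A, extra = 2)])  # extra columns are fine
```

Calling check_columns(inp) as the first line of find_intervals would then reject mismatched names or types with a readable message before any work is done.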
