Count cumulative number of unique elements

I am trying to produce a species discovery curve, which is basically the number of unique species observed over time. So, at time t, the value is length(unique(_.species_name)) for all observations up to time t - this in itself is super easy to do, but I am looking for a way to do it using Query.jl.

Is there a mechanism to iterate? The issue I had with @groupby is that it only gives access to a single value when grouping by date, and I’m looking for a way to group by “all values lower than the current one”.

I realize that this isn’t necessarily a problem to solve with Query.jl but I’m trying “for fun”.

I don’t think there is anything built in for this. But here is a function you can use:

julia> function get_unique_cumulative(val)
           # assumes val is already sorted by time
           s = Set{eltype(val)}()                              # values seen so far
           unique_across_time = Vector{Int}(undef, length(val))
           t = 0                                               # running count of distinct values
           for i in 1:length(val)
               v = val[i]
               if v in s
                   unique_across_time[i] = t
               else
                   t += 1
                   unique_across_time[i] = t
                   push!(s, v)
               end
           end
           return unique_across_time
       end
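
To sanity-check the idea on a small input, here is a minimal self-contained sketch. The name `cumulative_unique_count` and the species names are made up for illustration; it is a compact variant of the function above (it records the size of the set after each observation), compared against the brute-force definition from the question.

```julia
# Compact variant: push every value into a Set and record the Set's size.
# Assumes `vals` is already sorted by time.
function cumulative_unique_count(vals)
    seen = Set{eltype(vals)}()
    counts = Vector{Int}(undef, length(vals))
    for (i, v) in enumerate(vals)
        push!(seen, v)            # no effect if v was already seen
        counts[i] = length(seen)
    end
    return counts
end

species = ["wolf", "fox", "wolf", "lynx", "fox"]   # hypothetical observations
counts = cumulative_unique_count(species)          # [1, 2, 2, 3, 3]

# Brute force, straight from the definition: distinct values among the first i
brute = [length(unique(species[1:i])) for i in eachindex(species)]
```

The brute-force version is quadratic in the number of observations, while the set-based loop does a single pass.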

With plain DataFrames.jl you can then do:

julia> using DataFrames

julia> df = DataFrame(time = 1:25, val = rand(1:25, 25));

julia> transform(df, "val" => get_unique_cumulative => "cumulative_unique_species")
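
If several observations can share a date, the per-row counts can be collapsed to one value per date. The following is only a sketch under that assumption: the table `obs`, its column names, and the species are invented for illustration, and the running count is computed inline with a `Set` rather than with the function above.

```julia
using DataFrames

# Hypothetical observation table: several sightings can share a date.
obs = DataFrame(date = [1, 1, 2, 3, 3, 3],
                species_name = ["wolf", "fox", "wolf", "lynx", "fox", "hare"])

sort!(obs, :date)   # the running count assumes rows are in time order

# Running number of distinct species, one value per row.
seen = Set{String}()
obs.n_species = [length(push!(seen, s)) for s in obs.species_name]

# One point per date: the count after the last observation on that date.
curve = combine(groupby(obs, :date), :n_species => last => :n_species)
```

Here `curve` has one row per date, with `n_species` equal to 2, 2 and 4 for dates 1, 2 and 3.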