Lighter alternatives to a dataframe when you just want to run a few queries on a vector of data structs

Lately, I have been running into a lot of tasks where I generate a big long vector of instances of some kind of data struct:

julia> struct Foo; a::Int; b::Int; end

julia> foos = [Foo(rand(1:4), rand(1:4)) for _ in 1:10000]
10000-element Vector{Foo}:
 Foo(3, 4)
 Foo(1, 4)
 ...

Then I do some subsequent processing that is, essentially, a bunch of database select and join operations on foos. Something like:

julia> for foo in foos, bar in foos
           if foo.a + bar.b == 3
               # do stuff
           end
       end

(even this empty block takes like 15 seconds to run on this ancient netbook).

I realize that this, as written, is an O(n^2) operation, and I am committing the same error made by the authors of that one GAMS blog post that was widely criticized on these forums.

The solution in the JuMP response to the GAMS post is to use DataFrames.jl, which provides the select and join operations needed to run the “query” efficiently.

But something about importing an entire dataframes library seems like overkill to me. DataFrames.jl is more than selects and joins; it also provides vectorized operations so you can do an interactive split-apply-combine workflow, plus lots of other utilities for imputing NaNs, reshaping, etc. That’s a lot of heavy machinery that is not needed for simple problems such as the GAMS example, where we only need to run, like, one query.

Is there any kind of data structure that offers efficient select and join operations without the overhead of a full-blown dataframe?

Putting this in off-topic because I don’t necessarily care about Julia-specific answers (actually, I thought of this question today while contemplating whether I needed to pull a pandas dependency into a Python project that instantiates a big list of dataclasses.dataclass instances).

Possibilities to explore

Sometimes the long way around is the shortest way there. So, despite the above, DataFrames.jl may still be the best approach.

One nice “feature” of Julia is that one doesn’t even need special custom data structures for efficient and convenient tabular operations (like select, group, join, etc.).
No need to create and learn new types: your foos array works for tabular operations without any changes!

Julia Base provides map and filter; packages provide more: SplitApplyCombine.jl (a little of everything), FlexiJoins.jl (different joins), etc.
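To illustrate with the query from the original post: here is a minimal Base-only sketch (the names are hypothetical, not from any package) that replaces the nested loop with a Dict used as a hash index, so only the matching pairs are visited:

# Build a hash index over foo.a, then probe it: expected O(n) work overall
# instead of the O(n^2) nested loop.
struct Foo; a::Int; b::Int; end
foos = [Foo(rand(1:4), rand(1:4)) for _ in 1:10_000]

index = Dict{Int, Vector{Int}}()
for (i, foo) in pairs(foos)
    push!(get!(index, foo.a, Int[]), i)  # row indices grouped by the value of a
end

# Visit exactly the (foo, bar) pairs with foo.a + bar.b == 3:
for bar in foos
    for i in get(index, 3 - bar.b, Int[])
        foo = foos[i]
        # do stuff
    end
end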

Recently, I also registered the DataManipulation.jl package, which aims to be an all-in-one solution for data manipulation with native Julia structures (like arrays). It both aggregates existing packages with a compatible API and philosophy for convenience, and defines some functions itself.


What if you write a function with that?
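(Presumably meaning something like this sketch: wrap the loop in a function, so the work is compiled instead of running at global scope.)

function dostuff(foos)
    for foo in foos, bar in foos
        if foo.a + bar.b == 3
            # do stuff
        end
    end
end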


OK, I spent some time constructing a more representative example and writing a little benchmark. I think the issue is that the problem I’m trying to solve is inherently O(n^2), in that (in the worst case) we really do need to check every foo against every other foo.

In the contrived example below, the end goal is essentially just to calculate

sum(
    0.5 * (foo.a + bar.b)
    for foo in foos for bar in foos
    if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
)

Results

  • In the benchmark results, the fastest way to do this was essentially to do exactly what I have written above, checking the whole if for every (foo, bar) pair: query_forloop(), with median time 3.7 microseconds.
  • I tried to be clever by pre-filtering for foos where foo.a == 1 and bars where bar.b > 2 before forming the product, but this was slower: query_itertools(), 6.4 microseconds.
  • A similar approach with a TypedTable was slower: query_typedtable(), 47 microseconds.
  • As was a DataFrame (query_dataframe(): 590 microseconds), although I suspect there are ways to reduce the allocations in this function, as it has been a long time since I did a lot of DataFrames.jl work.

Remaining questions

  • Is my benchmark accurate? I always seem to fall into one of the pitfalls with these.
  • Can the table and dataframe approaches be improved?

Benchmark details

main.jl:

using BenchmarkTools
using TypedTables
using DataFrames
using Pkg


"Data struct."
struct Foo
    a::Int
    b::Int
    c::Int

    "Random, representative instance of this data struct."
    function Foo()
        return new(rand(1:5), rand(1:5), rand(1:5))
    end
end


"""
Typical calculation I want to run on a `Vector{Foo}` using
a for loop. Julia style?
"""
function query_forloop(foos::Vector{Foo})
    res = 0.0
    for foo in foos, bar in foos
        if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
            res += 0.5 * (foo.a + bar.b)
        end
    end
    return res
end

"""
Typical calculation I want to run on a `Vector{Foo}` using
various built in iteration helpers. Python style?
"""
function query_itertools(foos::Vector{Foo})
    return sum(
        Iterators.filter(
            Iterators.product(
                filter(foo -> foo.a == 1, foos),
                filter(bar -> bar.b > 2, foos),
            ),
        ) do (foo, bar)
            foo.a + bar.c == 3
        end,
        init = 0.0,
    ) do (foo, bar)
        0.5 * (foo.a + bar.b)
    end
end


"""
Typical calculation I want to run on a `Vector{Foo}` using
`TypedTables.jl`.
"""
function query_typedtable(foos::Vector{Foo})
    # I realize this NamedTuple conversion adds additional overhead, but 
    # changing the representation of Foo is not within the parameters of the
    # question, so we have to count this time.
    # This is pretty hard to read, huh?
    table = Table(
        (
            foo_a = foo.a,
            foo_b = foo.b,
            foo_c = foo.c,
            bar_a = bar.a,
            bar_b = bar.b,
            bar_c = bar.c,
        ) for (foo, bar) in Iterators.filter(
            Iterators.product(
                filter(foo -> foo.a == 1, foos),
                filter(bar -> bar.b > 2, foos),
            ),
        ) do (foo, bar)
            foo.a + bar.c == 3
        end
    )

    return sum(table, init = 0.0) do row
        0.5 * (row.foo_a + row.bar_b)
    end
end


"""
Typical calculation I want to run on a `Vector{Foo}` using
`DataFrames.jl`.
"""
function query_dataframe(foos::Vector{Foo})
    df = DataFrame(foos)
    foo_subset = subset(df, :a => a -> a .== 1)
    bar_subset = subset(df, :b => b -> b .> 2)
    xjoin = crossjoin(foo_subset, bar_subset, renamecols = "_foo" => "_bar")

    return sum(
        # note: not sure if falling back to Julia's filter is the right 
        # move here
        filter(eachrow(xjoin)) do row
            row.a_foo + row.c_bar == 3
        end,
        init = 0.0,
    ) do row
        0.5 * (row.a_foo + row.b_bar)
    end
end


const SIZE::Int = 100

function main()
    @show VERSION

    foos = [Foo() for _ = 1:SIZE]

    @assert (
        query_forloop(foos) ==
        query_itertools(foos) ==
        query_typedtable(foos) ==
        query_dataframe(foos)
    )

    for fun in (query_forloop, query_itertools, query_typedtable, query_dataframe)
        println(nameof(fun))
        display(@benchmark $fun($foos))
    end
end

main()

Then

julia --project main.jl | tee results.txt

yields results.txt:

VERSION = v"1.9.2"
query_forloop
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.776 μs …  10.196 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.797 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.104 μs ± 669.340 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.
query_itertools
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.708 μs … 491.093 μs  ┊ GC (min … max): 0.00% … 95.08%
 Time  (median):     6.401 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.051 μs ±  10.420 μs  ┊ GC (mean ± σ):  3.42% ±  2.30%

 Memory estimate: 5.00 KiB, allocs estimate: 2.
query_typedtable
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  42.745 μs …  2.761 ms  ┊ GC (min … max): 0.00% … 87.32%
 Time  (median):     47.035 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.555 μs ± 68.210 μs  ┊ GC (mean ± σ):  3.96% ±  3.26%

 Memory estimate: 63.66 KiB, allocs estimate: 23.
query_dataframe
BenchmarkTools.Trial: 7794 samples with 1 evaluation.
 Range (min … max):  544.023 μs …   2.824 ms  ┊ GC (min … max): 0.00% … 65.21%
 Time  (median):     590.213 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   634.502 μs ± 168.265 μs  ┊ GC (mean ± σ):  1.21% ±  4.71%

 Memory estimate: 223.55 KiB, allocs estimate: 7216.

It gets compiled away, of course :slight_smile: But my original question was about order of complexity, specifically whether there’s a way to reduce the runtime from O(n^2) if you happen to know that the number of (foo, bar) pairs that pass the condition is small (in the problem I was working on this week, proportional to n rather than n^2). (And, of course, the # do stuff part is nonempty.)

But since I got such great suggestions for Julia packages to try, I wrote a Julia benchmark anyhow, so the focus of my question has shifted a bit.

Tried to be clever but the results are middling:

"""
Branchless version of `query_forloop()`.
"""
function query_branchless(foos::Vector{Foo})
    return sum(
        ((foo.a == 1) * (bar.b > 2) * (foo.a + bar.c == 3) * 0.5 * (foo.a + bar.b)
         for foo in foos for bar in foos),
        init = 0.0,
    )
end
query_branchless
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.652 μs …  1.843 ms  ┊ GC (min … max): 0.00% … 95.73%
 Time  (median):     16.401 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.502 μs ± 18.567 μs  ┊ GC (mean ± σ):  1.01% ±  0.96%

You could use

for i in 1:length(foos), j in i+1:length(foos)

to avoid running over repeated pairs. Anything else would be a different computation than your first example, and more specific details of your actual problem would be needed. I don’t think any package can avoid this O(n^2) loop in principle.

But the code does nothing, so the timings are meaningless, no?

The code in the benchmark does something (click the arrow next to “Details” to expand).

It’s not clear to me where bar.c is defined.
Whatever number it is, if foo.a == 1, then instead of checking foo.a + bar.c == 3, isn’t it easier to check that bar.c == 2?

So I think this problem can actually be solved in O(N log(N)) because you can sort the data and use the sorting invariant to optimize the iteration.

Assuming I’ve gotten things right, I think the following code shows how to use a 3SUM like approach to do this sum faster:

julia> x = [(a = rand(1:10), b = rand(1:10), c = rand(1:10)) for i in 1:100_000];

julia> function faster_sum(x)
           is = sortperm(x, lt = (t1, t2) -> t1.a < t2.a)
           js = sortperm(x, lt = (t1, t2) -> t1.c < t2.c)

           filter!(i -> x[i].a == 1, is)
           filter!(j -> x[j].b > 2, js)

           m = 1
           n = length(js)

           s = 0.0

           while m <= n
               if m > length(is)
                   return s
               end

               while n > 0 && x[is[m]].a + x[js[n]].c > 3
                   n -= 1
               end

               if n == 0
                   return s
               end

               min_n = 0
               if x[is[m]].a + x[js[n]].c == 3
                   max_n = n
                   while n > 0 && x[is[m]].a + x[js[n]].c == 3
                       s += 0.5 * (x[is[m]].a + x[js[n]].b)
                       n -= 1
                   end
                   min_n = n
                   n = max_n
                   m += 1
               elseif x[is[m]].a + x[js[n]].c < 3
                   m += 1
                   n = min_n - 1
               end
           end

           s
       end
faster_sum (generic function with 1 method)

julia> faster_sum(x)
2.9932832e8

julia> sum(
           0.5 * (foo.a + bar.b)
           for foo in x for bar in x
           if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
       )
2.9932832e8

An alternative: probably not faster, but from a different point of view.

function falls(foos)
    A=findall(f->f.a==1, foos)
    C=findall(f->f.c==2, foos)
    B=findall(f->f.b>2, foos)
    BC=intersect(B,C)
    (sum(f->f.b,foos[BC])+length(BC))*length(A)*0.5
end

The faster form:

function falls(foos)
    A=findall(f->f.a==1, foos)
    BC=findall(f->f.c==2&&f.b>2, foos)
    (sum(f->f.b,foos[BC])+length(BC))*length(A)*0.5
end

In the slightly more fleshed-out example I gave in my follow-up comment, the struct has three fields: a, b, and c.

Yes, in this case you can just math out that the condition is equivalent to bar.c == 2, but this is a contrived example for the purposes of illustration. :slight_smile: The point is that we have a boolean filter condition that depends on the specific foo, bar pair.

In my humble opinion, there is not much sense in discussing these alternatives without the exact use case. Being able, or not, to define an O(n) algorithm for a pairwise comparison like that depends on the nature of the comparison and the properties of the elements. A whole class of problems where such things can be done are those tackled by nearest-neighbor algorithms, but those depend on the definition of some distance metric.
For other problems there can’t be such an algorithm. So, the closer the MWE gets to the true problem, the likelier it is that we actually find an optimal solution.


Had to study this for a bit, but I think I get it (gratuitously changing symbols in an effort to cement my own understanding):

  1. Make two copies of the original list, call them baz and bar.
  2. Filter baz and bar in place, so that baz only contains those that satisfy the .a == 1 condition, and bar only contains those that satisfy the .b > 2 condition.
  3. Now sort baz and bar in place, the former by .a, and the latter by .c. (The order of steps 2 and 3 doesn’t matter.)

Everything up until this point was O(n \log n).

  4. Start one pointer i at the beginning of baz and another pointer j at the end of bar.
  5. Let res = 0.
  6. While both pointers are in range:
    a. If baz[i].a + bar[j].c > 3, then decrement j (which weakly decreases the value of baz[i].a + bar[j].c at the next iteration).
    b. If < 3, then increment i (which weakly increases the value at the next iteration).
    c. If == 3, then baz[i] and bar[j] satisfy the condition, so increment res by 0.5 * (baz[i].a + bar[j].b). Now increment i or decrement j (doesn’t matter which) and proceed.

This is all O(n), because every iteration either increments i or decrements j, so the walk makes at most length(baz) + length(bar) steps.
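For concreteness, a minimal sketch of that walk (the name twopointer is hypothetical; baz and bar are assumed to be already filtered and sorted as in steps 2 and 3, and duplicates are ignored):

function twopointer(baz, bar)
    res = 0.0
    i, j = 1, length(bar)
    while i <= length(baz) && j >= 1
        t = baz[i].a + bar[j].c
        if t > 3
            j -= 1      # step 6a: shrink the sum from above
        elseif t < 3
            i += 1      # step 6b: grow the sum from below
        else
            res += 0.5 * (baz[i].a + bar[j].b)
            i += 1      # step 6c: move past the match (either pointer works)
        end
    end
    return res
end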

As is, this method will fail at step 6c if there are duplicates in the original list. Suppose that the input is

[ .... , foo1, foo2, ..., bar1, bar2, ... ]

and all of the pairs (foo1, bar1), (foo1, bar2), (foo2, bar1), and (foo2, bar2) satisfy the condition. Then when the pointers are at i: foo1 and j: bar2, you have to either increment i (missing the pair (foo1, bar1)) or decrement j (missing the pair (foo2, bar2)).

(Note that the 3SUM problem you linked allows you to assume that the original array contains no duplicates, because the entries are just numbers anyway.)

I think this part of your loop

                   while n > 0 && x[is[m]].a + x[js[n]].c == 3
                       s += 0.5 * (x[is[m]].a + x[js[n]].b)
                       n -= 1
                   end

is what corrects for the possibility of duplicates, but doesn’t this inner loop take O(n) time, rendering the whole thing O(n^2) in the worst case?

Actually, I suppose an accurate solution (if you allow duplicates) has to have O(n^2) runtime, because if the original list is just n copies of (a = 1, b = 3, c = 2), then the condition holds for every pair. What this illustrates is that I haven’t contrived my example very well :joy:. Coming up with representative data is never as easy as you think…

You’re right: my code is not very careful about duplicates. That inner loop is meant to take care of duplicates, but I think it actually has the bad property (in the context of the rest of the code) that it misses some duplicates.

I think for this problem it’s maybe useful to distinguish two types of theoretical analysis:

  1. Worst case behavior relative to the input length.
  2. Worst case behavior relative to the number of tuples in the Cartesian product that satisfy the predicate you filter on.

I think you’re right that it’s easy to construct examples where all tuples in the Cartesian product satisfy the predicate, so in those cases you’re always just going to have to iterate over all O(N^2) of them. So analysis (1) means you can’t ever do better than O(N^2).

But I suspect it’s often useful to contrast algorithms based on how much they exhibit O(N^2) behavior when the number of tuples that satisfy the predicate is very low, say closer to O(N) or even O(1).

Perhaps it’s useful to contrast DataFrames performance with handwritten solutions relative to analysis (2).


If, in your real case, it were possible to separate the for loops as in the example you reproduced, the execution times could be drastically reduced:

foos = [Foo() for _ = 1:100]

function query_forloop(foos::Vector{Foo})
    res = 0.0
    for foo in foos, bar in foos
        if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
            res += 0.5 * (foo.a + bar.b)
        end
    end
    return res
end
# Intermediate step: not yet equivalent (it ignores how many foos have
# a == 1); see the corrected definition below.
function query_forloop1(foos::Vector{Foo})
    res = 0.0
    #for foo in foos
        #if foo.a == 1
            for bar in foos
                if bar.b > 2 && bar.c == 2
                    res += 0.5 * (1 + bar.b)
                end
            end
        #end
    #end
    return res
end

function query_forloop1(foos::Vector{Foo})
    res = 0.0
    for bar in foos
        if bar.b > 2 && bar.c == 2
            res += 0.5 * (1 + bar.b)
        end
    end
    res *= count(f -> f.a == 1, foos)
    return res
end


julia> using BenchmarkTools

julia> @btime query_forloop(foos) 
  2.411 μs (0 allocations: 0 bytes)
715.0

julia> @btime query_forloop1(foos)  
  121.437 ns (0 allocations: 0 bytes)
715.0

Still speculating in a generic way on the problem (in the proposed MWE version), I note that innerjoin() could be preferable to subset():

julia> function subsdf(df)
           dfj=DataFrame(b=[3, 4,5],c=2)
           innerjoin(df,dfj, on=[:b,:c])
       end
subsdf (generic function with 1 method)

julia> @btime subsdf(df)
  11.600 μs (188 allocations: 14.24 KiB)
9×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      2
   2 │     1      3      2
   3 │     5      3      2
   4 │     3      4      2
   5 │     3      4      2
   6 │     4      5      2
   7 │     1      5      2
   8 │     3      5      2
   9 │     2      5      2

julia> @btime subset(df, [:b, :c] => (b, c) -> @. (b > 2) && (c == 2))
  27.900 μs (198 allocations: 10.67 KiB)
9×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      2
   2 │     1      3      2
   3 │     5      3      2
   4 │     3      4      2
   5 │     3      4      2
   6 │     4      5      2
   7 │     1      5      2
   8 │     3      5      2
   9 │     2      5      2

Better MWE

Thank you all for your responses. They have helped me see which features of my dataset are relevant to the problem at hand and enabled me to come up with a more representative minimal example. Consider data like this:

using BenchmarkTools
using TypedTables
using DataFrames
using Pkg
using Random

const Record = NamedTuple{(:name, :subject, :term, :score), Tuple{SubString{String}, SubString{String}, Int64, Float64}}

function makerecords(size::Int = 500)::Vector{Record}
    names = split("alice bob carl dana")
    subjects = split("english french math robotics swimming")
    terms = 2011:2011 + size

    res = [
        (; name, subject, term, score = rand())
        for name in names, subject in subjects, term in terms
    ][:]
    # shuffle for fair comparison
    shuffle!(res)
    # delete some random entries
    return res[1:floor(Int, 0.9 * end)]
end

So, we have a bunch of academic records for different students, in different terms, and different subjects. The condition for a “match” between two records record1 and record2 is:

  • The same student (same name)
  • The same term
  • The score in record1 is higher than in record2

For every pair of records matching these conditions, we want to add the score difference to the running total.

The way the data is constructed, there are only 5 subjects total. Thus, for a given (student, term) pair, we have at most a constant 5 * (5 - 1) / 2 = 10 score differences to add to the running total. Therefore, if we increase the number of students or terms in the dataset, although the number of possible pairs of records increases quadratically, the number of matching pairs increases only linearly.
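(A quick REPL check of that arithmetic, using the makerecords(500) defaults above: 4 students, 5 subjects, 501 terms.)

julia> binomial(5, 2)    # at most this many score differences per (student, term) group
10

julia> 4 * 501 * binomial(5, 2)    # upper bound on matching pairs across all groups
20040

julia> (4 * 5 * 501)^2    # ordered record pairs before the 10% deletion
100400400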

The naive approach is as follows:

function query_naive(records::Vector{Record})
    res = 0.0
    for record1 in records, record2 in records
        if (
            record1.name == record2.name &&
            record1.term == record2.term &&
            (diff = record1.score - record2.score) > 0.0
        )
            res += diff
        end
    end
    return res
end

Median execution time: 725 ms.

By spending O(n \log n) time to sort the records by name, then term, then score, we can use an approach like the one @johnmyleswhite suggested to keep track of the range of records matching a particular (name, term) pair and skip comparisons that have no chance of matching:

function query_sort(records::Vector{Record})
    records = sort(records, by = record -> (record.name, record.term, record.score))

    res = 0.0
    i = 1
    while i in keys(records)
        j = i+1
        while (
            j in keys(records) &&
            records[i].name == records[j].name &&
            records[i].term == records[j].term
        )
            # note: don't need to condition on this because we sorted by score above
        # @assert records[j].score - records[i].score ≥ 0
            res += records[j].score - records[i].score
            j += 1
        end
        i += 1
    end
    return res
end

In the benchmark results, this is the fastest method, with median time 3.9 ms. (It could be made even faster by sorting in place at the top.) If the number of subjects is constant, then the inner while loop runs in constant time, so the outer while loop is O(n), and the whole function is O(n \log n) because of the sorting.

But this approach makes me a little uneasy: how do I know, other than on unit tests with fake data, that I’m not missing any matches? I feel more confident using the following DataFrames approach, which does the same thing but uses groupby to ensure that I am actually grabbing all of the relevant records:

function query_dataframe(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    for student_term_df in groupby(df, [:name, :term])
        ax = axes(student_term_df, 1)
        for i in eachindex(ax), j in eachindex(ax)
            if (diff = student_term_df[j, :score] - student_term_df[i, :score]) > 0.0
                res += diff
            end
        end
    end
    return res
end

This had a median execution time of 8.5 ms, which is not as good as query_sort(), but it’s on the same order of magnitude, and it should be the same order of complexity (the inner for loops run in constant time for a fixed number of subjects, although the constant is now 5 * 5 instead of 5 * (5 - 1) / 2). It is certainly good enough for my purposes.

I hope that this illustrates the point I was trying to make in the original post:

  • Although importing DataFrames.jl is not necessary to solve this problem efficiently (we can manually achieve the same efficiency with careful sorting and keeping track of pointers), it does make the efficient solution a lot easier to write.
  • The dataframe solution is not just easier in terms of fewer lines of code, but it is also easier to look at the code and see that it is correct. There is no manual sorting or checking if indices are valid.
  • I can see other ways to solve the problem, too, such as constructing a dictionary that maps each unique (student, term) pair to the associated record indices, and then calling a function on the values of the dictionary (see the sketch after this list).
    • My issue isn’t “I don’t know how to code this” (I’ll code it if I have to); it’s that in a production context, any additional representations of my dataset (in this case, a dictionary) generate technical debt that I have to worry about maintaining and updating correctly if we insert a new record. (It doesn’t help that most of my work is in Python, where these things are all instance variables that have to be updated in a specific order and it is easy to lose track of what changes when.)
    • If all of my records exist in a dataframe, then I have only to keep the dataframe current, and I feel confident that my selection queries will always be correct.
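For reference, a minimal sketch of that dictionary approach (query_dict is a hypothetical name; it reuses the Record type from the benchmark):

# Group record indices by (name, term), then compare only within groups.
function query_dict(records::Vector{Record})
    groups = Dict{Tuple{SubString{String}, Int64}, Vector{Int}}()
    for (i, r) in pairs(records)
        push!(get!(groups, (r.name, r.term), Int[]), i)
    end
    res = 0.0
    for idxs in values(groups), i in idxs, j in idxs
        diff = records[j].score - records[i].score
        if diff > 0.0
            res += diff  # same (name, term), record j scored higher
        end
    end
    return res
end

It has the same complexity as the groupby version, but the point above stands: the index is one more representation to keep in sync.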

This is why I find myself longing for a “lightweight alternative” to a dataframe. Although I haven’t looked at the source code, my understanding is that dataframe libraries like DataFrames.jl provide efficient sort, select and groupby operations by maintaining things like a sortperm of each column as a linked list, updating the sortperm each time a new record is added. This obviates the need for “manual” sorting as done in query_sort(), and eliminates the need to come up with a bespoke sort order for every different kind of query I want to run on a vector of records. Hence my question:

  • Is there something out there that provides this functionality without the full overhead of DataFrames.jl? Or am I underestimating the amount of work DataFrames.jl is doing under the hood here, and should I accept that it is the right-sized solution if I am unwilling to write functions like query_sort() myself?

Benchmark details

Full benchmark script:

using BenchmarkTools
using TypedTables
using DataFrames
using Pkg
using Random

const Record = NamedTuple{(:name, :subject, :term, :score), Tuple{SubString{String}, SubString{String}, Int64, Float64}}

function makerecords(size::Int = 500)::Vector{Record}
    names = split("alice bob carl dana")
    subjects = split("english french math robotics swimming")
    terms = 2011:2011 + size

    res = [
        (; name, subject, term, score = rand())
        for name in names, subject in subjects, term in terms
    ][:]
    # shuffle for fair comparison
    shuffle!(res)
    # delete some random entries
    return res[1:floor(Int, 0.9 * end)]
end


function query_naive(records::Vector{Record})
    res = 0.0
    for record1 in records, record2 in records
        if (
            record1.name == record2.name &&
            record1.term == record2.term &&
            (diff = record1.score - record2.score) > 0.0
        )
            res += diff
        end
    end
    return res
end


function query_sort(records::Vector{Record})
    records = sort(records, by = record -> (record.name, record.term, record.score))

    res = 0.0
    i = 1
    while i in keys(records)
        j = i+1
        while (
            j in keys(records) &&
            records[i].name == records[j].name &&
            records[i].term == records[j].term
        )
            # note: don't need to condition on this because we sorted by score above
        # @assert records[j].score - records[i].score ≥ 0
            res += records[j].score - records[i].score
            j += 1
        end
        i += 1
    end
    return res
end


function query_dataframe(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    for student_term_df in groupby(df, [:name, :term])
        ax = axes(student_term_df, 1)
        for i in eachindex(ax), j in eachindex(ax)
            if (diff = student_term_df[j, :score] - student_term_df[i, :score]) > 0.0
                res += diff
            end
        end
    end
    return res
end


function main()
    @show VERSION

    records = makerecords()

    @assert (
        query_naive(records) ≈
        query_sort(records) ≈
        query_dataframe(records)
    )

    for fun in (query_naive, query_sort, query_dataframe)
        println(nameof(fun))
        display(@benchmark $fun($records))
    end
end

main()

Results:

VERSION = v"1.9.2"
query_naive
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  697.862 ms … 756.464 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     725.073 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   726.045 ms ±  19.627 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.
query_sort
BenchmarkTools.Trial: 1132 samples with 1 evaluation.
 Range (min … max):  3.953 ms …   8.178 ms  ┊ GC (min … max): 0.00% … 8.34%
 Time  (median):     4.166 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.400 ms ± 610.375 μs  ┊ GC (mean ± σ):  0.28% ± 1.57%

 Memory estimate: 1.10 MiB, allocs estimate: 4.
query_dataframe
BenchmarkTools.Trial: 522 samples with 1 evaluation.
 Range (min … max):  8.545 ms … 15.950 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.945 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.566 ms ±  1.301 ms  ┊ GC (mean ± σ):  1.31% ± 3.42%

 Memory estimate: 4.75 MiB, allocs estimate: 237092.