Lighter alternatives to a dataframe when you just want to run a few queries on a vector of data structs

Lately, I have been running into a lot of tasks where I generate a big long vector of instances of some kind of data struct:

julia> struct Foo; a::Int; b::Int; end

julia> foos = [Foo(rand(1:4), rand(1:4)) for _ in 1:10000]
10000-element Vector{Foo}:
 Foo(3, 4)
 Foo(1, 4)
 ...

Then I do some subsequent processing that is, essentially, a bunch of database select and join operations on foos. Something like:

julia> for foo in foos, bar in foos
           if foo.a + bar.b == 3
               # do stuff
           end
       end

(even this empty block takes like 15 seconds to run on this ancient netbook).

I realize that this, as written, is an O(n^2) operation, and I am committing the same error made by the authors of that one GAMS blog post that was widely criticized on these forums.

The solution in the JuMP response to the GAMS post is to use DataFrames.jl, which provides the select and join operations needed to run the “query” efficiently.

But something about importing an entire dataframes library seems like overkill to me. DataFrames.jl is more than selects and joins; it also provides vectorized operations so you can do an interactive split-apply-combine workflow, plus lots of other utilities for imputing NaNs, reshaping, etc. That’s a lot of heavy machinery that is not needed for simple problems such as the GAMS example, where we only need to run, like, one query.

Is there any kind of data structure that offers efficient select and join operations without the overhead of a full-blown dataframe?

Putting this in off-topic because I don’t necessarily care about Julia-specific answers (actually, I thought of this question today while contemplating whether I needed to pull a pandas dependency into a Python project that instantiates a big list of dataclasses.dataclass instances).

Possibilities to explore

Sometimes the long way around is the shortest way there. So, despite the above, DataFrames.jl may still be the best approach.

One nice “feature” of Julia is that one doesn’t even need special custom data structures for efficient and convenient tabular operations (like select, group, join, etc.).
No need to create and learn new types: your foos array works for tabular operations without any changes!

Julia Base provides map and filter; packages provide more: SplitApplyCombine.jl (a little of everything), FlexiJoins.jl (different joins), etc.
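To illustrate with the query from the original post: here is a minimal Base-only sketch (the names are hypothetical, not from any package) that replaces the nested loop with a Dict used as a hash index, so only the matching pairs are visited:

# Build a hash index over foo.a, then probe it: expected O(n) work overall
# instead of the O(n^2) nested loop.
struct Foo; a::Int; b::Int; end
foos = [Foo(rand(1:4), rand(1:4)) for _ in 1:10_000]

index = Dict{Int, Vector{Int}}()
for (i, foo) in pairs(foos)
    push!(get!(index, foo.a, Int[]), i)  # row indices grouped by the value of a
end

# Visit exactly the (foo, bar) pairs with foo.a + bar.b == 3:
for bar in foos
    for i in get(index, 3 - bar.b, Int[])
        foo = foos[i]
        # do stuff
    end
end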

Recently, I also registered the DataManipulation.jl package, which aims to be an all-in-one solution for data manipulation with native Julia structures (like arrays). It both aggregates existing packages with a compatible API and philosophy for convenience, and defines some functions itself.


What if you write a function with that?
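(Presumably meaning something like this sketch: wrap the loop in a function, so the work is compiled instead of running at global scope.)

function dostuff(foos)
    for foo in foos, bar in foos
        if foo.a + bar.b == 3
            # do stuff
        end
    end
end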


OK, I spent some time constructing a more representative example and writing a little benchmark. I think the issue is that the problem I’m trying to solve is inherently O(n^2), in that (in the worst case) we really do need to check every foo against every other foo.

In the contrived example below, the end goal is essentially just to calculate

sum(
    0.5 * (foo.a + bar.b)
    for foo in foos for bar in foos
    if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
)

Results

  • In the benchmark results, the fastest way to do this was essentially to do exactly what I have written above, checking the whole if for every (foo, bar) pair: query_forloop(), with median time 3.7 microseconds.
  • I tried to be clever by pre-filtering for foos where foo.a == 1 and bars where bar.b > 2 before forming the product, but this was slower: query_itertools(), 6.4 microseconds.
  • A similar approach with a TypedTable was slower: query_typedtable(), 47 microseconds.
  • As was a DataFrame (query_dataframe(): 590 microseconds), although I suspect there are ways to reduce the allocations in this function, as it has been a long time since I did a lot of DataFrames.jl work.

Remaining questions

  • Is my benchmark accurate? I always seem to fall into one of the pitfalls with these.
  • Can the table and dataframe approaches be improved?

Benchmark details

main.jl:

using BenchmarkTools
using TypedTables
using DataFrames
using Pkg


"Data struct."
struct Foo
    a::Int
    b::Int
    c::Int

    "Random, representative instance of this data struct."
    function Foo()
        return new(rand(1:5), rand(1:5), rand(1:5))
    end
end


"""
Typical calculation I want to run on a `Vector{Foo}` using
a for loop. Julia style?
"""
function query_forloop(foos::Vector{Foo})
    res = 0.0
    for foo in foos, bar in foos
        if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
            res += 0.5 * (foo.a + bar.b)
        end
    end
    return res
end

"""
Typical calculation I want to run on a `Vector{Foo}` using
various built in iteration helpers. Python style?
"""
function query_itertools(foos::Vector{Foo})
    return sum(
        Iterators.filter(
            Iterators.product(
                filter(foo -> foo.a == 1, foos),
                filter(bar -> bar.b > 2, foos),
            ),
        ) do (foo, bar)
            foo.a + bar.c == 3
        end,
        init = 0.0,
    ) do (foo, bar)
        0.5 * (foo.a + bar.b)
    end
end


"""
Typical calculation I want to run on a `Vector{Foo}` using
`TypedTables.jl`.
"""
function query_typedtable(foos::Vector{Foo})
    # I realize this NamedTuple conversion adds additional overhead, but 
    # changing the representation of Foo is not within the parameters of the
    # question, so we have to count this time.
    # This is pretty hard to read, huh?
    table = Table(
        (
            foo_a = foo.a,
            foo_b = foo.b,
            foo_c = foo.c,
            bar_a = bar.a,
            bar_b = bar.b,
            bar_c = bar.c,
        ) for (foo, bar) in Iterators.filter(
            Iterators.product(
                filter(foo -> foo.a == 1, foos),
                filter(bar -> bar.b > 2, foos),
            ),
        ) do (foo, bar)
            foo.a + bar.c == 3
        end
    )

    return sum(table, init = 0.0) do row
        0.5 * (row.foo_a + row.bar_b)
    end
end


"""
Typical calculation I want to run on a `Vector{Foo}` using
`DataFrames.jl`.
"""
function query_dataframe(foos::Vector{Foo})
    df = DataFrame(foos)
    foo_subset = subset(df, :a => a -> a .== 1)
    bar_subset = subset(df, :b => b -> b .> 2)
    xjoin = crossjoin(foo_subset, bar_subset, renamecols = "_foo" => "_bar")

    return sum(
        # note: not sure if falling back to Julia's filter is the right 
        # move here
        filter(eachrow(xjoin)) do row
            row.a_foo + row.c_bar == 3
        end,
        init = 0.0,
    ) do row
        0.5 * (row.a_foo + row.b_bar)
    end
end


const SIZE::Int = 100

function main()
    @show VERSION

    foos = [Foo() for _ = 1:SIZE]

    @assert (
        query_forloop(foos) ==
        query_itertools(foos) ==
        query_typedtable(foos) ==
        query_dataframe(foos)
    )

    for fun in (query_forloop, query_itertools, query_typedtable, query_dataframe)
        println(nameof(fun))
        display(@benchmark $fun($foos))
    end
end

main()

Then

julia --project main.jl | tee results.txt

yields results.txt:

VERSION = v"1.9.2"
query_forloop
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.776 μs …  10.196 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.797 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.104 μs ± 669.340 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.
query_itertools
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.708 μs … 491.093 μs  ┊ GC (min … max): 0.00% … 95.08%
 Time  (median):     6.401 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.051 μs ±  10.420 μs  ┊ GC (mean ± σ):  3.42% ±  2.30%

 Memory estimate: 5.00 KiB, allocs estimate: 2.
query_typedtable
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  42.745 μs …  2.761 ms  ┊ GC (min … max): 0.00% … 87.32%
 Time  (median):     47.035 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.555 μs ± 68.210 μs  ┊ GC (mean ± σ):  3.96% ±  3.26%

 Memory estimate: 63.66 KiB, allocs estimate: 23.
query_dataframe
BenchmarkTools.Trial: 7794 samples with 1 evaluation.
 Range (min … max):  544.023 μs …   2.824 ms  ┊ GC (min … max): 0.00% … 65.21%
 Time  (median):     590.213 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   634.502 μs ± 168.265 μs  ┊ GC (mean ± σ):  1.21% ±  4.71%

 Memory estimate: 223.55 KiB, allocs estimate: 7216.

It gets compiled away, of course :slight_smile: But my original question was about order of complexity, specifically whether there’s a way to reduce the runtime from O(n^2) if you happen to know that the number of (foo, bar) pairs that pass the condition is small (in the problem I was working on this week, proportional to n rather than n^2). (And, of course, the # do stuff part is nonempty.)

But since I got such great suggestions for Julia packages to try, I wrote a Julia benchmark anyhow, so the focus of my question has shifted a bit.

Tried to be clever but the results are middling:

"""
Branchless version of `query_forloop()`.
"""
function query_branchless(foos::Vector{Foo})
    return sum(
        ((foo.a == 1) * (bar.b > 2) * (foo.a + bar.c == 3) * 0.5 * (foo.a + bar.b)
         for foo in foos for bar in foos),
        init = 0.0,
    )
end
query_branchless
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.652 μs …  1.843 ms  ┊ GC (min … max): 0.00% … 95.73%
 Time  (median):     16.401 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.502 μs ± 18.567 μs  ┊ GC (mean ± σ):  1.01% ±  0.96%

You could use

for i in 1:length(foos), j in i+1:length(foos)

to avoid running over repeated pairs. Anything else would be a different computation than your first example, and more specific details of your actual problem would be needed. I don’t think any package can avoid this O(n^2) loop in principle.

But the code does nothing, so the timings are meaningless, no?

The code in the benchmark does something (click the arrow next to “Details” to expand).

It’s not clear to me where bar.c is defined.
Whatever number it is, if foo.a == 1, then instead of checking foo.a + bar.c == 3, isn’t it easier to check that bar.c == 2?

So I think this problem can actually be solved in O(N log(N)) because you can sort the data and use the sorting invariant to optimize the iteration.

Assuming I’ve gotten things right, I think the following code shows how to use a 3SUM like approach to do this sum faster:

julia> x = [(a = rand(1:10), b = rand(1:10), c = rand(1:10)) for i in 1:100_000];

julia> function faster_sum(x)
           is = sortperm(x, lt = (t1, t2) -> t1.a < t2.a)
           js = sortperm(x, lt = (t1, t2) -> t1.c < t2.c)

           filter!(i -> x[i].a == 1, is)
           filter!(j -> x[j].b > 2, js)

           m = 1
           n = length(js)

           s = 0.0

           while m <= n
               if m > length(is)
                   return s
               end

               while n > 0 && x[is[m]].a + x[js[n]].c > 3
                   n -= 1
               end

               if n == 0
                   return s
               end

               min_n = 0
               if x[is[m]].a + x[js[n]].c == 3
                   max_n = n
                   while n > 0 && x[is[m]].a + x[js[n]].c == 3
                       s += 0.5 * (x[is[m]].a + x[js[n]].b)
                       n -= 1
                   end
                   min_n = n
                   n = max_n
                   m += 1
               elseif x[is[m]].a + x[js[n]].c < 3
                   m += 1
                   n = min_n - 1
               end
           end

           s
       end
faster_sum (generic function with 1 method)

julia> faster_sum(x)
2.9932832e8

julia> sum(
           0.5 * (foo.a + bar.b)
           for foo in x for bar in x
           if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
       )
2.9932832e8

An alternative: probably not faster, but from a different point of view.

function falls(foos)
    A=findall(f->f.a==1, foos)
    C=findall(f->f.c==2, foos)
    B=findall(f->f.b>2, foos)
    BC=intersect(B,C)
    (sum(f->f.b,foos[BC])+length(BC))*length(A)*0.5
end

The faster form:

function falls(foos)
    A=findall(f->f.a==1, foos)
    BC=findall(f->f.c==2&&f.b>2, foos)
    (sum(f->f.b,foos[BC])+length(BC))*length(A)*0.5
end

In the slightly more fleshed-out example I gave in my follow-up comment, the struct has three fields: a, b, and c.

Yes, in this case you can just math out that the condition is equivalent to bar.c == 2, but this is a contrived example for the purposes of illustration. :slight_smile: The point is that we have a boolean filter condition that depends on the specific foo, bar pair.

In my humble opinion, there is not much sense in discussing these alternatives without the exact use case. Being able, or not, to define an O(n) algorithm for a pairwise comparison like that depends on the nature of the comparison and the properties of the elements. A whole class of problems where such things can be done are those tackled by nearest-neighbor algorithms, but those depend on the definition of some distance metric.
For other problems there can’t be such an algorithm. So, the closer the MWE gets to the true problem, the likelier it is that we actually find an optimal solution.


Had to study this for a bit, but I think I get it (gratuitously changing symbols in an effort to cement my own understanding):

  1. Make two copies of the original list, call them baz and bar.
  2. Filter baz and bar in place, so that baz only contains those that satisfy the .a == 1 condition, and bar only contains those that satisfy the .b > 2 condition.
  3. Now sort baz and bar in place, the former by .a, and the latter by .c. (The order of steps 2 and 3 doesn’t matter.)

Everything up until this point was O(n \log n).

  4. Start one pointer i at the beginning of baz and another pointer j at the end of bar.
  5. Let res = 0.
  6. While both pointers are in range:
    a. If baz[i].a + bar[j].c > 3, then decrement j (which weakly decreases the value of baz[i].a + bar[j].c at the next iteration).
    b. If < 3, then increment i (which weakly increases the value at the next iteration).
    c. If == 3, then baz[i] and bar[j] satisfy the condition, so increment res by 0.5 * (baz[i].a + bar[j].b). Now increment i or decrement j (doesn’t matter which) and proceed.

This is all O(n), because every iteration either increments i or decrements j, so the walk makes at most length(baz) + length(bar) steps.
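For concreteness, a minimal sketch of that walk (the name twopointer is hypothetical; baz and bar are assumed to be already filtered and sorted as in steps 2 and 3, and duplicates are ignored):

function twopointer(baz, bar)
    res = 0.0
    i, j = 1, length(bar)
    while i <= length(baz) && j >= 1
        t = baz[i].a + bar[j].c
        if t > 3
            j -= 1      # step 6a: shrink the sum from above
        elseif t < 3
            i += 1      # step 6b: grow the sum from below
        else
            res += 0.5 * (baz[i].a + bar[j].b)
            i += 1      # step 6c: move past the match (either pointer works)
        end
    end
    return res
end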

As is, this method will fail at step 6c if there are duplicates in the original list. Suppose that the input is

[ .... , foo1, foo2, ..., bar1, bar2, ... ]

and all of the pairs (foo1, bar1), (foo1, bar2), (foo2, bar1), and (foo2, bar2) satisfy the condition. Then when the pointers are at i: foo1 and j: bar2, you have to either increment i (missing the pair (foo1, bar1)) or decrement j (missing the pair (foo2, bar2)).

(Note that the 3SUM problem you linked allows you to assume that the original array contains no duplicates, because the entries are just numbers anyway.)

I think this part of your loop

                   while n > 0 && x[is[m]].a + x[js[n]].c == 3
                       s += 0.5 * (x[is[m]].a + x[js[n]].b)
                       n -= 1
                   end

is what corrects for the possibility of duplicates, but doesn’t this inner loop take O(n) time, rendering the whole thing O(n^2) in the worst case?

Actually, I suppose an accurate solution (if you allow duplicates) has to have O(n^2) runtime, because if the original list is just n copies of (a = 1, b = 3, c = 2), then the condition holds for every pair. What this illustrates is that I haven’t contrived my example very well :joy:. Coming up with representative data is never as easy as you think…

You’re right: my code is not very careful about duplicates. That inner loop is meant to take care of duplicates, but I think it actually has the bad property (in the context of the rest of the code) that it misses some duplicates.

I think for this problem it’s maybe useful to distinguish two types of theoretical analysis:

  1. Worst case behavior relative to the input length.
  2. Worst case behavior relative to the number of tuples in the Cartesian product that satisfy the predicate you filter on.

I think you’re right that it’s easy to construct examples where all tuples in the Cartesian product satisfy the predicate, so in those cases you’re always just going to have to iterate over all O(N^2) of them. So analysis (1) means you can’t ever do better than O(N^2).

But I suspect it’s often useful to contrast algorithms based on how much they exhibit O(N^2) behavior when the number of tuples that satisfy the predicate is very low, say closer to O(N) or even O(1).

Perhaps it’s useful to contrast DataFrames performance with handwritten solutions relative to analysis (2).


If, in your real case, it were possible to separate the for loops as in the example you reproduced, the execution times could be drastically reduced:

foos = [Foo() for _ = 1:100]

function query_forloop(foos::Vector{Foo})
    res = 0.0
    for foo in foos, bar in foos
        if foo.a == 1 && bar.b > 2 && foo.a + bar.c == 3
            res += 0.5 * (foo.a + bar.b)
        end
    end
    return res
end
# Intermediate step: not yet equivalent (it ignores how many foos have
# a == 1); see the corrected definition below.
function query_forloop1(foos::Vector{Foo})
    res = 0.0
    #for foo in foos
        #if foo.a == 1
            for bar in foos
                if bar.b > 2 && bar.c == 2
                    res += 0.5 * (1 + bar.b)
                end
            end
        #end
    #end
    return res
end

function query_forloop1(foos::Vector{Foo})
    res = 0.0
    for bar in foos
        if bar.b > 2 && bar.c == 2
            res += 0.5 * (1 + bar.b)
        end
    end
    res *= count(f -> f.a == 1, foos)
    return res
end


julia> using BenchmarkTools

julia> @btime query_forloop(foos) 
  2.411 μs (0 allocations: 0 bytes)
715.0

julia> @btime query_forloop1(foos)  
  121.437 ns (0 allocations: 0 bytes)
715.0

Still speculating in a generic way on the problem (in the proposed MWE version), I note that innerjoin() could be preferable to subset():

julia> function subsdf(df)
           dfj=DataFrame(b=[3, 4,5],c=2)
           innerjoin(df,dfj, on=[:b,:c])
       end
subsdf (generic function with 1 method)

julia> @btime subsdf(df)
  11.600 μs (188 allocations: 14.24 KiB)
9×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      2
   2 │     1      3      2
   3 │     5      3      2
   4 │     3      4      2
   5 │     3      4      2
   6 │     4      5      2
   7 │     1      5      2
   8 │     3      5      2
   9 │     2      5      2

julia> @btime subset(df, [:b, :c] => (b, c) -> @. (b > 2) && (c == 2))
  27.900 μs (198 allocations: 10.67 KiB)
9×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      2
   2 │     1      3      2
   3 │     5      3      2
   4 │     3      4      2
   5 │     3      4      2
   6 │     4      5      2
   7 │     1      5      2
   8 │     3      5      2
   9 │     2      5      2

Better MWE

Thank you all for your responses. They have helped me see which features of my dataset are relevant to the problem at hand and enabled me to come up with a more representative minimal example. Consider data like this:

using BenchmarkTools
using TypedTables
using DataFrames
using Pkg
using Random

const Record = NamedTuple{(:name, :subject, :term, :score), Tuple{SubString{String}, SubString{String}, Int64, Float64}}

function makerecords(size::Int = 500)::Vector{Record}
    names = split("alice bob carl dana")
    subjects = split("english french math robotics swimming")
    terms = 2011:2011 + size

    res = [
        (; name, subject, term, score = rand())
        for name in names, subject in subjects, term in terms
    ][:]
    # shuffle for fair comparison
    shuffle!(res)
    # delete some random entries
    return res[1:floor(Int, 0.9 * end)]
end

So, we have a bunch of academic records for different students, in different terms, and different subjects. The condition for a “match” between two records record1 and record2 is:

  • The same student (same name)
  • The same term
  • The score in record1 is higher than in record2

For every pair of records matching these conditions, we want to add the score difference to the running total.

The way the data is constructed, there are only 5 subjects total. Thus, for a given (student, term) pair, we have at most a constant 5 * (5 - 1) / 2 = 10 score differences to add to the running total. Therefore, if we increase the number of students or terms in the dataset, although the number of possible pairs of records increases quadratically, the number of matching pairs increases only linearly.
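(A quick REPL check of that arithmetic, using the makerecords(500) defaults above: 4 students, 5 subjects, 501 terms.)

julia> binomial(5, 2)    # at most this many score differences per (student, term) group
10

julia> 4 * 501 * binomial(5, 2)    # upper bound on matching pairs across all groups
20040

julia> (4 * 5 * 501)^2    # ordered record pairs before the 10% deletion
100400400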

The naive approach is as follows:

function query_naive(records::Vector{Record})
    res = 0.0
    for record1 in records, record2 in records
        if (
            record1.name == record2.name &&
            record1.term == record2.term &&
            (diff = record1.score - record2.score) > 0.0
        )
            res += diff
        end
    end
    return res
end

Median execution time: 725 ms.

By spending O(n \log n) time to sort the records by name, then term, then score, we can use an approach like the one @johnmyleswhite suggested to keep track of the range of records matching a particular (name, term) pair and skip comparisons that have no chance of matching:

function query_sort(records::Vector{Record})
    records = sort(records, by = record -> (record.name, record.term, record.score))

    res = 0.0
    i = 1
    while i in keys(records)
        j = i+1
        while (
            j in keys(records) &&
            records[i].name == records[j].name &&
            records[i].term == records[j].term
        )
            # note: don't need to condition on this because we sorted by score above
        # @assert records[j].score - records[i].score ≥ 0
            res += records[j].score - records[i].score
            j += 1
        end
        i += 1
    end
    return res
end

In the benchmark results, this is the fastest method, with median time 3.9 ms. (It could be made even faster by sorting in place at the top.) If the number of subjects is constant, then the inner while loop runs in constant time, so the outer while loop is O(n), and the whole function is O(n \log n) because of the sorting.

But this approach makes me a little uneasy: how do I know, other than on unit tests with fake data, that I’m not missing any matches? I feel more confident using the following DataFrames approach, which does the same thing but uses groupby to ensure that I am actually grabbing all of the relevant records:

function query_dataframe(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    for student_term_df in groupby(df, [:name, :term])
        ax = axes(student_term_df, 1)
        for i in eachindex(ax), j in eachindex(ax)
            if (diff = student_term_df[j, :score] - student_term_df[i, :score]) > 0.0
                res += diff
            end
        end
    end
    return res
end

This had a median execution time of 8.5 ms, which is not as good as query_sort(), but it’s on the same order of magnitude, and it should be the same order of complexity (the inner for loops run in constant time for a fixed number of subjects, although the constant is now 5 * 5 instead of 5 * (5 - 1) / 2). It is certainly good enough for my purposes.

I hope that this illustrates the point I was trying to make in the original post:

  • Although importing DataFrames.jl is not necessary to solve this problem efficiently (we can manually achieve the same efficiency with careful sorting and keeping track of pointers), it does make the efficient solution a lot easier to write.
  • The dataframe solution is not just easier in terms of fewer lines of code, but it is also easier to look at the code and see that it is correct. There is no manual sorting or checking if indices are valid.
  • I can see other ways to solve the problem, too, such as constructing a dictionary that maps each unique (student, term) pair to the associated record indices, and then calling a function on the values of the dictionary (see the sketch after this list).
    • My issue isn’t “I don’t know how to code this” (I’ll code it if I have to); it’s that in a production context, any additional representations of my dataset (in this case, a dictionary) generate technical debt that I have to worry about maintaining and updating correctly if we insert a new record. (It doesn’t help that most of my work is in Python, where these things are all instance variables that have to be updated in a specific order and it is easy to lose track of what changes when.)
    • If all of my records exist in a dataframe, then I have only to keep the dataframe current, and I feel confident that my selection queries will always be correct.
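For reference, a minimal sketch of that dictionary approach (query_dict is a hypothetical name; it reuses the Record type from the benchmark):

# Group record indices by (name, term), then compare only within groups.
function query_dict(records::Vector{Record})
    groups = Dict{Tuple{SubString{String}, Int64}, Vector{Int}}()
    for (i, r) in pairs(records)
        push!(get!(groups, (r.name, r.term), Int[]), i)
    end
    res = 0.0
    for idxs in values(groups), i in idxs, j in idxs
        diff = records[j].score - records[i].score
        if diff > 0.0
            res += diff  # same (name, term), record j scored higher
        end
    end
    return res
end

It has the same complexity as the groupby version, but the point above stands: the index is one more representation to keep in sync.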

This is why I find myself longing for a “lightweight alternative” to a dataframe. Although I haven’t looked at the source code, my understanding is that dataframe libraries like DataFrames.jl provide efficient sort, select and groupby operations by maintaining things like a sortperm of each column as a linked list, updating the sortperm each time a new record is added. This obviates the need for “manual” sorting as done in query_sort(), and eliminates the need to come up with a bespoke sort order for every different kind of query I want to run on a vector of records. Hence my question:

  • Is there something out there that provides this functionality without the full overhead of DataFrames.jl? Or am I underestimating the amount of work DataFrames.jl is doing under the hood here, and should I accept that it is the right-sized solution if I am unwilling to write functions like query_sort() myself?

Benchmark details

Full benchmark script:

using BenchmarkTools
using TypedTables
using DataFrames
using Pkg
using Random

const Record = NamedTuple{(:name, :subject, :term, :score), Tuple{SubString{String}, SubString{String}, Int64, Float64}}

function makerecords(size::Int = 500)::Vector{Record}
    names = split("alice bob carl dana")
    subjects = split("english french math robotics swimming")
    terms = 2011:2011 + size

    res = [
        (; name, subject, term, score = rand())
        for name in names, subject in subjects, term in terms
    ][:]
    # shuffle for fair comparison
    shuffle!(res)
    # delete some random entries
    return res[1:floor(Int, 0.9 * end)]
end


function query_naive(records::Vector{Record})
    res = 0.0
    for record1 in records, record2 in records
        if (
            record1.name == record2.name &&
            record1.term == record2.term &&
            (diff = record1.score - record2.score) > 0.0
        )
            res += diff
        end
    end
    return res
end


function query_sort(records::Vector{Record})
    records = sort(records, by = record -> (record.name, record.term, record.score))

    res = 0.0
    i = 1
    while i in keys(records)
        j = i+1
        while (
            j in keys(records) &&
            records[i].name == records[j].name &&
            records[i].term == records[j].term
        )
            # note: don't need to condition on this because we sorted by score above
        # @assert records[j].score - records[i].score ≥ 0
            res += records[j].score - records[i].score
            j += 1
        end
        i += 1
    end
    return res
end


function query_dataframe(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    for student_term_df in groupby(df, [:name, :term])
        ax = axes(student_term_df, 1)
        for i in eachindex(ax), j in eachindex(ax)
            if (diff = student_term_df[j, :score] - student_term_df[i, :score]) > 0.0
                res += diff
            end
        end
    end
    return res
end


function main()
    @show VERSION

    records = makerecords()

    @assert (
        query_naive(records) ≈
        query_sort(records) ≈
        query_dataframe(records)
    )

    for fun in (query_naive, query_sort, query_dataframe)
        println(nameof(fun))
        display(@benchmark $fun($records))
    end
end

main()

Results:

VERSION = v"1.9.2"
query_naive
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  697.862 ms … 756.464 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     725.073 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   726.045 ms ±  19.627 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.
query_sort
BenchmarkTools.Trial: 1132 samples with 1 evaluation.
 Range (min … max):  3.953 ms …   8.178 ms  ┊ GC (min … max): 0.00% … 8.34%
 Time  (median):     4.166 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.400 ms ± 610.375 μs  ┊ GC (mean ± σ):  0.28% ± 1.57%

 Memory estimate: 1.10 MiB, allocs estimate: 4.
query_dataframe
BenchmarkTools.Trial: 522 samples with 1 evaluation.
 Range (min … max):  8.545 ms … 15.950 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.945 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.566 ms ±  1.301 ms  ┊ GC (mean ± σ):  1.31% ± 3.42%

 Memory estimate: 4.75 MiB, allocs estimate: 237092.