Better MWE
Thank you all for your responses. They have helped me see which features of my dataset are relevant to the problem at hand, and I have now come up with a more representative minimal example. Consider data like this:
```julia
using BenchmarkTools
using TypedTables
using DataFrames
using Pkg
using Random

const Record = NamedTuple{(:name, :subject, :term, :score), Tuple{SubString{String}, SubString{String}, Int64, Float64}}

function makerecords(size::Int = 500)::Vector{Record}
    names = split("alice bob carl dana")
    subjects = split("english french math robotics swimming")
    terms = 2011:2011 + size
    res = [
        (; name, subject, term, score = rand())
        for name in names, subject in subjects, term in terms
    ][:]
    # shuffle for fair comparison
    shuffle!(res)
    # delete some random entries
    return res[1:floor(Int, 0.9 * end)]
end
```
So, we have a bunch of academic records for different students, in different terms, and different subjects. The condition for a “match” between two records `record1` and `record2` is:
- The same student (same `name`)
- The same `term`
- The `score` in `record1` is higher than in `record2`
For every pair of records matching these conditions, we want to add the score difference to the running total.
The way the data is constructed, there are only 5 subjects total. Thus, for a given (student, term) pair, we have at most a constant 5 × (5 − 1) / 2 = 10 score differences to add to the running total. Therefore, if we increase the number of students or terms in the dataset, although the number of possible pairs of records increases quadratically, the number of matching pairs increases only linearly.
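To make that constant explicit (`n_subjects` below is just an illustrative name, not part of the benchmark code): the worst case per (student, term) group is one record per subject, so the number of matching pairs is the number of unordered pairs of subjects:

```julia
# per-group upper bound on matching pairs: choose 2 of the 5 subjects;
# ties have probability zero, so exactly one ordering of each pair is positive
n_subjects = 5
@assert binomial(n_subjects, 2) == n_subjects * (n_subjects - 1) ÷ 2 == 10
```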
The naive approach is as follows:
```julia
function query_naive(records::Vector{Record})
    res = 0.0
    for record1 in records, record2 in records
        if (
            record1.name == record2.name &&
            record1.term == record2.term &&
            (diff = record1.score - record2.score) > 0.0
        )
            res += diff
        end
    end
    return res
end
```
Median execution time: 725 ms.
By spending O(n \log n) time to sort the records by `name`, then `term`, then `score`, we can use an approach like @johnmyleswhite suggested to keep track of the range of records matching a particular (name, term) pair and skip comparisons that have no chance of matching:
```julia
function query_sort(records::Vector{Record})
    records = sort(records, by = record -> (record.name, record.term, record.score))
    res = 0.0
    i = 1
    while i in keys(records)
        j = i + 1
        while (
            j in keys(records) &&
            records[i].name == records[j].name &&
            records[i].term == records[j].term
        )
            # note: don't need to condition on this because we sorted by score above
            # @assert records[j].score - records[i].score ≥ 0
            res += records[j].score - records[i].score
            j += 1
        end
        i += 1
    end
    return res
end
```
In the benchmark results, this is the fastest method, with a median time of 4.2 ms. (It could be made even faster by sorting in place at the top.) If the number of subjects is constant, then the inner `while` loop runs in constant time, so the outer `while` loop is O(n), and the whole function is O(n \log n) because of the sorting.
But this approach makes me a little uneasy: how do I know, other than from unit tests on fake data, that I'm not missing any matches? I feel more confident using the following DataFrames approach, which does the same thing but uses `groupby` to ensure that I am actually grabbing all of the relevant records:
```julia
function query_dataframe(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    for student_term_df in groupby(df, [:name, :term])
        ax = axes(student_term_df, 1)
        for i in eachindex(ax), j in eachindex(ax)
            if (diff = student_term_df[j, :score] - student_term_df[i, :score]) > 0.0
                res += diff
            end
        end
    end
    return res
end
```
This had a median execution time of 8.9 ms, which is not as good as `query_sort()`, but it's on the same order of magnitude, and it should be the same order of complexity (the inner `for` loops run in constant time for a fixed number of subjects, although the constant is now 5 × 5 instead of 5 × (5 − 1) / 2). It is certainly good enough for my purposes.
I hope that this illustrates the point I was trying to make in the original post:
- Although importing DataFrames.jl is not necessary to solve this problem efficiently (we can manually achieve the same efficiency with careful sorting and index bookkeeping), it does make the efficient solution a lot easier to write.
- The dataframe solution is not just easier in terms of fewer lines of code, but it is also easier to look at the code and see that it is correct. There is no manual sorting or checking if indices are valid.
- I can see other ways to solve the problem, too, such as constructing a dictionary that maps each unique (student, term) pair to the associated record indices, and then calling a function on the values of the dictionary.
- My issue isn't “I don't know how to code this”; I'll code it if I have to. It's that in a production context, any additional representations of my dataset (in this case, a dictionary) generate technical debt that I have to worry about maintaining and updating correctly when we insert a new record. (It doesn't help that most of my work is in Python, where these things are all instance variables that have to be updated in a specific order, and it is easy to lose track of what changes when.)
- If all of my records exist in a dataframe, then I have only to keep the dataframe current, and I feel confident that my selection queries will always be correct.
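For reference, here is a sketch of that dictionary approach (`query_dict` is my own name for it, and it stores the scores directly rather than record indices, for brevity). It groups scores by (name, term) in one pass and then compares within each group:

```julia
# group scores by (name, term), then sum positive differences within each group
function query_dict(records)
    groups = Dict{Tuple{AbstractString, Int}, Vector{Float64}}()
    for r in records
        # get!(...) creates an empty score list the first time a key is seen
        push!(get!(groups, (r.name, r.term), Float64[]), r.score)
    end
    res = 0.0
    for scores in values(groups)
        for i in eachindex(scores), j in eachindex(scores)
            # each unordered pair contributes once, in the positive direction
            diff = scores[j] - scores[i]
            diff > 0.0 && (res += diff)
        end
    end
    return res
end
```

With a constant number of subjects this is O(n) after the grouping pass, but the dictionary is exactly the kind of extra representation I would rather not maintain.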
This is why I find myself longing for a “lightweight alternative” to a dataframe. Although I haven't looked at the source code, my understanding is that dataframe libraries like DataFrames.jl provide efficient `sort`, `select`, and `groupby` operations by maintaining things like a `sortperm` of each column as a linked list, updating the `sortperm` each time a new record is added. This obviates the need for the “manual” sorting done in `query_sort()`, and eliminates the need to come up with a bespoke sort order for every different kind of query I want to run on a vector of records. Hence my question:
- Is there something out there that provides this functionality without the full overhead of DataFrames.jl? Or am I underestimating the amount of work DataFrames.jl is doing under the hood here, and should I accept that it is the right-sized solution if I am unwilling to write functions like `query_sort()` myself?
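For completeness, one Base-only middle ground I can see (a sketch under the same assumptions as `query_sort`; `query_searchsorted` is my name, not an existing package API): sort once, then let `searchsorted` find each (name, term) block, so the manual bounds checks disappear:

```julia
# sketch: after one sort, searchsorted returns the full range of records
# sharing a (name, term) key, replacing the hand-rolled index bookkeeping
function query_searchsorted(records)
    bykey = r -> (r.name, r.term)
    sorted = sort(records; by = r -> (r.name, r.term, r.score))
    res = 0.0
    i = firstindex(sorted)
    while i <= lastindex(sorted)
        # all indices whose (name, term) equals that of sorted[i]
        block = searchsorted(sorted, sorted[i]; by = bykey)
        for a in block, b in (a + 1):last(block)
            # nonnegative because each block is also sorted by score
            res += sorted[b].score - sorted[a].score
        end
        i = last(block) + 1
    end
    return res
end
```

This still bakes a bespoke sort order into the query, though, which is the part I was hoping a library would own for me.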
Benchmark details
Full benchmark script:
```julia
using BenchmarkTools
using TypedTables
using DataFrames
using Pkg
using Random

const Record = NamedTuple{(:name, :subject, :term, :score), Tuple{SubString{String}, SubString{String}, Int64, Float64}}

function makerecords(size::Int = 500)::Vector{Record}
    names = split("alice bob carl dana")
    subjects = split("english french math robotics swimming")
    terms = 2011:2011 + size
    res = [
        (; name, subject, term, score = rand())
        for name in names, subject in subjects, term in terms
    ][:]
    # shuffle for fair comparison
    shuffle!(res)
    # delete some random entries
    return res[1:floor(Int, 0.9 * end)]
end

function query_naive(records::Vector{Record})
    res = 0.0
    for record1 in records, record2 in records
        if (
            record1.name == record2.name &&
            record1.term == record2.term &&
            (diff = record1.score - record2.score) > 0.0
        )
            res += diff
        end
    end
    return res
end

function query_sort(records::Vector{Record})
    records = sort(records, by = record -> (record.name, record.term, record.score))
    res = 0.0
    i = 1
    while i in keys(records)
        j = i + 1
        while (
            j in keys(records) &&
            records[i].name == records[j].name &&
            records[i].term == records[j].term
        )
            # note: don't need to condition on this because we sorted by score above
            # @assert records[j].score - records[i].score ≥ 0
            res += records[j].score - records[i].score
            j += 1
        end
        i += 1
    end
    return res
end

function query_dataframe(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    for student_term_df in groupby(df, [:name, :term])
        ax = axes(student_term_df, 1)
        for i in eachindex(ax), j in eachindex(ax)
            if (diff = student_term_df[j, :score] - student_term_df[i, :score]) > 0.0
                res += diff
            end
        end
    end
    return res
end

function main()
    @show VERSION
    records = makerecords()
    @assert (
        query_naive(records) ≈
        query_sort(records) ≈
        query_dataframe(records)
    )
    for fun in (query_naive, query_sort, query_dataframe)
        println(nameof(fun))
        display(@benchmark $fun($records))
    end
end

main()
```
Results:
```
VERSION = v"1.9.2"
query_naive
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  697.862 ms … 756.464 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     725.073 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   726.045 ms ± 19.627 ms   ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: frequency by time, 698 ms to 756 ms]
 Memory estimate: 0 bytes, allocs estimate: 0.

query_sort
BenchmarkTools.Trial: 1132 samples with 1 evaluation.
 Range (min … max):  3.953 ms … 8.178 ms     ┊ GC (min … max): 0.00% … 8.34%
 Time  (median):     4.166 ms                ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.400 ms ± 610.375 μs  ┊ GC (mean ± σ):  0.28% ± 1.57%
 [histogram omitted: frequency by time, 3.95 ms to 7.27 ms]
 Memory estimate: 1.10 MiB, allocs estimate: 4.

query_dataframe
BenchmarkTools.Trial: 522 samples with 1 evaluation.
 Range (min … max):  8.545 ms … 15.950 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.945 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.566 ms ± 1.301 ms   ┊ GC (mean ± σ):  1.31% ± 3.42%
 [histogram omitted: log(frequency) by time, 8.54 ms to 15.1 ms]
 Memory estimate: 4.75 MiB, allocs estimate: 237092.
```