```julia
julia> function query_dataframe1(records::Vector{Record})
           df = DataFrame(records)
           res = 0.0
           grs = groupby(df, [:name, :term])
           for gr in grs
               res += sum(e for e in (gr.score' .- gr.score) if e > 0)
           end
           return res
       end
query_dataframe1 (generic function with 1 method)

julia> @btime query_dataframe1(r)
  2.417 ms (38181 allocations: 2.54 MiB)
5418.554731103728

julia> @btime query_sort(r)
  2.294 ms (4 allocations: 1.10 MiB)
5418.554731103722
```
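(For context, `query_sort` itself isn't quoted in this excerpt. A sort-based formulation along those lines could look like the sketch below; it assumes a `Record` with `name`, `term`, and `score` fields, and is not necessarily the original.)

```julia
function query_sort_sketch(records::Vector{Record})
    # Sort once by (name, term, score); groups then form contiguous runs
    sorted = sort(records; by = r -> (r.name, r.term, r.score))
    res = 0.0
    i = 1
    while i <= length(sorted)
        # Find the end of the current (name, term) run
        j = i
        while j < length(sorted) &&
              sorted[j + 1].name == sorted[i].name &&
              sorted[j + 1].term == sorted[i].term
            j += 1
        end
        # Within the run, scores ascend: the k-th of L scores contributes
        # with net coefficient (k - 1) - (L - k) = 2k - L - 1
        L = j - i + 1
        for k in 1:L
            res += (2k - L - 1) * sorted[i + k - 1].score
        end
        i = j + 1
    end
    return res
end
```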
Yet another illustration of convenient and efficient table manipulation simply using arrays, without any fancy specialized types:
- A direct translation of `query_dataframe1`:
```julia
using StructArrays
using DataManipulation

query_group(records) = @p let
    records
    convert(StructArray, __)
    group((_.name, _.term))
    sum() do gr
        sum(s1 - s2 for (s1, s2) in Iterators.product(gr.score, gr.score) if s1 > s2; init=0.0)
    end
end

@btime query_group(r)
# 796.542 μs (51 allocations: 1.88 MiB)

# and if you already work with columnar tables and r is a StructArray:
@btime query_group($(StructArray(r)))
# 716.573 μs (42 allocations: 1.33 MiB)
```
- The most straightforward solution, putting all conditions into `join`: slower but still pretty good, much better than naive looping. I guess FlexiJoins could use some optimization for complex join conditions…
```julia
using FlexiJoins

query_join(records) = @p let
    records
    innerjoin((__, __), by_key((:name, :term)) & by_pred(:score, >, :score))
    sum(_[1].score - _[2].score)
end

@btime query_join(r)
# 8.426 ms (27 allocations: 794.38 KiB)
```
Another variation (which doesn't depend on many supporting packages):
```julia
import IterTools as Itr

function process_group(records, grp)
    # Net multiplicity of each score in the pairwise sum: for a sorted group
    # of length L the coefficients are -(L-1), -(L-3), ..., L-3, L-1 (step 2)
    L = length(grp)
    coeffs = range(-2*(L >> 1) + ((L+1) & 1); step=2, length=L)
    return mapreduce(*, +, Iterators.map(i -> records[i].score, grp), coeffs)
end

function query_sortperm(records::Vector{Record})
    # Sort indices by (name, term, score); equal (name, term) runs become groups
    p = sortperm(records, by = record -> (record.name, record.term, record.score))
    return sum(Iterators.map(grp -> process_group(records, grp),
                             Itr.groupby(i -> hash((records[i].name, records[i].term)), p)))
end
```
The benchmarks:
```julia
julia> @btime query_sortperm($records)
  3.559 ms (10028 allocations: 579.58 KiB)

julia> @btime query_sort($records)
  2.359 ms (4 allocations: 1.10 MiB)

julia> @btime query_naive($records)
  413.015 ms (0 allocations: 0 bytes)

julia> @btime query_dataframe1($records)
  1.872 ms (38177 allocations: 2.54 MiB)
```
The trick here, besides the sorting, is to compute the sum in O(n) instead of O(n^2); the sorting phase itself is O(n log n), i.e. more than linear.
There is still some optimization to squeeze out of this approach (as evidenced by excessive allocation).
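To see the rearrangement concretely, here is a self-contained check on a plain vector (hypothetical numbers, not data from the thread):

```julia
v = [1.0, 4.0, 2.0, 8.0]        # hypothetical scores within one group

# O(n^2): sum of all positive pairwise differences
naive = sum(e for e in (v' .- v) if e > 0)

# O(n) after sorting: the k-th smallest of L values nets coefficient 2k - L - 1
s = sort(v)
L = length(s)
fast = sum((2k - L - 1) * s[k] for k in 1:L)

naive ≈ fast    # true (both 23.0)
```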
```julia
function query_dataframe2(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    grs = groupby(df, [:name, :term])
    for gr in grs
        # res += sum(e for e in (gr.score' .- gr.score) if e > 0)
        res += sum(abs(e) for e in (gr.score' .- gr.score)) / 2
    end
    return res
end
```
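The `abs` variant works because the difference matrix is antisymmetric: every positive entry has an equal negative mirror, so the absolute sum counts each positive difference exactly twice. A quick check on hypothetical scores:

```julia
v = [3.0, 1.0, 4.0]
D = v' .- v                     # antisymmetric: D[i, j] == -D[j, i]
sum(e for e in D if e > 0) ≈ sum(abs.(D)) / 2    # true (both 6.0)
```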
Could you explain the logic of the script, please? So I avoid the trouble of figuring it out on my own (assuming I can).
This is pretty slick, thank you. The real killer app here is the `IterTools.groupby()` function, and it just so happens that the Python stdlib `itertools` has the same thing. You have also done some optimization trickery in these lines

```julia
coeffs = range(-2*(L >> 1) + ((L+1) & 1); step=2, length=L)
return mapreduce(*, +, Iterators.map(i -> records[i].score, grp), coeffs)
```

that I need to spend some time thinking about, but I think the same order of complexity will be achievable in a more transparent way.
The “trickery” is to rearrange the sum and note the number of times each score appears with each sign.
Every score is paired with every other score, entering with + when it is the greater of the pair and with - when it is the smaller. How many pairs go each way depends on the score's rank in the sorted list. So the smallest score appears only with - signs (L-1 times). The next score appears with + against the smallest and with - against the rest, so one - cancels one + and the coefficient grows by 2, and so on up the ranks. Now that I've explained it, perhaps the `coeffs` calculation can be made simpler.
Finally, the `mapreduce` just multiplies `coeffs` by the scores and sums.
(this is for @rocco_sprmnt21 too).
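A quick numerical check of this rearrangement, on a hypothetical group of scores:

```julia
scores = sort([2.0, 5.0, 1.0, 7.0])          # one (name, term) group, ascending
L = length(scores)
coeffs = range(1 - L; step = 2, length = L)  # -3, -1, 1, 3
mapreduce(*, +, scores, coeffs) ≈
    sum(s1 - s2 for s1 in scores, s2 in scores if s1 > s2)    # true (both 21.0)
```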
Shorter implementation based on previous explanation:
```julia
function query_sortperm(records::Vector{Record})
    process_grp = grp ->
        mapreduce(*, +,
                  Iterators.map(i -> records[i].score, grp),
                  range(1 - length(grp); step=2, length=length(grp)))
    p = sortperm(records, by = record -> (record.name, record.term, record.score))
    return sum(Iterators.map(process_grp,
                             Itr.groupby(i -> hash((records[i].name, records[i].term)), p)))
end
```
For those who happen to be passing through here and don't really want to work hard:
```julia
# v1 > v2 > ... > v5
# v1-v1, v1-v2, v1-v3, v1-v4, v1-v5  =>  v1-v2, v1-v3, v1-v4, v1-v5
# v2-v1, v2-v2, v2-v3, v2-v4, v2-v5  =>  v2-v3, v2-v4, v2-v5
# v3-v1, v3-v2, v3-v3, v3-v4, v3-v5  =>  v3-v4, v3-v5
# v4-v1, v4-v2, v4-v3, v4-v4, v4-v5  =>  v4-v5
# v5-v1, v5-v2, v5-v3, v5-v4, v5-v5  =>
# ------------------------------------------------
# 4*v1 + (3-1)*v2 + (2-2)*v3 + (1-3)*v4 - 4*v5

coeffs = (L:-1:1) .- (1:L)

# v1 > v2 > ... > v5
# v1-v1, v1-v2, v1-v3, v1-v4, v1-v5  =>  v1-v1, v1-v2, v1-v3, v1-v4, v1-v5
# v2-v1, v2-v2, v2-v3, v2-v4, v2-v5  =>  v2-v2, v2-v3, v2-v4, v2-v5
# v3-v1, v3-v2, v3-v3, v3-v4, v3-v5  =>  v3-v3, v3-v4, v3-v5
# v4-v1, v4-v2, v4-v3, v4-v4, v4-v5  =>  v4-v4, v4-v5
# v5-v1, v5-v2, v5-v3, v5-v4, v5-v5  =>  v5-v5
# ------------------------------------------------
# (5-1)*v1 + (4-2)*v2 + (3-3)*v3 + (2-4)*v4 + (1-5)*v5
```
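Checking the diagram numerically, with hypothetical descending values:

```julia
v = [9.0, 6.0, 4.0, 2.0, 1.0]    # v1 > v2 > ... > v5
L = length(v)
coeffs = (L:-1:1) .- (1:L)       # 4, 2, 0, -2, -4
sum(coeffs .* v) ≈ sum(vi - vj for vi in v, vj in v if vi > vj)    # true (both 40.0)
```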
Thanks all. I'm marking this as solved by @Dan because `IterTools.groupby` is essentially the “right-sized” solution I was looking for.
That said, I ended up going with a dictionary-based approach, first constructing a dictionary that maps the tuple `(name, term)` to the corresponding records. This made the most sense in the context of my larger program: I was already maintaining a list of student names and terms, so it is easy to build this dictionary in O(n) time as (pardon my Python)
```python
grouped_records = {(name, term): [] for name, term in itertools.product(names, terms)}
for record in records:
    grouped_records[(record.name, record.term)].append(record)
```
Then you can do the “smaller” O(n^2) call on the grouped records:
```python
for group, group_records in grouped_records.items():
    for record0, record1 in itertools.product(group_records, repeat=2):
        process_pair(record0, record1, group)
```
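For comparison, a rough Julia equivalent of both steps (a sketch: `student_names`, `terms`, and `process_pair` are hypothetical stand-ins, and `Record` is the struct from earlier in the thread):

```julia
# O(n) construction: pre-create every (name, term) group, then fill in one pass
grouped_records = Dict((name, term) => Record[]
                       for name in student_names, term in terms)
for record in records
    push!(grouped_records[(record.name, record.term)], record)
end

# The "smaller" O(n^2) work, now confined to each group
for (group, group_records) in grouped_records
    for r0 in group_records, r1 in group_records
        process_pair(r0, r1, group)
    end
end
```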
(Unfortunately, my real problem wasn't amenable to Dan's coefficient trick: for each matching pair we have to instantiate an object and append it to a list.)
Although all the allocations here mean that this code is not all that optimized, my main goal was to reduce the order of complexity, and this accomplishes that while allowing for easy cache validation checks such as
```python
for group, group_records in grouped_records.items():
    # Only proceed if the cache is still valid
    if not all(
        map(
            lambda record: (record.name, record.term) == group,
            group_records
        )
    ):
        raise AssertionError("cache invalid")
        # or we could just throw a warning then `continue` to still allow
        # other groups to be processed

    # Do the processing
    for record0, record1 in itertools.product(group_records, repeat=2):
        process_pair(record0, record1, group)
```
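The validity check also carries over to Julia almost verbatim (a sketch, reusing the hypothetical `grouped_records` and `process_pair` from above):

```julia
for (group, group_records) in grouped_records
    # Only proceed if the cache is still valid
    all(r -> (r.name, r.term) == group, group_records) ||
        error("cache invalid")
    # Do the processing
    for r0 in group_records, r1 in group_records
        process_pair(r0, r1, group)
    end
end
```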
which are important in the dynamic environment. (I guess you could do this with the `groupby` approach, too, by checking that `records` is sorted correctly.)
Did you try [InMemoryDatasets.jl](https://github.com/sl-solution/InMemoryDatasets.jl), a multithreaded package for working with tabular data in Julia?
Yes, but does that run in Python?
I am being cheeky, but the purpose of my question was to learn about general algorithmic concepts rather than specific Julia packages (hence posting in Off-Topic, although it looks like a mod has now relocated it to General Usage).