Lighter alternatives to a dataframe when you just want to run a few queries on a vector of data structs

julia> function query_dataframe1(records::Vector{Record})
           df = DataFrame(records)
           res = 0.0
           grs = groupby(df, [:name, :term])
           for gr in grs
               res += sum(e for e in (gr.score' .- gr.score) if e > 0)
           end
           return res
       end
query_dataframe1 (generic function with 1 method)

julia> @btime query_dataframe1(r)
  2.417 ms (38181 allocations: 2.54 MiB)
5418.554731103728

julia> @btime query_sort(r)
  2.294 ms (4 allocations: 1.10 MiB)
5418.554731103722

Yet another illustration of convenient and efficient table manipulation simply using arrays, without any fancy specialized types:

  1. Direct translation of query_dataframe1:
using StructArrays
using DataManipulation

query_group(records) = @p let
	records
	convert(StructArray, __)
	group((_.name, _.term))
	sum() do gr
		sum(s1 - s2 for (s1, s2) in Iterators.product(gr.score, gr.score) if s1 > s2; init=0.0)
	end
end

@btime query_group(r)
#  796.542 μs (51 allocations: 1.88 MiB)

# and if you already work with columnar tables and r is a StructArray:
@btime query_group($(StructArray(r)))
#    716.573 μs (42 allocations: 1.33 MiB)
  2. The most straightforward solution, putting all conditions into the join: slower, but still pretty good, and much better than naive looping. I guess FlexiJoins could use some optimization for complex join conditions…
using FlexiJoins

query_join(records) = @p let
	records
	innerjoin((__, __), by_key((:name, :term)) & by_pred(:score, >, :score))
	sum(_[1].score - _[2].score)
end

@btime query_join(r)
#   8.426 ms (27 allocations: 794.38 KiB)

Another variation (which doesn’t depend on many supporting packages):

import IterTools as Itr

function process_group(records, grp)
    L = length(grp)
    coeffs = range(-2*(L >> 1) + ((L+1) & 1); step=2, length=L)
    return mapreduce(*, +, Iterators.map(i -> records[i].score, grp), coeffs)
end

function query_sortperm(records::Vector{Record})
    p = sortperm(records, by = record -> (record.name, record.term, record.score))
    return sum(Iterators.map(grp -> process_group(records, grp), Itr.groupby(i -> hash((records[i].name, records[i].term)), p)))
end

The benchmarks:

julia> @btime query_sortperm($records)
  3.559 ms (10028 allocations: 579.58 KiB)

julia> @btime query_sort($records)
  2.359 ms (4 allocations: 1.10 MiB)

julia> @btime query_naive($records)
  413.015 ms (0 allocations: 0 bytes)

julia> @btime query_dataframe1($records)
  1.872 ms (38177 allocations: 2.54 MiB)

The trick here is also to use an O(n) calculation of the sum instead of O(n^2), after the sorting phase (which is O(n log n), so more than linear).
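To illustrate the rearrangement with a self-contained toy (the data and function names here are made up, not from the benchmark above):

s = sort(rand(100))   # one group's scores, ascending

# O(L^2): sum of all positive pairwise differences
quadratic(s) = sum(a - b for a in s, b in s if a > b)

# O(L): for ascending-sorted s, the i-th score enters with + against the
# i-1 smaller scores and with - against the L-i larger ones, so its net
# coefficient is (i-1) - (L-i) = 2i - 1 - L
linear(s) = sum(s[i] * (2i - 1 - length(s)) for i in eachindex(s))

quadratic(s) ≈ linear(s)   # true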

There is still some optimization to squeeze out of this approach (as evidenced by the excessive allocations).

function query_dataframe2(records::Vector{Record})
    df = DataFrame(records)
    res = 0.0
    grs = groupby(df, [:name, :term])
    for gr in grs
        # res += sum(e for e in (gr.score' .- gr.score) if e > 0)
        # equivalent: the positive differences are half the absolute differences,
        # since each pair contributes once with each sign
        res += sum(abs(e) for e in (gr.score' .- gr.score)) / 2
    end
    return res
end

Could you explain the logic of the script, please?
So that I can avoid the trouble of figuring it out on my own (assuming I even can) :smiley:

This is pretty slick, thank you. The real killer app here is the IterTools.groupby() function, and it just so happens that Python stdlib itertools has the same thing. You have also done some optimization trickery in these lines

    coeffs = range(-2*(L >> 1) + ((L+1) & 1); step=2, length=L)
    return mapreduce(*, +, Iterators.map(i -> records[i].score, grp), coeffs)

that I need to spend some time thinking about, but I think the same order of complexity will be achievable in a more transparent way :joy:.

The “trickery” is to rearrange the sum and count how many times each score appears with each sign.
Every score is paired with all other scores, entering with + when it is the greater one and with - when it is the smaller. How many scores are greater or smaller is determined by its rank in the sorted score list. So the smallest score appears only with - signs (L-1 times, to be exact). The next score appears with + against the smallest score and with - against the rest, so one - cancels one + and the coefficient grows by 2. Now that I’ve explained it, perhaps the coeffs calculation can be made simpler.
Finally, the mapreduce just multiplies coeffs by the scores and sums.
(This is for @rocco_sprmnt21 too.)
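To make that concrete, a small check (hypothetical scores, sorted ascending; with L = 5 the coefficients are -4, -2, 0, 2, 4):

scores = [1.0, 1.5, 3.0, 4.0, 9.0]                 # sorted ascending, L = 5
coeffs = range(1 - 5; step=2, length=5)            # -4, -2, 0, 2, 4
sum(scores .* coeffs)                              # 37.0
sum(a - b for a in scores, b in scores if a > b)   # 37.0, the same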


Shorter implementation based on previous explanation:

function query_sortperm(records::Vector{Record})
    process_grp = grp -> 
      mapreduce(*, +, 
        Iterators.map(i -> records[i].score, grp), 
        range(1-length(grp); step=2, length=length(grp)))
    p = sortperm(records, by = record -> (record.name, record.term, record.score))
    return sum(Iterators.map(process_grp, 
      Itr.groupby(i -> hash((records[i].name, records[i].term)), p)))
end
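As a quick sanity check that the two versions agree, the bit-twiddled start value from the longer version above reduces to the same 1 - L used here:

# -2*(L >> 1) + ((L + 1) & 1) is just 1 - L written with bit tricks:
all(-2*(L >> 1) + ((L + 1) & 1) == 1 - L for L in 1:100)   # true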

For those who happen to be passing through here and don’t really want to work it out themselves:

#v1>v2>....>v5
#v1-v1,v1-v2,v1-v3, v1-v4,v1-v5 =>  v1 - v2, v1 - v3, v1 - v4, v1 - v5
#v2-v1,v2-v2,v2-v3, v2-v4,v2-v5 =>           v2 - v3, v2 - v4, v2 - v5
#v3-v1,v3-v2,v3-v3, v3-v4,v3-v5 =>                    v3 - v4, v3 - v5
#v4-v1,v4-v2,v4-v3, v4-v4,v4-v5 =>                             v4 - v5
#v5-v1,v5-v2,v5-v3, v5-v4,v5-v5     ----------------------------------
#                                  4*v1+(3-1)*v2+(2-2)*v3+(1-3)*v4-4*v5

coeffs=(L:-1:1) .- (1:L)
#v1>v2>....>v5
#v1-v1,v1-v2,v1-v3,v1-v4,v1-v5 =>  v1-v1, v1 - v2, v1 - v3, v1 - v4, v1 - v5
#v2-v1,v2-v2,v2-v3,v2-v4,v2-v5 =>         v2 - v2, v2 - v3, v2 - v4, v2 - v5
#v3-v1,v3-v2,v3-v3,v3-v4,v3-v5 =>                  v3 - v3, v3 - v4, v3 - v5
#v4-v1,v4-v2,v4-v3,v4-v4,v4-v5 =>                           v4 - v4, v4 - v5
#v5-v1,v5-v2,v5-v3,v5-v4,v5-v5 =>                                    v5 - v5    
#                                 ------------------------------------------------
#                                (5-1)*v1+(4-2)*v2+(3-3)*v3+(2-4)*v4+(1-5)*v5


Thanks all. I’m marking this as solved by @Dan because IterTools.groupby is essentially the “right-sized” solution I was looking for.

That said, I ended up going for a dictionary-based approach, first constructing a dictionary that maps the tuple (name, term) to the corresponding records. This made the most sense in the context of my larger program, because I was already maintaining a list of student names and terms, so it is easy to build this dictionary in O(n) time as (pardon my Python)

import itertools

grouped_records = {(name, term): [] for name, term in itertools.product(names, terms)}

for record in records:
    grouped_records[(record.name, record.term)].append(record)

Then you can do the “smaller” O(n^2) pass over the grouped records:

for group, group_records in grouped_records.items():
    for record0, record1 in itertools.product(group_records, repeat=2):
        process_pair(record0, record1, group)

(Unfortunately, my real problem wasn’t amenable to Dan’s coefficient trick: for each matching pair we have to instantiate an object and append it to a list.)
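For the Julia side of the thread, a minimal sketch of the same dictionary-based grouping (the key type is left abstract here, and process_pair is a stand-in for whatever per-pair work is needed, not a function from the thread):

function query_dict(records::Vector{Record}, process_pair)
    # O(n): map each (name, term) key to its records
    grouped = Dict{Tuple, Vector{Record}}()   # key type left abstract in this sketch
    for record in records
        push!(get!(Vector{Record}, grouped, (record.name, record.term)), record)
    end
    # the “smaller” O(k^2) pass within each group
    for (group, group_records) in grouped
        for record0 in group_records, record1 in group_records
            process_pair(record0, record1, group)
        end
    end
end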

Although all the allocations here mean that this code is not all that optimized, my main goal was to reduce the order of complexity, and this accomplishes that while allowing for easy cache validation checks such as

for group, group_records in grouped_records.items():
    # Only proceed if the cache is still valid
    if not all(
        map(
            lambda record: (record.name, record.term) == group,
            group_records,
        )
    ):
        raise AssertionError("cache invalid")
        # or we could just throw a warning then `continue` to still allow
        # other groups to be processed

    # Do the processing
    for record0, record1 in itertools.product(group_records, repeat=2):
        process_pair(record0, record1, group)

which are important in a dynamic environment. (I guess you could do this with the groupby approach, too, by checking that the records are sorted correctly.)
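In Julia, that check for the sortperm/groupby approach could be a one-liner (a sketch, with p and records as in query_sortperm above):

# before reusing the cached permutation p, verify it is still sorted by the grouping key
issorted(p; by = i -> (records[i].name, records[i].term, records[i].score)) ||
    error("cache invalid")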


Did you try InMemoryDatasets.jl (https://github.com/sl-solution/InMemoryDatasets.jl), a multithreaded package for working with tabular data in Julia?

Yes, but does that run in Python? :wink:

I am being cheeky, but the purpose of my question was to learn about general algorithmic concepts rather than specific Julia packages (hence posting in Off-Topic—although it looks like a mod has now relocated it to General Usage).