DataFrames - reduce allocations and improve speed

Hello folks,

I have two huge data frames (described below) and I’m running my_module.f5(df1, df2) based on the module and functions shown below. The code works with small data frames, but when I run it on those big data frames it produces a massive number of allocations, so much that even a powerful server kills the process, and the runtime grows dramatically. I think the reason is that I’m saving unnecessary data, especially in functions f3, f4, and f5, but I don’t know how to fix it. So, I would really appreciate any help reducing allocations and making the code faster.

Relevant discussion points:

  • Maybe it’s better to open the files and work on them directly instead of loading them into memory? If so, how can that be done?
  • I tried to replace f3, f4, and f5 with f6 and f7 using DataFrames.combine. f6 effectively applies f2 to each id1 of df1, but f7 didn’t work when I tried to apply f6 to each id2 of df2.
  • Is it possible to save those big data frames in some format to avoid reloading them every time the code is run?
  • How can I precompile the code so it can be run as a program from a Linux terminal?

Data frames description:

df1 has 3,000,000,000 (3 billion) rows and 8 columns, including:

  • id1: alphanumeric strings, >50,000 unique values
  • o: alphanumeric strings, >50,000 unique values (same levels as column o of df2)
  • i: integers from 1 to 50 (same levels as column i of df2)
  • m: a single upper-case alphabetic character
  • n: a single upper-case alphabetic character

df2 has 15,000,000,000 (15 billion) rows and 8 columns, including:

  • id2: alphanumeric strings, >250,000 unique values
  • o: alphanumeric strings, >50,000 unique values (same levels as column o of df1)
  • i: integers from 1 to 50 (same levels as column i of df1)
  • m: a single upper-case alphabetic character
  • n: a single upper-case alphabetic character

Functions explanation

f1: Extract, as vectors, the columns m and n of df1 and the column j of df2 after filtering by i. Then compare the elements of those vectors with a boolean operator and store the result (evm); count the number of consecutive matches (true) in a row (nmr; there can be more than one nmr, since runs are separated by each false). Finally, if any nmr >= threshold, stop and return 1.
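The nmr logic above can be sketched on a toy boolean vector (the names and values here are illustrative, not from the actual data):

```julia
# Toy illustration of counting "matches in a row" (nmr) in a boolean vector.
function any_run_at_least(evm::AbstractVector{Bool}, threshold::Integer)
    nmr = 0                       # length of the current run of `true`s
    for v in evm
        nmr = v ? nmr + 1 : 0     # a `false` resets the run
        nmr >= threshold && return 1
    end
    return nothing                # mirrors f1 returning nothing when no run is long enough
end

evm = [true, true, false, true, true, true, false]
any_run_at_least(evm, 3)  # the run of three `true`s triggers a 1
```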

f2: For every unique integer in column i, run f1; count the number of ones (1s) returned by f1 (count_match); if the number of mismatches (missing 1s) > (number of is − threshold), stop early. Finally, return count_match.

f3: Filter df2 to the rows where column id2 == unique_id2.

f4: Filter df1 to exclude the rows where id1 equals the current id2, so that f5 never compares an id against itself.

f5: Apply f2 to all combinations of id1 (column of df1) and id2 (column of df2).

f6: Apply f2 to each id1 of df1.

f7: Doesn’t work. Goal: apply f6 to each id2 of df2.


module my_module

using DataFrames

function f1(i::Int, sub_df1::SubDataFrame, sub_df2::AbstractDataFrame, parameter = 0.5)

  # Column j of sub_df2 for the rows where col_i == i
  # (the original used the global `df2` here by mistake)
  df2_ij = sub_df2[sub_df2.col_i .== i, :col_j]

  # Columns m and n of sub_df1 for the rows where col_i == i
  rows_i = sub_df1.col_i .== i
  df1_im = sub_df1[rows_i, :col_m]
  df1_in = sub_df1[rows_i, :col_n]

  # Element-wise mismatch test; drop missings before counting runs
  evm = collect(skipmissing((df2_ij .!= df1_im) .& (df2_ij .!= df1_in)))

  len = length(evm)
  threshold = len * parameter + 1

  # Count consecutive `true`s (nmr); return 1 as soon as a run reaches the threshold
  nmr = 0
  for v in evm
    nmr = v ? nmr + 1 : 0
    nmr >= threshold && return 1
  end
  return nothing
end

function f2(sub_df1::SubDataFrame, sub_df2::AbstractDataFrame, threshold = 5)
  count_mismatch = 0
  count_match = 0
  result = 0
  for i = 1:50
    if f1(i, sub_df1, sub_df2) === nothing
      count_mismatch += 1
      # Too many mismatches left to ever reach `threshold` matches
      # (the description says number of is − threshold, i.e. 50 − threshold)
      count_mismatch > 50 - threshold && break
    else
      count_match += 1
      if count_match >= threshold
        result = 1
        break
      end
    end
  end
  return (result, count_match)
end

function f3(df2::AbstractDataFrame, unique_id2)
  df2[df2.id2 .== unique_id2, :]
end

function f4(df1::AbstractDataFrame, unique_id2)
  df1[df1.id1 .!= unique_id2, :]
end

function f5(df1::AbstractDataFrame, df2::AbstractDataFrame)
  id2_list = unique(df2[!, :id2])
  Threads.@threads for d in id2_list
    dm = f3(df2, d)
    for ms in groupby(f4(df1, d), :id1)
      res = f2(ms, dm)
      if res[1] == 1
        println(d, "\t", ms.id1[1], "\t", res[2])
      end
    end
  end
end

function f6(df1::AbstractDataFrame, df2::AbstractDataFrame)
  combine(groupby(df1, :id1)) do ms
    res = f2(ms, df2)
    (result = res[1], count_match = res[2])
  end
end

# Non-working attempt; the goal is to apply f6 to each id2 of df2
function f7(df1::AbstractDataFrame, df2::AbstractDataFrame)
  combine(groupby(df2, :id2)) do dm
    f6(df1, dm)
  end
end

end # module

This question involves #Data. @bkamins, @pdeffebach, @nilshg: I mention you because I know you are experts in these questions.

Thank you so much in advance


My advice:

Don’t write functions that accept DataFrames as inputs. DataFrames are useful for storing vectors, but for performance-critical code, write functions that accept vectors, not DataFrames. The Pair syntax in the DataFrames transformation functions does this for you, as does @with from DataFramesMeta.jl.
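As a sketch of what that looks like (the column names and the helper function here are made up for illustration):

```julia
using DataFrames

# A function that operates on plain vectors, not on a DataFrame
count_matches(m, n) = count(m .== n)

df = DataFrame(id = ["a", "a", "b"], m = ['A', 'B', 'C'], n = ['A', 'B', 'D'])

# The Pair syntax (src => fun => dst) hands the raw column vectors to the
# function, so the hot loop never touches a DataFrame
res = combine(groupby(df, :id), [:m, :n] => count_matches => :matches)
# res has one row per id with the match count for that group
```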

If you really need to work with DataFrames directly, use @views to make sure you aren’t allocating too much.

  1. In f3 and f4, use @view to reduce allocations.
  2. Do not use Threads.@threads: running many operations in parallel makes the code faster, but at the cost of more memory being allocated at once (every parallel task allocates its own intermediate results at the same time).
  3. When reading in the data, use the narrowest column types that fit your data by specifying them explicitly (this often makes a huge difference).
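For point 1, a minimal sketch of the @view version of f3 (the same idea applies to f4; the tiny DataFrame is just for illustration):

```julia
using DataFrames

df2 = DataFrame(id2 = ["x", "x", "y"], v = 1:3)

f3_copy(df, id) = df[df.id2 .== id, :]         # allocates a new DataFrame
f3_view(df, id) = @view df[df.id2 .== id, :]   # SubDataFrame, shares memory with df

sub = f3_view(df2, "x")
# sub is a SubDataFrame: no copy of the selected rows is made
```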

Please let me know if these things helped.

Given that the proportion of unique IDs is relatively low, make sure to use PooledArray columns rather than Vector{String}. InlineStrings.jl can also help depending on their length.
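A quick sketch of both (assuming PooledArrays.jl and InlineStrings.jl are installed; the values are made up):

```julia
using PooledArrays, InlineStrings

# 2_000 elements but only 2 unique levels: stored as small integer codes
# referencing a pool of 2 strings, instead of 2_000 separate String objects
ids = PooledArray(repeat(["id_a", "id_b"], 1_000))

# Fixed-width string stored inline (no heap allocation, invisible to GC sweeps)
s = String15("ABCDEFG")
```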

Do InlineStrings hurt for some lengths? I thought from a GC perspective they would be unequivocally better as they don’t have to be heap tracked and therefore don’t cause the massive slowdowns in GC sweeps that regular Strings do?

InlineStrings should be faster as they avoid slowing the GC, but copying large inline strings is going to be slower than setting a single pointer to a String so maybe at some point they become slower. But I haven’t done any benchmarking, so…
