Hello folks,
I have two huge data frames (described below) and I’m running my_module.f5(df1, df2)
based on the module and functions described below. The code works with small data frames, but when I run it with those big data frames, a massive amount of allocations is produced making a powerful server stops the process, and the time increases exponentially. I think the reason is I’m saving unnecessary data, especially in functions f3
, f4
, and f5
, but I don’t know how to solve it. So, I would really appreciate it if you help me to reduce allocations and make the code faster.
Relevant discussion points:
- Maybe it’s better to open the files and work on them instead of loading them? If so, how it can be done?
- I tried to replace
f3
,f4
, andf5
withf6
andf7
by usingDataFrames.combine
.f6
effectively applyf2
to eachid1
ofdf1
, butf7
didn’t work when I tried to applyf6
in eachid2
ofdf2
. - Is it possible to save those big data frames to avoid loading them every time the code is run?
- How to precompile the code to run it as a program in the terminal of Linux?
Data frames description:
df1
has 3,000,000,000 (3 billion) rows and 8 columns including:
id1
: >50,000 alphanumeric stringo
: >50,000 alphanumeric string (same levels than columno
ofdf2
)i
: integers from 1 to 50 (same levels than columno
ofdf2
)m
: A single upper case alphabetic charactern
: A single upper case alphabetic character
df2
has 15,000,000,000 (15 billion) rows and 8 columns.
id2
: >250,000 alphanumeric stringo
: >50,000 alphanumeric string (same levels than columno
ofdf1
)i
: integers from 1 to 50 (same levels than columno
ofdf1
)m
: A single upper case alphabetic charactern
: A single upper case alphabetic character
Functions explanation
f1
: Extract, as vectors, the columns m
and n
of df1
and the column j
of df2
after filtering by i
. Then, compare the elements of those vectors that match a boolean operator (evm
); count the number of those matches (true
) in a row (nmr
; there could be more than one nmr
as they are separated by each false
). Finally, If any nmr >= threshold
, then stops and returns 1.
f2
: For every unique integer in column i
run f1
; count the number of ones (1s) returned by f1
(count_match
); if the number of mismatches (missing 1s) > (number of i
s - threshold), then stops. Finally, returns count_match
.
f3
: Filter df2
by column id2 = unique_id2
f4
: df1
cannot be the same id2
in f5
f5
: Apply f2
to all combinations of id1
(column of df1
) and id2
(column of df2
)
f6
: Apply f2
to each id1
of df1
f7
: Doesn’t work. Goal: to apply f6
in each id2
of df2
.
Module:
module my_module using DataFrames, Chain function f1(i::Int, sub_df1::SubDataFrame, sub_df2::DataFrame, parameter = 0.5) # Column j of df1 when column col_i==i df2_ij = df2[df2.col_i .== i, :col_j] # Columns m and n of df2 when column i==i df1_im = sub_df1[sub_df1.col_i.== i, :col_m] df1_in = sub_df1[sub_df1.col_i.== i, :col_n] evm= Vector{Bool} evm= (df2_ij .! = df1_im) .& (df2_ij .! = df1_in) # Add one true at the beginning and another at the end of evm. evm= pushfirst!(evm, true) evm= push!(evm, true) # Remove missing from evm evm= evm[findall(!ismissing, evm)] # Length of evm len::Int = length(evm) # Last element to start counting nmr last_element_nmr::Int = ceil(Int, len * (1-parameter)) # threshold threshold = (len * parameter) + 1 # Is there any nmr >= threshold ? for element::Int = 2:last_element_nmr if evm[element-1] == true nmr::Int = findnext(evm[element:length(evm)], true) nmr >= threshold && return 1 nmr >= threshold && break nmr = 0 end end end function f2(sub_df1::SubDataFrame, sub_df2::DataFrame, threshold = 5) count_mismatch::Int = 0 count_match::Int = 0 result = Int for i::Int = 1:50 if f1(i, sub_df1, sub_df2) == nothing mismatch_i = 1 count_mismatch += mismatch_i count_mismatch > 10 - threshold && break else match_i = 1 count_match += match_i if count_match >= threshold result = 1 end end end return(result, count_match) end function f3(df2::DataFrame, unique_id2) df2[df2.id2 .== unique_id2, :] end function f4(df1::DataFrame, unique_id2) df1[df1.id1 .!= unique_id2, :] end function f5(df1::AbstractDataFrame, df2::AbstractDataFrame) id2_list = unique(df2[!, :id2]) Threads.@threads for d in id2_list dm = f3(df2, d) id1_list = f4(df1, d) @chain id1_list begin groupby(:id1) for ms in _ if f2(ms, dm)[1] == 1 println(d, "\t", ms.id1[1], "\t", f2(ms, dm)[2]) end end end end end function f6(df1::AbstractDataFrame, df2::AbstractDataFrame) @chain df1 begin groupby(:id1) combine(_) do ms f2(ms, df2) end end end function f7(df1::AbstractDataFrame, df2::AbstractDataFrame) @chain df2 begin groupby(:id2) combine(_) do dm f6(df1, dm) end end end end
This question involves data . @bkamins, @pdeffebach @nilshg I mention you because I know you are experts in these questions.
Thank you so much in advance
Boris