Hi @Oscar_Smith and @nilshg,
Thank you for your replies. I updated the code according to your messages. The new code is below, and the example files needed to run it are in this repository: GitHub - bojusemo/Data---Increase-performance-in-string-comparison-with-sub-dataframes-or-sub-arrays- (the data posted in https://discourse.julialang.org/t/increase-performance-in-string-comparison-with-sub-dataframes-or-sub-arrays/64392).
@nilshg, I’ll use the ShortStrings.jl package you recommended. However, it looks like garbage collection is a problem triggered by two of my lines (described below). Can you please guide me on how to avoid the allocations in those lines?
The profiling tells me:
- When I use a small “a” file, the time, allocations, and memory used are 0.007732 seconds (5.00 k allocations: 3.648 MiB).
- These figures grow steeply with input size: with an “a” file 100 times larger, they become 0.445680 seconds (66.71 k allocations: 79.545 MiB), and my real problem deals with “a” files thousands of times larger than the example “a” file.
- The performance plot below was obtained by running the code with an “a” file 1,000 times larger. The bottlenecks (red) are lines 169 and 153:
  - Line 169 is `analysis_a(a_filter(potential_a, a_dataset, a_ID)) == 1 && return print(a_ID)`
  - Line 153 is `potential_a = a_dataset[ ( a_dataset.ID_5 .== a_ID ) , :]`
I really appreciate any help improving those two lines; a sketch of one idea I have been considering is below.
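One restructuring I have been considering (a minimal sketch, assuming the order in which the `ID_5` groups are visited does not matter): `groupby` builds the index once, and iterating a `GroupedDataFrame` yields `SubDataFrame`s, which are views, so the row copy in line 153 disappears entirely.

```julia
using DataFrames

# Hypothetical restructuring of analysis_group_a (a sketch, not my current code):
# each `potential_a` below is a SubDataFrame (a view), so no rows are copied per ID.
function analysis_group_a_grouped(a_dataset, b)
    for potential_a in groupby(a_dataset, :ID_5)
        analysis_a(potential_a, b) == 1 && return print(potential_a[1, :ID_5])
    end
end
```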
Thank you so much
Cheers
Boris
## Packages and Libraries
using CSV
using DataFrames
using Arrow
using Statistics
using DelimitedFiles
using DataFramesMeta
using StatsBase
using Query
using JLD
using BenchmarkTools
using Profile
##
## Read files
a = CSV.read(
    "C:\\Path\\File\\Location\\a.txt",
    DataFrame,
    header = 6,
    type = String,
    select = [1, 2, 3, 4],
    footerskip = 1,
    missingstrings = ["NA", "-"],
)
b = CSV.read(
    "C:\\Path\\File\\Location\\b.txt",
    DataFrame,
    missingstrings = ["NA", "-"],
)
c = CSV.read(
    "C:\\Path\\File\\Location\\c.txt",
    DataFrame,
    missingstrings = ["NA", "-"],
    select = [1, 2, 3],
    types = Dict(:ID_1 => String, :ID_2 => UInt8, :ID_3 => UInt32),
    delim = "\t",
    silencewarnings = true,
)
##
## Data preprocessing
# Remove missing ID_2s and ID_3s
c = c[completecases(c), :]
# Remove the X ID_2 (coded as 30) from b and c
b = b[b[:, :ID_2].!=30, :]
c = c[c[:, :ID_2].!=30, :]
# Merge a with c
a = innerjoin(a, c, on = :ID_1)
##
##
module a_module
## Packages and Libraries (only DataFrames is used inside this module)
using DataFrames
##
## Function to evaluate whether, for one ID_2, b's profile matches the potential a
function evaluate_ID_2_match(ID_2, potential_a, b)
    ## Define parameters
    minimum_profile_segment_length = 0.3 # Proportion of the ID_2 equivalent to the minimum length to match.
    ##
    # b's profile for the ID_2
    b_ID_2 = b[b[:, :ID_2].==ID_2, :ID_9]
    a_ID_2_ID_6 = potential_a[potential_a[:, :ID_2].==ID_2, :ID_6]
    a_ID_2_ID_7 = potential_a[potential_a[:, :ID_2].==ID_2, :ID_7]
    # Mismatch per ID_1 (mpm): true when b's ID_1 mismatches both of a's ID_1 columns.
    mpm = (b_ID_2 .!= a_ID_2_ID_6) .& (b_ID_2 .!= a_ID_2_ID_7)
    # Add a true at the end as a sentinel endpoint for nmr in case there is no other.
    push!(mpm, true)
    # Remove missings from mpm
    mpm = collect(skipmissing(mpm))
    # Length of b's profile for the ID_2
    len_b_ID_9::Int = length(mpm)
    # Last ID_1 from which to start counting nmr
    last_ID_1_nmr::Int = ceil(Int, len_b_ID_9 * (1 - minimum_profile_segment_length))
    # Minimum nmr for the a to match b's profile
    minimum_nmr = (len_b_ID_9 * minimum_profile_segment_length) + 1
    # Evaluate whether any nmr is greater than or equal to the minimum nmr, i.e. whether the a matches b's profile
    for ID_1::Int = 1:last_ID_1_nmr
        # Distance (inclusive) to the next mismatch; findnext on the full vector avoids allocating a slice.
        nmr::Int = findnext(mpm, ID_1) - ID_1 + 1
        if nmr >= minimum_nmr
            return 1
        end
    end
    return nothing # no matching segment was long enough
end
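##
## Worked example (hypothetical numbers, not from the real data): if, for one ID_2,
## mpm ends up as [false, false, false, true] (three matches plus the sentinel), then
## len_b_ID_9 = 4, last_ID_1_nmr = ceil(4 * 0.7) = 3, and minimum_nmr = 2.2; at
## ID_1 = 1, nmr = findnext(mpm, 1) - 1 + 1 = 4 >= 2.2, so the function returns 1.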
##
## Analysis of potential a
function analysis_a(a, b)
    mismatch_a::Int = 0
    match_a::Int = 0
    chr_correct_a_cutoff = 20
    chr_wrong_a_cutoff::Int = 29 - chr_correct_a_cutoff # Wrong-a cutoff: once chr_wrong_a_cutoff autosomes have mismatched, the remaining autosomes do not need to be compared because chr_correct_a_cutoff can no longer be exceeded
    result_a::Int = 0
    for ID_2::Int = 29:-1:1 # Start from the last ID_2; e.g. the first ID_2 to analyse is 29
        if isnothing(evaluate_ID_2_match(ID_2, a, b))
            mismatch_a += 1
            result_a = 0
            # Early exit once the wrong-a cutoff is reached
            mismatch_a >= chr_wrong_a_cutoff && break
        else
            match_a += 1
            if match_a >= chr_correct_a_cutoff
                result_a = 1
            end
        end
    end
    return result_a
end
##
##
# Filter a: select the rows of a_dataset for one a_ID (note: the potential_a argument is reassigned locally, so no pre-allocation actually takes place; each call allocates a new DataFrame, which is the line-153 bottleneck)
function a_filter(potential_a::DataFrames.AbstractDataFrame, a_dataset::DataFrames.AbstractDataFrame, a_ID::AbstractVector{String})
    potential_a = a_dataset[(a_dataset.ID_5 .== a_ID), :]
    return potential_a
end
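##
## Hypothetical alternative (a sketch, not used below): returning a view avoids
## copying the selected rows, which is where most of the allocation in line 153
## comes from; the Boolean mask still allocates one array per call. This assumes
## the caller only reads from the result.
function a_filter_view(a_dataset::DataFrames.AbstractDataFrame, a_ID::AbstractVector{String})
    return @view a_dataset[a_dataset.ID_5 .== a_ID, :]
end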
##
## Analysis of the group of potential a
function analysis_group_a(a_dataset, b)
    potential_a = DataFrame()
    unique_a_IDs = unique(a_dataset[:, :ID_5]) # computed once, not once per loop iteration
    number_of_potential_a::UInt32 = length(unique_a_IDs)
    for pot_a::UInt32 = 1:number_of_potential_a
        a_ID = unique_a_IDs[[pot_a]] # one-element vector holding the pot_a-th ID
        analysis_a(a_filter(potential_a, a_dataset, a_ID), b) == 1 && return print(a_ID)
    end
end
##
## End module
end
##
## Check the speed
using BenchmarkTools
using Profile
using Pkg
using ProfileView
@time a_module.analysis_group_a(a,b)
@benchmark a_module.analysis_group_a(a,b)
@profview a_module.analysis_group_a(a,b)
##
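## Hypothetical allocation check (assumes the a_filter_view sketch above is defined):
a_ID_example = unique(a[:, :ID_5])[[1]] # first ID, as a one-element vector
@btime a_module.a_filter(DataFrame(), $a, $a_ID_example) # copies the matching rows
@btime a_module.a_filter_view($a, $a_ID_example)         # returns a view instead
##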