Increase performance in string comparison with sub-dataframes or sub-arrays

Hi @Oscar_Smith and @nilshg,

Thank you for your replies. I updated the code according to your messages. The new code is below, and the example files needed to run it are in this GitHub repository: GitHub - bojusemo/Data---Increase-performance-in-string-comparison-with-sub-dataframes-or-sub-arrays-

@nilshg, I’ll use the ShortStrings.jl package you recommended. However, it looks like garbage collection is a problem generated in two of my lines (described below). Can you please guide me on how to avoid the garbage collection those lines produce?
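
Here is how I plan to apply it (a sketch only, assuming every ID fits in 15 bytes, i.e. ShortString15, and that the columns keep the names used in the code below; I still have to check this against the real data):

using ShortStrings
# ShortString15 values are stored inline rather than on the heap, so columns
# of them create no garbage-collector pressure. Construction throws an error
# if a string is longer than 15 bytes.
a.ID_5 = ShortString15.(a.ID_5)
b.ID_9 = ShortString15.(b.ID_9)  # and likewise for the other String ID columns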

The profiling tells me:

  1. When I use a small “a” file, the time, allocations, and memory used are 0.007732 seconds (5.00 k allocations: 3.648 MiB).
  2. The figures in point 1 grow steeply with input size. With an “a” file 100 times larger they are 0.445680 seconds (66.71 k allocations: 79.545 MiB), and my real problem has to deal with “a” files thousands of times larger than the one in my example.
  3. The performance plot below was obtained by running the code with an “a” file 1,000 times larger. The bottlenecks (red) are lines 169 and 153.
    3.1. Line 169 is analysis_a(a_filter(potential_a, a_dataset, a_ID)) == 1 && return print(a_ID).
    3.2. Line 153 is potential_a = a_dataset[ ( a_dataset.ID_5 .== a_ID ) , :] (see the sketch after this list).
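
My understanding is that line 153 allocates a fresh sub-DataFrame on every call, because indexing with a Boolean mask copies the selected rows. A view might avoid those copies; here is a minimal sketch of the idea, which I have not verified on the full data:

# Current line 153: copies every matching row into a new DataFrame per call
potential_a = a_dataset[ ( a_dataset.ID_5 .== a_ID ) , :]

# Possible alternative: a SubDataFrame view into the parent, no row copies
# (the Boolean mask itself still allocates once per call)
potential_a = @view a_dataset[ ( a_dataset.ID_5 .== a_ID ) , :]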

I really appreciate any help to improve those lines.

Thank you so much

Cheers

Boris

## Packages and Libraries
using CSV
using DataFrames
using Arrow
using Statistics
using DelimitedFiles
using DataFramesMeta
using StatsBase
using Query
using JLD
using BenchmarkTools
using Profile
##



## Read files
a = CSV.read(
  "C:\\Path\\File\\Location\\a.txt",
  DataFrame,
  header = 6,
  types = String,
  select = [1, 2, 3, 4],
  footerskip = 1,
  missingstring = ["NA", "-"],
)

b = CSV.read(
  "C:\\Path\\File\\Location\\b.txt",
  DataFrame,
  missingstring = ["NA", "-"],
)

c = CSV.read(
  "C:\\Path\\File\\Location\\c.txt",
  DataFrame,
  missingstring = ["NA", "-"],
  select = [1, 2, 3],
  types = Dict(:ID_1 => String, :ID_2 => UInt8, :ID_3 => UInt32),
  delim = "\t",
  silencewarnings = true,
)
##


## Data preprocessing
  # Remove rows with missing ID_2s or ID_3s (also narrows the column types)
c = dropmissing(c)
  # Remove the X ID_2 (coded as 30) from b and c
b = b[b.ID_2 .!= 30, :]
c = c[c.ID_2 .!= 30, :]
  # Merge a with c
a = innerjoin(a, c, on = :ID_1)
##




##
module a_module

## Packages and Libraries
using CSV
using DataFrames
using Arrow
using Statistics
using DelimitedFiles
using DataFramesMeta
using StatsBase
using Query
using JLD
using BenchmarkTools
using Profile
##

## Function to evaluate whether an ID_2 matches between the b's profile and the potential a
function evaluate_ID_2_match(ID_2, potential_a, b)

  ## Define parameters
  minimum_profile_segment_length = 0.3 # Proportion of the ID_2's length that sets the minimum segment length to match
  ##

  # b's profile for the ID_2
  b_ID_2 = b[b[:, :ID_2].==ID_2, :ID_9]
  a_ID_2_ID_6 = potential_a[potential_a[:, :ID_2].==ID_2, :ID_6]
  a_ID_2_ID_7 = potential_a[potential_a[:, :ID_2].==ID_2, :ID_7]

  # mismatch per ID_1 (mpm). It is true when there is a mismatch between the b's ID_1 and the a's ID_1.
  mpm = (b_ID_2 .!= a_ID_2_ID_6) .& (b_ID_2 .!= a_ID_2_ID_7)

  # Append a sentinel true at the end so there is always a mismatch endpoint for nmr
  push!(mpm, true)

  # Drop missings so mpm becomes a plain Vector{Bool}
  mpm = collect(skipmissing(mpm))

  # Length of the b's profile for the ID_2
  len_b_ID_9::Int = length(mpm)

  # Last ID_1 to start counting nmr
  last_ID_1_nmr::Int = ceil(Int, len_b_ID_9 * (1-minimum_profile_segment_length))

  # Minimum nmr to match the b's profile with the a
  minimum_nmr = (len_b_ID_9 * minimum_profile_segment_length) + 1

  # Evaluate whether any nmr is greater than or equal to the minimum nmr
  # required to consider the a a match for the b's profile
  for ID_1::UInt32 = 1:last_ID_1_nmr
    # Distance from ID_1 to the next mismatch; the sentinel pushed above
    # guarantees findnext finds one, and indexing mpm directly avoids
    # allocating a slice on every iteration
    nmr::Int = findnext(mpm, ID_1) - ID_1 + 1
    if nmr >= minimum_nmr
      return 1
    end
  end
end
##


## Analysis of potential a
function analysis_a(a, b)
  mismatch_a::Int = 0
  match_a::Int = 0
  chr_correct_a_cutoff::Int = 20
  # Wrong-a cutoff: once chr_wrong_a_cutoff ID_2s (autosomes) have mismatched,
  # the remaining ones cannot push match_a past chr_correct_a_cutoff, so the
  # rest do not need to be compared
  chr_wrong_a_cutoff::Int = 29 - chr_correct_a_cutoff
  result_a::Int = 0
  for ID_2 = 29:-1:1  # start from the last ID_2, i.e. the first analysed is 29
    if evaluate_ID_2_match(ID_2, a, b) === nothing
      mismatch_a += 1
      # Early exit once the cutoff is reached; result_a keeps its current value
      mismatch_a >= chr_wrong_a_cutoff && break
    else
      match_a += 1
      if match_a >= chr_correct_a_cutoff
        result_a = 1
      end
    end
  end
  return result_a
end

##


##
# Filter the rows of a_dataset belonging to one potential a.
# Note: reassigning the potential_a argument does not reuse its memory; the
# Boolean indexing below still allocates a new sub-DataFrame on every call.
function a_filter(potential_a::DataFrames.AbstractDataFrame, a_dataset::DataFrames.AbstractDataFrame, a_ID::AbstractVector{String})
  potential_a = a_dataset[ ( a_dataset.ID_5 .== a_ID ) , :]
  return potential_a
end
##


## Analysis of the group of potential a
function analysis_group_a(a_dataset, b)

  potential_a = DataFrame()

  # Compute the unique IDs once, outside the loop, so unique() does not run
  # on every iteration
  a_IDs = unique(a_dataset[:, :ID_5])

  for pot_a::UInt32 = 1:length(a_IDs)
    a_ID = a_IDs[[pot_a]]
    analysis_a(a_filter(potential_a, a_dataset, a_ID), b) == 1 && return print(a_ID)
  end
end
##
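

## Alternative under consideration (untested): group a_dataset by :ID_5 once,
# so each potential a becomes a pre-computed SubDataFrame view instead of a
# fresh Boolean filter over the whole table on every iteration. The name
# analysis_group_a_grouped is mine, for this sketch only.
function analysis_group_a_grouped(a_dataset, b)
  for sub in groupby(a_dataset, :ID_5)  # each sub is a view, no row copies
    analysis_a(sub, b) == 1 && return print(sub[1, :ID_5])
  end
end
##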


## End module
end
##


## Check the speed
using BenchmarkTools
using Profile
using Pkg
using ProfileView
@time a_module.analysis_group_a(a,b)
@benchmark a_module.analysis_group_a(a,b)
@profview a_module.analysis_group_a(a,b)
##
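

## Measure how much of the runtime is garbage collection (a sketch)
# @timev reports GC time and allocation counts in detail. Warm up first so
# compilation time is excluded from the measurement.
a_module.analysis_group_a(a,b)  # warm-up call
GC.gc()                         # start from a clean heap
@timev a_module.analysis_group_a(a,b)
##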