Group by criteria

rocco_sprmnt21 · March 17, 2023, 2:10pm

How would you solve the following problem?
Given the following array of strings we want to find all groups whose group leader starts with “AT”.
All the elements that are found before the first one starting with “AT” form a group by themselves.

julia> itr=[randstring("ACTG") for _ in 1:20]
20-element Vector{String}:
 "CTTCCGAG"
 "CCCGTGGT"
 "TCAAGGGT"
 "ATTAGATC"
 "TCTTACAC"
 "TTTCCGCC"
 "TCCGACCG"
 "GTCAGCTA"
 "CATGTTGC"
 "GAGGAACG"
 "GTCAATGC"
 "TACTCATT"
 "ATACTCTA"
 "AATTCACA"
 "AATCATAT"
 "GTATACCT"
 "ATTTTACT"
 "TTCAGAAG"
 "GTTGATGA"
 "GACGGCGG"

julia> mygroupby(itr, criteria)
4-element Vector{Tuple{Vararg{String}}}:
 ("CTTCCGAG", "CCCGTGGT", "TCAAGGGT")
 ("ATTAGATC", "TCTTACAC", "TTTCCGCC", "TCCGACCG", "GTCAGCTA", "CATGTTGC", "GAGGAACG", "GTCAATGC", "TACTCATT")
 ("ATACTCTA", "AATTCACA", "AATCATAT", "GTATACCT")
 ("ATTTTACT", "TTCAGAAG", "GTTGATGA")

nilshg · March 17, 2023, 2:22pm

Something like the answers here:

with findall(x -> startswith(x, "AT"), itr) to find the split points?

Oscar_Smith · March 17, 2023, 2:27pm

Note that if you’re working with DNA/RNA and care about performance you might want to use BioSequences.jl which can be faster than using Strings to store genes.

Dan · March 17, 2023, 2:55pm

using IterTools

groupby(let c=0; x->c+=startswith(x,"AT") end, itr)

gives an iterator returning a group each time:

julia> groupby(let c=0; x->c+=startswith(x,"AT") end, itr) |> collect
4-element Vector{Vector{String}}:
 ["CTTCCGAG", "CCCGTGGT", "TCAAGGGT"]
 ["ATTAGATC", "TCTTACAC", "TTTCCGCC", "TCCGACCG", "GTCAGCTA", "CATGTTGC", "GAGGAACG", "GTCAATGC", "TACTCATT"]
 ["ATACTCTA", "AATTCACA", "AATCATAT", "GTATACCT"]
 ["ATTTTACT", "TTCAGAAG", "GTTGATGA", "GACGGCGG"]

rocco_sprmnt21 · March 17, 2023, 5:50pm

It’s not as good as @Dan’s, but I’m proposing this one too

reduce((s,c)->startswith(c,"AC") ? push!(s,[c]) :  [s[1:end-1];[push!(last(s),c)]],itr, init=[String[]])

rafael.guerra · March 17, 2023, 9:25pm

This seems to perform well:

function group2(itr, str)
    ix = [1; findall(x -> startswith(x, str), itr); length(itr)+1]
    return [view(itr, ix[i]:(ix[i+1]-1)) for i in 1:(length(ix)-1)]
end

rocco_sprmnt21 · March 18, 2023, 1:56pm

this should give the same result

groupby(let c=true; x->c=startswith(x,"CT") ? !c : c  end, itr)|>collect

PS
if group(by,itr) had the analogous form of reduce, for example, with a two-variable function and an init, it would be more flexible.

Dan · March 18, 2023, 3:40pm

Same idea as last one, with some unicode xoring:

groupby(let c=true; x->c ⊻= startswith(x,"AT") end, itr)|>collect

rocco_sprmnt21 · March 18, 2023, 6:06pm

a variant

reduce((d,c)->startswith(c,"AC") ? (d[1]+=1;d[2][d[1]]=[c];d) :  (push!(get!(()->[],d[2],d[1]),c);d),itr, init=[0,Dict{Int, Vector{String}}()])

pdeffebach · March 18, 2023, 6:12pm

This answer seems pretty convoluted. Is it really helping answer the question? Julia obviously has a lot of ways to solve every problem, but I don’t think its useful to outline every single one.

rocco_sprmnt21 · March 18, 2023, 6:20pm

Yes. I know that the ones I wrote are convoluted and not very useful for the purpose of the problem.
But they are “useful” exercises… at least

rocco_sprmnt21 · March 18, 2023, 8:37pm

Now supposing we have the following list of strings, we want to group them (scrolling them from first to last) in such a way that the strings with the “not too different root” are in the same group.

20-element Vector{String}:
 "GCCCCACA"
 "TCCATTTT"
 "TGCCCCGT"
 "GGGTTAGT"
 "GATTGAAG"
 "CCCGTAAT"
 "GGACCACT"
 "GGCGACAC"
 "GCTCTGTA"
 "AGTCCTGC"
 "TCATTACC"
 "GAAGTAGT"
 "AACTCGAA"
 "ATGACCCT"
 "TATTATCG"
 "CCTTGTCA"
 "TGCTCGCC"
 "CCGCATGG"
 "CTCTTTTG"
 "TTATGGAC"

more precisely with dist(n1,n2)>0 where

dist(n1,n2)= length(intersect(first(n1,3),first(n2,3)))

something like this should come

"GCC" => ["GCCCCACA", "TCCATTTT", "TGCCCCGT", "GGGTTAGT", "GATTGAAG", "CCCGTAAT", "GGACCACT", "GGCGACAC", "GCTCTGTA", "AGTCCTGC", "TCATTACC", "GAAGTAGT", "AACTCGAA", "ATGACCCT"], 
"TAT" => ["TATTATCG", "CCTTGTCA", "TGCTCGCC"], 
"CCG" => ["CCGCATGG", "CTCTTTTG"], 
"TTA" => ["TTATGGAC"]

pdeffebach · March 18, 2023, 8:45pm

This problem is poorly specified. What about ties? What about transitive differences?

rocco_sprmnt21 · March 18, 2023, 8:51pm

If I understand the meaning of the observation, I specify better what I have already written
The groups are made up of consecutive elements of the list, similar to what the group function of itertools does

# Group consecutive values that share the same result of applying f.

pdeffebach · March 18, 2023, 9:02pm

The following should work

julia> using SplitApplyCombine;

julia> itr = [randstring("ACTG") for _ in 1:20];

julia> dist(n1, n2)= length(intersect(first(n1, 3), first(n2, 3)));

julia> diffs = similar(itr, Bool);

julia> for i in eachindex(itr)
           if i == firstindex(itr)
               diffs[i] = false
           else
               diffs[i] = dist(diffs[i-1], diffs[i]) > 0
           end
       end;

julia> diffs_groups = cumsum(diffs);

julia> group(diffs_goups, itr)
12-element Dictionaries.Dictionary{Int64, Vector{String}}
  0 │ ["GCGGGCCG"]
  1 │ ["ATGTCTGT", "TGTTATCA"]
  2 │ ["GCAAGAAC"]
  3 │ ["AGTTCCTA"]
  4 │ ["GGCGTACC", "GCTATGAT"]
  5 │ ["GACTCACG", "CAATGGTG"]
  6 │ ["AGAGCCTG"]
  7 │ ["CACGGGGA", "CATGATTG", "TTCTCAAG", "GAGTCTAT"]
  8 │ ["CCATGATA", "TCGCCGTA"]
  9 │ ["GGCATATT"]
 10 │ ["CTCTAACT", "CGTACGCG"]
 11 │ ["CTCGCTGT"]

rocco_sprmnt21 · March 18, 2023, 10:23pm

in this case, the whole list is one group.
For each string, in the first 3 letters there is at least one of the first 3 of the list leader

1-element Dictionary{String, Vector{String}}
 "GCG" │ ["GCGGGCCG", "ATGTCTGT", "TGTTATCA", "GCAAGAAC", "AGTTCCTA", "GGCGTACC…

another example to better clarify the goal

julia> itr=[randstring("ACTG") for _ in 1:20]
20-element Vector{String}:
 "GCTATAAA"
 "CCTTCGGG"
 "TACGCGAG"
 "AAAAACGA"
 "ACCCGCTT"
 "GTGCGATC"
 "ATCTCGCG"
 "GGATATCA"
 "AACATATG"
 "GGTGCAGC"
 "GTTATGGC"
 "CGAGCTAA"
 "CCCTGCCA"
 "AAGACATC"
 "AGTATCCA"
 "CGATCGGT"
 "AGCCACAC"
 "TTCTATCA"
 "GCTTAGTC"
 "AGCATACA"

julia> 

julia> ...
8-element Vector{Vector{String}}:
 ["GCTATAAA", "CCTTCGGG", "TACGCGAG"]
 ["AAAAACGA", "ACCCGCTT"]
 ["GTGCGATC", "ATCTCGCG", "GGATATCA"]
 ["AACATATG"]
 ["GGTGCAGC", "GTTATGGC", "CGAGCTAA"]
 ["CCCTGCCA"]
 ["AAGACATC", "AGTATCCA", "CGATCGGT", "AGCCACAC"]
 ["TTCTATCA", "GCTTAGTC", "AGCATACA"]

from the following it is better seen that each group leader has no letters in common with the previous group leader

8-element Dictionary{String, Vector{String}}
 "GCT" │ ["GCTATAAA", "CCTTCGGG", "TACGCGAG"]
 "AAA" │ ["AAAAACGA", "ACCCGCTT"]
 "GTG" │ ["GTGCGATC", "ATCTCGCG", "GGATATCA"]
 "AAC" │ ["AACATATG"]
 "GGT" │ ["GGTGCAGC", "GTTATGGC", "CGAGCTAA"]
 "CCC" │ ["CCCTGCCA"]
 "AAG" │ ["AAGACATC", "AGTATCCA", "CGATCGGT", "AGCCACAC"]
 "TTC" │ ["TTCTATCA", "GCTTAGTC", "AGCATACA"]

Dan · March 18, 2023, 11:40pm

julia> using IterTools

julia> groupby(let c=""; x->c = any(∈(c),first(x,3)) ? c : first(x,3) end, itr)|>collect
8-element Vector{Vector{String}}:
 ["GCTATAAA", "CCTTCGGG", "TACGCGAG"]
 ["AAAAACGA", "ACCCGCTT"]
 ["GTGCGATC", "ATCTCGCG", "GGATATCA"]
 ["AACATATG"]
 ["GGTGCAGC", "GTTATGGC", "CGAGCTAA"]
 ["CCCTGCCA"]
 ["AAGACATC", "AGTATCCA", "CGATCGGT", "AGCCACAC"]
 ["TTCTATCA", "GCTTAGTC", "AGCATACA"]

This works. What is the motivation behind these exercises?

rocco_sprmnt21 · March 19, 2023, 8:10am

I had a vague memory of this that I have now found, and I wanted to check if the use you showed of groupby() responded to some of the examples proposed there.

this last example is because I wanted to find a case where one of those convoluted propositions with the use of reducers and dictionaries would be useful, thinking that groupby can’t deal with such cases

reduce((d,c)->dist(c,d[1])<1 ? (d[1]=first(c,3);insert!(d[2], d[1], [c]);d) :  (push!(get!(()->[],d[2],d[1]),c);d),itr, init=["",Dictionary{String, Vector{String}}()])[2]

but then, thinking about it better, I found the way (similar to what you proposed) to use groupby also in this case



dist(n1,n2)= length(intersect(first(n1,3),first(n2,3)))
groupby(let c=first(itr[1],3); x->dist(x,c)==0 ? c=first(x,3) : c end, itr)|>collect


# or



dist1(n1,n2)=isempty(intersect(first(n1,3),first(n2,3))) ? first(n2,3) : missing

groupby(let c=""; x->c=coalesce(dist1(c,x),c) end, itr)|>collect

Topic		Replies	Views
Grouping vector of strings by left - right context General Usage question	0	218	June 14, 2021
Various by-group strategies compared Data	36	3947	January 30, 2018
What's the best way to separate elements of a vector based on its values? General Usage question , vector	6	433	June 22, 2022
Grouping by Split, apply, combine General Usage question , package , tuple , splitapplycombine	9	495	July 19, 2022
Efficient way to split string at specific index Performance question , strings	19	2417	June 24, 2022

Group by criteria

Related topics