Group by criteria

How would you solve the following problem?
Given the following array of strings we want to find all groups whose group leader starts with “AT”.
All the elements that are found before the first one starting with “AT” form a group by themselves.

julia> itr=[randstring("ACTG") for _ in 1:20]
20-element Vector{String}:
 "CTTCCGAG"
 "CCCGTGGT"
 "TCAAGGGT"
 "ATTAGATC"
 "TCTTACAC"
 "TTTCCGCC"
 "TCCGACCG"
 "GTCAGCTA"
 "CATGTTGC"
 "GAGGAACG"
 "GTCAATGC"
 "TACTCATT"
 "ATACTCTA"
 "AATTCACA"
 "AATCATAT"
 "GTATACCT"
 "ATTTTACT"
 "TTCAGAAG"
 "GTTGATGA"
 "GACGGCGG"

julia> mygroupby(itr, criteria)
4-element Vector{Tuple{Vararg{String}}}:
 ("CTTCCGAG", "CCCGTGGT", "TCAAGGGT")
 ("ATTAGATC", "TCTTACAC", "TTTCCGCC", "TCCGACCG", "GTCAGCTA", "CATGTTGC", "GAGGAACG", "GTCAATGC", "TACTCATT")
 ("ATACTCTA", "AATTCACA", "AATCATAT", "GTATACCT")
 ("ATTTTACT", "TTCAGAAG", "GTTGATGA")

Something like the answers here:

with findall(x -> startswith(x, "AT"), itr) to find the split points?

1 Like

Note that if you’re working with DNA/RNA and care about performance you might want to use BioSequences.jl which can be faster than using Strings to store genes.

using IterTools

groupby(let c=0; x->c+=startswith(x,"AT") end, itr)

gives an iterator returning a group each time:

julia> groupby(let c=0; x->c+=startswith(x,"AT") end, itr) |> collect
4-element Vector{Vector{String}}:
 ["CTTCCGAG", "CCCGTGGT", "TCAAGGGT"]
 ["ATTAGATC", "TCTTACAC", "TTTCCGCC", "TCCGACCG", "GTCAGCTA", "CATGTTGC", "GAGGAACG", "GTCAATGC", "TACTCATT"]
 ["ATACTCTA", "AATTCACA", "AATCATAT", "GTATACCT"]
 ["ATTTTACT", "TTCAGAAG", "GTTGATGA", "GACGGCGG"]


3 Likes

It’s not as good as @Dan’s, but I’m proposing this one too

reduce((s,c)->startswith(c,"AC") ? push!(s,[c]) :  [s[1:end-1];[push!(last(s),c)]],itr, init=[String[]])

This seems to perform well:

function group2(itr, str)
    ix = [1; findall(x -> startswith(x, str), itr); length(itr)+1]
    return [view(itr, ix[i]:(ix[i+1]-1)) for i in 1:(length(ix)-1)]
end
2 Likes

this should give the same result

groupby(let c=true; x->c=startswith(x,"CT") ? !c : c  end, itr)|>collect


PS
if group(by,itr) had the analogous form of reduce, for example, with a two-variable function and an init, it would be more flexible.

Same idea as last one, with some unicode xoring:

groupby(let c=true; x->c ⊻= startswith(x,"AT") end, itr)|>collect
1 Like

a variant

reduce((d,c)->startswith(c,"AC") ? (d[1]+=1;d[2][d[1]]=[c];d) :  (push!(get!(()->[],d[2],d[1]),c);d),itr, init=[0,Dict{Int, Vector{String}}()])

This answer seems pretty convoluted. Is it really helping answer the question? Julia obviously has a lot of ways to solve every problem, but I don’t think its useful to outline every single one.

1 Like

Yes. I know that the ones I wrote are convoluted and not very useful for the purpose of the problem.
But they are “useful” exercises… at least

Now supposing we have the following list of strings, we want to group them (scrolling them from first to last) in such a way that the strings with the “not too different root” are in the same group.

20-element Vector{String}:
 "GCCCCACA"
 "TCCATTTT"
 "TGCCCCGT"
 "GGGTTAGT"
 "GATTGAAG"
 "CCCGTAAT"
 "GGACCACT"
 "GGCGACAC"
 "GCTCTGTA"
 "AGTCCTGC"
 "TCATTACC"
 "GAAGTAGT"
 "AACTCGAA"
 "ATGACCCT"
 "TATTATCG"
 "CCTTGTCA"
 "TGCTCGCC"
 "CCGCATGG"
 "CTCTTTTG"
 "TTATGGAC"

more precisely with dist(n1,n2)>0 where

dist(n1,n2)= length(intersect(first(n1,3),first(n2,3)))

something like this should come

"GCC" => ["GCCCCACA", "TCCATTTT", "TGCCCCGT", "GGGTTAGT", "GATTGAAG", "CCCGTAAT", "GGACCACT", "GGCGACAC", "GCTCTGTA", "AGTCCTGC", "TCATTACC", "GAAGTAGT", "AACTCGAA", "ATGACCCT"], 
"TAT" => ["TATTATCG", "CCTTGTCA", "TGCTCGCC"], 
"CCG" => ["CCGCATGG", "CTCTTTTG"], 
"TTA" => ["TTATGGAC"]

This problem is poorly specified. What about ties? What about transitive differences?

If I understand the meaning of the observation, I specify better what I have already written
The groups are made up of consecutive elements of the list, similar to what the group function of itertools does

# Group consecutive values that share the same result of applying f.

The following should work

julia> using SplitApplyCombine;

julia> itr = [randstring("ACTG") for _ in 1:20];

julia> dist(n1, n2)= length(intersect(first(n1, 3), first(n2, 3)));

julia> diffs = similar(itr, Bool);

julia> for i in eachindex(itr)
           if i == firstindex(itr)
               diffs[i] = false
           else
               diffs[i] = dist(diffs[i-1], diffs[i]) > 0
           end
       end;

julia> diffs_groups = cumsum(diffs);

julia> group(diffs_goups, itr)
12-element Dictionaries.Dictionary{Int64, Vector{String}}
  0 │ ["GCGGGCCG"]
  1 │ ["ATGTCTGT", "TGTTATCA"]
  2 │ ["GCAAGAAC"]
  3 │ ["AGTTCCTA"]
  4 │ ["GGCGTACC", "GCTATGAT"]
  5 │ ["GACTCACG", "CAATGGTG"]
  6 │ ["AGAGCCTG"]
  7 │ ["CACGGGGA", "CATGATTG", "TTCTCAAG", "GAGTCTAT"]
  8 │ ["CCATGATA", "TCGCCGTA"]
  9 │ ["GGCATATT"]
 10 │ ["CTCTAACT", "CGTACGCG"]
 11 │ ["CTCGCTGT"]

in this case, the whole list is one group.
For each string, in the first 3 letters there is at least one of the first 3 of the list leader

1-element Dictionary{String, Vector{String}}
 "GCG" │ ["GCGGGCCG", "ATGTCTGT", "TGTTATCA", "GCAAGAAC", "AGTTCCTA", "GGCGTACC…

another example to better clarify the goal

julia> itr=[randstring("ACTG") for _ in 1:20]
20-element Vector{String}:
 "GCTATAAA"
 "CCTTCGGG"
 "TACGCGAG"
 "AAAAACGA"
 "ACCCGCTT"
 "GTGCGATC"
 "ATCTCGCG"
 "GGATATCA"
 "AACATATG"
 "GGTGCAGC"
 "GTTATGGC"
 "CGAGCTAA"
 "CCCTGCCA"
 "AAGACATC"
 "AGTATCCA"
 "CGATCGGT"
 "AGCCACAC"
 "TTCTATCA"
 "GCTTAGTC"
 "AGCATACA"

julia> 

julia> ...
8-element Vector{Vector{String}}:
 ["GCTATAAA", "CCTTCGGG", "TACGCGAG"]
 ["AAAAACGA", "ACCCGCTT"]
 ["GTGCGATC", "ATCTCGCG", "GGATATCA"]
 ["AACATATG"]
 ["GGTGCAGC", "GTTATGGC", "CGAGCTAA"]
 ["CCCTGCCA"]
 ["AAGACATC", "AGTATCCA", "CGATCGGT", "AGCCACAC"]
 ["TTCTATCA", "GCTTAGTC", "AGCATACA"]

from the following it is better seen that each group leader has no letters in common with the previous group leader

8-element Dictionary{String, Vector{String}}
 "GCT" │ ["GCTATAAA", "CCTTCGGG", "TACGCGAG"]
 "AAA" │ ["AAAAACGA", "ACCCGCTT"]
 "GTG" │ ["GTGCGATC", "ATCTCGCG", "GGATATCA"]
 "AAC" │ ["AACATATG"]
 "GGT" │ ["GGTGCAGC", "GTTATGGC", "CGAGCTAA"]
 "CCC" │ ["CCCTGCCA"]
 "AAG" │ ["AAGACATC", "AGTATCCA", "CGATCGGT", "AGCCACAC"]
 "TTC" │ ["TTCTATCA", "GCTTAGTC", "AGCATACA"]
julia> using IterTools

julia> groupby(let c=""; x->c = any(∈(c),first(x,3)) ? c : first(x,3) end, itr)|>collect
8-element Vector{Vector{String}}:
 ["GCTATAAA", "CCTTCGGG", "TACGCGAG"]
 ["AAAAACGA", "ACCCGCTT"]
 ["GTGCGATC", "ATCTCGCG", "GGATATCA"]
 ["AACATATG"]
 ["GGTGCAGC", "GTTATGGC", "CGAGCTAA"]
 ["CCCTGCCA"]
 ["AAGACATC", "AGTATCCA", "CGATCGGT", "AGCCACAC"]
 ["TTCTATCA", "GCTTAGTC", "AGCATACA"]

This works. What is the motivation behind these exercises?

1 Like

I had a vague memory of this that I have now found, and I wanted to check if the use you showed of groupby() responded to some of the examples proposed there.

this last example is because I wanted to find a case where one of those convoluted propositions with the use of reducers and dictionaries would be useful, thinking that groupby can’t deal with such cases

reduce((d,c)->dist(c,d[1])<1 ? (d[1]=first(c,3);insert!(d[2], d[1], [c]);d) :  (push!(get!(()->[],d[2],d[1]),c);d),itr, init=["",Dictionary{String, Vector{String}}()])[2]

but then, thinking about it better, I found the way (similar to what you proposed) to use groupby also in this case



dist(n1,n2)= length(intersect(first(n1,3),first(n2,3)))
groupby(let c=first(itr[1],3); x->dist(x,c)==0 ? c=first(x,3) : c end, itr)|>collect


# or



dist1(n1,n2)=isempty(intersect(first(n1,3),first(n2,3))) ? first(n2,3) : missing

groupby(let c=""; x->c=coalesce(dist1(c,x),c) end, itr)|>collect