How to create all possible k-mers with a set of alphabets?

I am trying to use Iterators.product to generate a list of all possible k-mers with a set of alphabets. Since I am a biologist, these letters happen to be the bases of DNA :smiley: : “ATGC”

If I want to generate 3-mers then this works perfectly:

nucs = "ATGC"
join.(Iterators.product(nucs,nucs,nucs))
4×4×4 Array{String, 3}:
[:, :, 1] =
 "AAA"  "ATA"  "AGA"  "ACA"
 "TAA"  "TTA"  "TGA"  "TCA"
 "GAA"  "GTA"  "GGA"  "GCA"
 "CAA"  "CTA"  "CGA"  "CCA"

[:, :, 2] =
 "AAT"  "ATT"  "AGT"  "ACT"
 "TAT"  "TTT"  "TGT"  "TCT"
 "GAT"  "GTT"  "GGT"  "GCT"
 "CAT"  "CTT"  "CGT"  "CCT"

[:, :, 3] =
 "AAG"  "ATG"  "AGG"  "ACG"
 "TAG"  "TTG"  "TGG"  "TCG"
 "GAG"  "GTG"  "GGG"  "GCG"
 "CAG"  "CTG"  "CGG"  "CCG"

[:, :, 4] =
 "AAC"  "ATC"  "AGC"  "ACC"
 "TAC"  "TTC"  "TGC"  "TCC"
 "GAC"  "GTC"  "GGC"  "GCC"
 "CAC"  "CTC"  "CGC"  "CCC"

However, if I want to extend the same method for longer words, I have to keep repeating “nucs” as the arguments for Iterators.product. What I would like is a smart way where I specify the word size and I generate all possible words of that size. So far, I tried this:

join.(Iterators.product(Iterators.repeated(nucs,3)))
3-element Vector{String}:
 "ATGC"
 "ATGC"
 "ATGC"

I tried with fill too:

join.(Iterators.product(fill(nucs,1,3)))
join.(Iterators.product(fill(nucs,3)))

but I get the same output.

Using collect(nucs) instead of nucs also doesn’t give me the desired output

Any solutions?

Is

what you are looking for?

Answer: No, OP is looking for solution below using Iterators.product with splatting.

I am looking for combinations of strings as I described. Thanks for the suggestion. Can you tell me which function I can use from this library to get what I want?

join.(Iterators.product(Iterators.repeated(nucs,3)...))

may be what you are looking for. Notice the ... compared to what you suggested: this is called splatting.

Its performance can be problematic if you splat large iterators, but in this case the cost of calling Iterators.product will be much higher anyway so it shouldn’t matter. I would not advise replacing 3 with a number larger than 10 though, as you will probably need a few Gb to store the resulting array.

3 Likes

That Combinatorics version is much more efficient than splatting :+1:

This is great but I am not able to get longer strings. It returns String[]

Ah right, it’s more efficient because it’s wrong :sweat_smile:
You are not looking for permutations of “ACGT” but for the cartesian product of this set of characters (you can check that the size of the two results do not match). I don’t know if there is something in Combinatorics that works… But the splatting version should yield the desired result.

Splatting works for me quite well (takes 3 seconds to produce all 10-mers). I just do this once so speed is not a big issue right now.

Another option, using Combinatorics:

using Combinatorics
with_replacement_combinations("ACGT",3) .|> x->multiset_permutations(x,3) .|> join
1 Like

In this particular case, you can use the upcoming package Kmers.jl (not released yet, but close). There is no official “all kmers iterator”, but you can make one:

using Kmers

struct AllKmerIterator{A <: NucleicAcidAlphabet, K} end
Base.IteratorSize(::Type{<:AllKmerIterator}) = Base.HasLength()
function Base.length(::AllKmerIterator{A, K}) where {A, K}
    length(A())^K
end

function Base.iterate(it::AllKmerIterator{A, K}, state=0) where {A, K}
    state == length(it) && return nothing
    Kmers.kmertype(Kmer{A, K})((state % UInt64,)), state+1
end

This is much more efficient than the other suggestions, and yields Kmer objects, which are subtypes of BioSequence. For example:

julia> it = AllKmerIterator{DNAAlphabet{2}, 4}()
Main.AllKmerIterator{DNAAlphabet{2}, 4}()

julia> seq = first(it)
DNA 4-mer:
AAAA

julia> Kmers.reverse_complement(seq)
DNA 4-mer:
TTTT

My laptop iterates over all 10-mers using this code in about 2.5 miliseconds.
Be aware that there will be rough edges since Kmers is in the alpha stage still.

Alternatively, BioSequences v2 contain a more primitive version of Kmers than Kmers.jl, but is well-tested and stable.

2 Likes

Do you need to use join and Strings? If you used collect and Symbols, you’d end up with tuples instead of strings, but it’s the same information and much faster:

nucs_sym = (:A, :T, :G, :C)
nucs_str = ("A", "T", "G", "C")
@btime join.(Iterators.product(Iterators.repeated($nucs_str, 3)...)); # 4.773 \mu s, 258 alloc, 14.16 KiB
@btime collect(Iterators.product(Iterators.repeated($nucs_sym, 3)...)); # 97 ns, 1 alloc, 1.59 KiB

I don’t know what you’re doing and so maybe you need to convert to strings somehow in the end and this doesn’t help. Just thought I’d mention it because strings seem a bit wasteful here.

3 Likes

Thanks. Symbols are indeed fast but I am not able to create certain combinations that I can with strings. For example I need to create a set of sequences: (T/A/C) (CAT) (G/A/C). I should get 9 sequences.

If I am using strings, I can simply do this:

join.(Iterators.product("TAC","C", "A","T" ,"GAC")

How can I achieve the same using tuples of symbols?

julia> for p in Iterators.product((:T,:A,:C), (:C,),(:A,),(:T,), (:G,:A,:C))
       @show p
       end
p = (:T, :C, :A, :T, :G)
p = (:A, :C, :A, :T, :G)
p = (:C, :C, :A, :T, :G)
p = (:T, :C, :A, :T, :A)
p = (:A, :C, :A, :T, :A)
p = (:C, :C, :A, :T, :A)
p = (:T, :C, :A, :T, :C)
p = (:A, :C, :A, :T, :C)
p = (:C, :C, :A, :T, :C)
1 Like