I’d like to announce a package I’ve been working on for my research group here at the Earlham Institute (the first full Julia project… I think I’m getting to them) that is currently close to release. There are some features left to implement, but it is usable now.
It is called Pseudoseq, and it is a package designed for the simulation of DNA sequencing experiments.
The idea is not to replicate in silico any specific machine or technology that exists, but instead to represent sequencing in the abstract, as a sampling process. This allows us to gain insight into the assumptions and intricacies of genome assembly algorithms, and to test our abstract understanding against reality. Currently I’m working on the ability to create arbitrary chromosomes/genomes with certain features relevant to the genome assembly problem.
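To make "sequencing as a sampling process" concrete, here is a minimal sketch in plain Python. It is not Pseudoseq's API or its actual model: the function name `sample_reads`, its parameters, and the uniform choice of start positions are all assumptions made purely for illustration.

```python
import random

def sample_reads(genome, n_reads, read_len, rng):
    """Hypothetical helper: draw n_reads substrings of length read_len,
    each starting at a uniformly random position in the genome."""
    if read_len > len(genome):
        raise ValueError("read_len exceeds genome length")
    last_start = len(genome) - read_len
    # Assumption: every start position is equally likely (no bias, no errors).
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

rng = random.Random(42)
# A toy random "genome" of 1000 bases, standing in for a real FASTA sequence.
genome = "".join(rng.choice("ACGT") for _ in range(1000))
reads = sample_reads(genome, n_reads=200, read_len=50, rng=rng)

# Expected coverage is (number of reads * read length) / genome length.
coverage = len(reads) * 50 / len(genome)
```

Under these assumptions the reads are just a random sample of the genome, which is the kind of idealised baseline an assembler's assumptions can be tested against.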
Very cool! It looks like this is designed to simulate isolate genomes. How difficult do you think it would be to extend it to simulate metagenomes? That is, many individual genomes that are found at different relative abundances?
I think that will be possible at some point. I’m currently adding the functionality not only to read in a genome from FASTA, but also to make a genome with desired characteristics. It’s going to be accessible at several levels, from something as simple as makegenome(args....) to a more fine-grained set of types and methods, where you can build something up chromosome by chromosome or haplotype by haplotype.
Once I have that, and have worked out how it plugs into the rest of the sequencing (the Molecule Pool type, and so on), it will be clearer how something like metagenomes can be done. One way might be to simulate distinct genomes, and then mix the reads produced into a single sample at certain proportions. Another would be to mix the genomes at the start. I’m not sure which is the most elegant route right now.
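As a rough illustration of the first route (simulate each genome separately, then mix the reads at given proportions), here is a small plain-Python sketch. Nothing in it is Pseudoseq functionality: `mix_reads`, the read lists, and the abundance values are hypothetical, and it assumes reads are drawn with replacement from each genome's read set.

```python
import random

def mix_reads(reads_by_genome, abundances, n_total, rng):
    """Hypothetical helper: build a mixed sample of n_total reads.
    For each read, a source genome is picked with probability proportional
    to its abundance, then one of its reads is drawn (with replacement)."""
    names = list(reads_by_genome)
    weights = [abundances[name] for name in names]
    mixed = []
    for _ in range(n_total):
        name = rng.choices(names, weights=weights, k=1)[0]
        mixed.append((name, rng.choice(reads_by_genome[name])))
    return mixed

rng = random.Random(1)
# Toy reads standing in for the output of two separate simulations.
reads_by_genome = {
    "genome_A": ["ACGTAC", "CGTACG", "GTACGT"],
    "genome_B": ["TTGACA", "TGACAT"],
}
# Assumed relative abundances in terms of reads, not cells or genome copies.
abundances = {"genome_A": 0.9, "genome_B": 0.1}
sample = mix_reads(reads_by_genome, abundances, n_total=1000, rng=rng)
n_from_a = sum(1 for name, _ in sample if name == "genome_A")
```

Note that mixing reads in this way fixes the proportion of reads per genome. Mixing the genomes at the start would instead fix the proportion of genome copies, so larger genomes would contribute more reads; the two routes are not interchangeable.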