BioRecordsProcessing.jl aims to process files containing biological records with a minimal amount of boilerplate. It can be used in place of tools like samtools, vcftools, seqtk, etc.
Features:
- Read from disk, modify/filter the records (with a user-defined function) and write back to disk
- Read or write to memory
- Process whole directories in parallel
- Handle paired files
- Handle compressed files
- Support VCF, FASTA/FASTQ and SAM/BAM (and possibly any Record type that uses the bio record “interface”)
Example:
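using BioRecordsProcessing, FASTX, BioSequences

# `filepath` is the path to a FASTA file on disk
p = Pipeline(
    # Source: read FASTA records from the file
    Reader(FASTX.FASTA, File(filepath)),
    # Processing function: extract the DNA sequence of each record
    record -> begin
        sequence(LongDNA{4}, record)
    end,
    # Sink: collect the returned sequences in memory
    Collect(LongDNA{4}),
)
run(p)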
# output
2-element Vector{LongSequence{DNAAlphabet{4}}}:
CTTGGCATACTCAAACTCTT
CTTGGCATACTCAAACTCTT
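
For comparison, here is a rough sketch of what the pipeline above does, written directly against FASTX without BioRecordsProcessing. It is not how the package is implemented. The in-memory FASTA data and the record names `seq1`/`seq2` are made up for illustration, and the FASTX v2 API is assumed.

```julia
using FASTX, BioSequences

# Made-up FASTA content held in memory, standing in for a file on disk.
data = IOBuffer(">seq1\nCTTGGCATACTCAAACTCTT\n>seq2\nCTTGGCATACTCAAACTCTT\n")

# Open a reader, extract the sequence of each record, and collect them.
seqs = LongDNA{4}[]
reader = FASTA.Reader(data)
for record in reader
    push!(seqs, sequence(LongDNA{4}, record))
end
close(reader)
```

The `Pipeline` form replaces the explicit loop, the accumulator and the reader management with a source, a function and a sink.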
Missing features:
I think it would be useful to be able to provide a genomic interval to read from (especially for BAM files), and also to group records based on some user-defined criteria (e.g. by read name for paired-end BAM files).
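
To make these two ideas concrete, here is a minimal sketch in plain Julia. Nothing in it is part of BioRecordsProcessing: the records are hypothetical named tuples standing in for alignment records, and `in_interval` and `group_by` are illustrative helper names.

```julia
# Hypothetical stand-ins for alignment records (1-based, inclusive coordinates).
reads = [
    (name = "read1", chrom = "chr1", leftpos = 90,  rightpos = 139),
    (name = "read2", chrom = "chr1", leftpos = 500, rightpos = 549),
    (name = "read1", chrom = "chr1", leftpos = 210, rightpos = 259),
    (name = "read2", chrom = "chr1", leftpos = 640, rightpos = 689),
]

# True if the record overlaps the interval chrom:lo-hi.
in_interval(r, chrom, lo, hi) =
    r.chrom == chrom && r.leftpos <= hi && r.rightpos >= lo

# Group items by a user-supplied key function (here, the read name).
function group_by(key, items)
    groups = Dict{Any, Vector{eltype(items)}}()
    for item in items
        push!(get!(groups, key(item), eltype(items)[]), item)
    end
    return groups
end

selected = filter(r -> in_interval(r, "chr1", 100, 250), reads)
by_name = group_by(r -> r.name, reads)
```

This version scans every record and holds all groups in memory. A real implementation for BAM files would more likely rely on the file index for interval queries and on name-sorted input for streaming groups.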