This is a small utility package that helps reduce the boilerplate when processing files containing biological records (FASTQ, BAM, VCF, …). It handles file management, opening and closing readers and writers, processing files in parallel, etc. In theory one can replicate most of the options in classic tools like samtools, vcftools, seqtk, etc. (although it doesn’t currently take advantage of indexed files).
Here’s an example that filters out records in FASTA files whose sequence is shorter than 50:
    using BioRecordsProcessing, FASTX

    BioRecordsProcessing.process_directory(FASTX.FASTA, input_directory, "*.fa", output_directory; max_records=100) do record
        return FASTX.FASTA.seqlen(record) < 50 ? nothing : record
    end
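To make the callback convention in the example explicit (return `nothing` to drop a record, return the record to keep it), here is a minimal dependency-free sketch of the same pattern. `process_records` is a hypothetical helper written for illustration, not a function of the package, and plain strings stand in for FASTA records; how the package handles the return value internally is an assumption based on the example above.

```julia
# Hypothetical helper, NOT part of BioRecordsProcessing: applies `f` to each
# record and keeps only the results that are not `nothing`.
function process_records(f, records)
    kept = []
    for record in records
        result = f(record)
        # `nothing` means "drop this record"; anything else is kept.
        result === nothing || push!(kept, result)
    end
    return kept
end

# Plain strings stand in for FASTA records here.
sequences = ["ACGT", "ACGTACGTAC", "AC"]

# Same shape as the `do record ... end` block in the example above.
long_enough = process_records(sequences) do seq
    return length(seq) < 4 ? nothing : seq
end
```

Because the callback returns a value rather than a boolean, the same shape can also be used to return a modified record instead of the original one.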
I’ve used it a bit myself but I haven’t tested it very thoroughly, so double-check that the outputs are correct.