Support for cancer data in GeneticVariation?

question

#1

I want to try to use Julia for analyzing cancer genomics data but I am not sure how. BioJulia seems to cover the basics, and I have contacted Julia Computing for info about JuliaRun and JuliaDB.

Genomics data are increasingly important for cancer research. However, most platforms assume healthy DNA and do not use Julia. An example is Databricks “Unified Analytics Platforms for Genomics” which is based on ADAM which is based on PySpark, the Python API for Apache Spark. Ben Ward had an inspiring talk at JuliaCon 2018 titled BioJulia and Bioinformatics in Julia: Past, Present, Future. What seems specific in genomics for cancer data is what Ben and others calls “variation”.

I identified GeneticVariation.jl as a possible Julia solution for the cancer variation problem. The package has this description:

GeneticVariation provides types and methods for working with datasets of genetic variation. It provides a VCF and BCF parser, as well as methods for working with variation in sequences such as evolutionary distance computation, and counting different mutation types.

Can GeneticVariation be used for analyzing cancer genomics data? Is there a way in Julia to use a somatic variant caller like GATK4 Mutect2?

Any recommendation for how to use Julia for cancer genomics data will be most appreciated.


#2

In my experience you can easily read most major file formats with BioJulia packages. I would say at the moment it’s more suited for developers than end users, because it has strong foundations, but you often need to write some custom code to do something. For example there’s no Ensembl dedicated package (that I know of), but it’s easy to read their GFF3 files with Bio.jl or make requests via their web API with HTTP.jl.

For GATK I would just build the system command in Julia, run it, and then read the VCF file (maybe use something like Dispatcher.jl to manage external commands).


#3

@jonathanBieler Thank you. I found the Mutect2 documentation. It suggests to me that JavaCall.jl should be able to call Mutect2 but it is very possible that I do not understand yet how to use JavaCall. This is example code from the Mutect2 documentation for Tumor/Normal variant calling:

java -jar GenomeAnalysisTK.jar
-T MuTect2
-R reference.fasta
-I:tumor tumor.bam
-I:normal normal.bam
[–dbsnp dbSNP.vcf]
[–cosmic COSMIC.vcf]
[-L targets.interval_list]
-o output.vcf

I need to learn more Julia and try it! :slight_smile:

Edit: The example above is for GATK3.8 rather than GATK4 (referred to in original posting). I will accept the reply of @jonathanBieler as the correct answer.


#4

GATK is written in Java but you can call it like any other program (once it’s installed) so I don’t think you would need JavaCall. Maybe it could be useful if you want to interact with GATK internals though.

For example to get help about a particular tool you could do:

gatk_help(tool) =  readstring(`gatk $tool -h`)

gatk_help("MarkDuplicates")