Genomic Interval Metadata & Annotation

Hi,

Pretty new to Julia but looking forward to its continued growth and use for genomic data. Given that the current BioJulia documentation is a bit sparse, I wanted to get the opinion of people who are more familiar with the codebase about whether this need is currently being met.

In most of my current work I use the GenomicRanges R/Bioconductor package for all my genomic interval operations. I notoriously abuse the metadata columns of this data structure to track basically everything about my peaks, using a single large GRanges object as sort of a database backend that I subset from to make plots/do analysis. In the current setup of the GenomicFeatures.Interval, there also exists a metadata-like column. Would the Julia solution to my GRanges database be to use a dataframe for the metadata field? Is there a datastructure that is better suited to this in the BioJulia/Julia universe (ie just a dataframe?)? The issue I sometimes wind up having in R is having to jump things back and forth from dataframe to GRanges to accomplish certain tasks, or having to use non-intuitive lookup tables to link Genomic features to some kind of meaningful data (like a clustered count matrix for heatmaps) so it could be that developing a BioJulia framework that prevents a similar issue could be a useful one.

You may find the GenomicTable type in GenomicVectors.jl useful. It is a subtype of AbstractDataTable that has a GenomicRanges or GenomicPositions as a row index. The end result is something much like BioConductor’s GenomicRanges with a DataFrame as metadata.

It sounds like @phaverty 's package might be your best bet, since it’s directly inspired by the R package you’re used to.

But we’ve also got BioJulia, which is undergoing a bit of a reorganization at the moment, so it might be tough to find what you’re looking for. It sounds like you might want GenomicFeatures.jl which has intervals. Eg:

julia> i = Interval("chr1", 10000, 20000, '+', "some annotation")
GenomicFeatures.Interval{String}:
  sequence name: chr1
  leftmost position: 10000
  rightmost position: 20000
  strand: +
  metadata: some annotation

julia> seqname(i)
"chr1"

julia> leftposition(i)
10000

julia> rightposition(i)
20000

julia> strand(i)
STRAND_POS

julia> metadata(i)
"some annotation"

Of note, that metadata field can be of any type, so you could store a Dict or even a DataFrame in the metadata field if you wanted to. That package also has IntervalCollections, so you can group all of the intervals for a given sequence in an efficient way.

I’m not the author of this package, but the Bio.jl folks typically are trying to write code that takes advantage of julia’s awesomeness in innovative ways. So stuff might be less familiar, but ultimately may be faster than code ported from a different language (I’m not familiar with that R package, so I don’t know if it’s better or worse for what you want). That said, sometimes ease > speed, so YMMV.

Maybe @bicycle1885 or @Ward9250 might have more to say on the design philosophy for Intervals.

@synstrom Crap, I see now that my response is basically exactly what you’re asking about :-/. Reading comprehension fail!

In which case I don’t have any relevant info to add, other than my tags of some of the key architects of the Bio package. I’ll see myself out.

Would the Julia solution to my GRanges database be to use a dataframe for the metadata field?

GenomicFeatures.Interval{T} is parameterized by its metadata type and can store any object you want there, such as Float64, String, Dict, Vector, DataFrame, etc.

Is there a datastructure that is better suited to this in the BioJulia/Julia universe (ie just a dataframe?)?

As @phaverty already mentioned, his GenomicVectors.jl package would be a good solution to your problem. Unfortunately, I have no idea about the performance for your tasks.

GenomicFeatures.jl has IntervalCollection type, which is conceptually a sequence of Intervals with index. IntervalCollection is more like a dictionary with interval keys and metadata values while GenomicVectors.jl is more like a data frame. The index data structure is based on the interval tree built on top of B+ tree while GRanges of R uses NCList by default. Both support overlap query but they are very different data structures and hence the performance characteristics would be very different as well. We’ve compared these two before and concluded that the interval tree is better for our package.

I cannot reach a definitive conclusion of your problem but I hope my comments are helpful to you.

GenomicVectors.jl has a few relevant types. GenomicRanges is like R’s
GenomicRanges minus the metadata part (it is a vector). It actually uses
features from GenomicFeatures.jl to represent intervals and to do overlap
queries. It behaves like Vector{Interval}, rather than IntervalCollection,
meaning that it can be set to a specific order (e.g. sorted by some score,
genome location, etc.)

GenomicTable is a DataTable that uses a GenomicRanges as a row index. I
like having a proper DataTable so you can do all the usual data frame
operations. My understanding of the metadata for an IntervalCollection is
that each Interval has its own metadata slot, so you would have to write
your own functions to operate on the collection of them.

I expect that the speed of things like overlap queries will be a bit slower
than GenomicFeatures.jl as it is necessary to build an IntervalCollection
before doing any queries, but it should be plenty fast relative to R. It’s
all a bit experimental and I don’t know how the pros and cons of these
different approaches will balance out. I’d be very interested in any
feedback you can offer.

Hope that helps.

Pete

I seem to recall a Github issue you posted saying you guys were converting the Interval type to immutable–this would make the metadata column static as well, correct? Regardless, your comments are very helpful, and I think I’ll give GenomicVectors.jl a shot. For my purposes, I care less about interval operations and more about having a datastructure that lets me track and resort data based on other parameters that are not genomic position. Genomic position is used as a secondary characteristic to retrieve sequence information or interval overlaps later between groups that fit the filtered/sorted characteristics.

I’ll give this a shot in the next few days when I’ve got some downtime and let you know.

If the things you work with are type-like you can also just use a type. That is instead of constructing an interval and putting things into its metadata you do:

type A
    x
    y
    interval
end

(or inherit from Interval).

The advantage is that you can build a more natural and structured grammar using types, i.e. if you have an array of peaks and you want to do the histogram of their height:

hist(map(height,peaks))

Yeah, the metadata is also immutable. But if it is a reference value (e.g. Vector), you can update the value in-place. Also, if you wrap the metadata with Ref, you can still update the value even if it is not a reference value (e.g. Int) like this:

julia> i = Interval("chr1", 1:10, '+', Ref(0))
GenomicFeatures.Interval{Base.RefValue{Int64}}:
  sequence name: chr1
  leftmost position: 1
  rightmost position: 10
  strand: +
  metadata: Base.RefValue{Int64}(0)

julia> i.metadata[]  # weird brackets are needed
0

julia> i.metadata[] += 1  # update!
1

julia> i.metadata[]
1

So, basically making Interval immutable will not limit the functionality.

1 Like