Another question on eachoverlap in GenomicFeatures.jl

zhangchunyong · November 18, 2022, 3:33am

using GenomicFeatures
using DataFrames


col = [
	Interval("chr1", 10628, 10683, '?', "abc")
	Interval("chr1", 10643, 10779, '?', "abc")
	Interval("chr1", 10645, 10748, '?', "abc")
	Interval("chr1", 10648, 10786, '?', "abc")
	Interval("chr1", 10676, 10767, '?', "abc")
	Interval("chr1", 10690, 10731, '?', "abc")
	Interval("chr1", 10707, 10793, '?', "abc")
	Interval("chr1", 10742, 10814, '?', "abc")
] |> IntervalCollection

hhh = [
	Interval("chr1", 10631, 10638)
	Interval("chr1", 10633, 10635)
	Interval("chr1", 10636, 10650)
	Interval("chr1", 10638, 10649)
	Interval("chr1", 10641, 10651)
	Interval("chr1", 10644, 10689)
	Interval("chr1", 10645, 10660)
	Interval("chr1", 10650, 10650)
] |> IntervalCollection

But I want to findoverlap with minoverlap=5.I tried like the following

function number_interval(tp::Tuple)
	# Unpack Tuple.
	(i, interval) = tp

	# Setup numbered metadata.
	new_metadata = (
		i = i,
		original = metadata(interval)
	)

	# Create new interval with numbered metadata.
	return Interval(
		seqname(interval),
		leftposition(interval),
		rightposition(interval),
		strand(interval),
		new_metadata
	)
end

# Convert intervals.
col_numbered = col |> enumerate .|> number_interval |> IntervalCollection
hhh_numbered = hhh |> enumerate .|> number_interval |> IntervalCollection


# Reproduce example.
df = DataFrame()

for (c, h) in eachoverlap(col_numbered, hhh_numbered)

	result = (
		queryHits = GenomicFeatures.metadata(c).i, 
		subjectHits = GenomicFeatures.metadata(h).i
	)

	push!(df, result)
end

julia> df
27×2 DataFrame
 Row │ queryHits  subjectHits
     │ Int64      Int64
─────┼────────────────────────
   1 │         1            1
   2 │         1            2
   3 │         1            3
   4 │         1            4
   5 │         1            5
   6 │         1            6
   7 │         1            7
   8 │         1            8
   9 │         2            3
  10 │         2            4
  11 │         2            5
  12 │         2            6
  13 │         2            7
  14 │         2            8
  15 │         3            3

But I can not solve the problem with minoverlap=5.For example,hhh’s first line has more than 5 overlaps with col’s first ,I will output the index of col’s index and hhh’s index.hhh’s second line does not have 5 overlaps,it will not occur in the final dataframe.The function above seems to solve minoverlap=1.What should I do to solve this problem? Thank all guys for helping me!

kevbonham · November 21, 2022, 9:20pm

Hi there! Welcome to the community . Please take a look at this post for some suggestions on how best to formulate a questions. In particular, launching into code without context makes it difficult for readers to know what you’re asking for, and the title “Question about X in package Y” doesn’t provide very much context.

Also, as a matter of etiquette, please do not ping the same people in multiple different ways. I received notifications for this post, direct message on discourse, an issue on Github, and in my personal email, all in the course of a few days (including the weekend). While I can appreciate your eagerness to solve your problem, it’s not good form to send people direct messages unless you have a question that specifically involves them. For my part, I didn’t respond here or on github because I don’t work with this package and so I can’t help.

zhangchunyong · November 22, 2022, 3:48am

I am so sorry to bother you.Thank you for your reminding, I will pay attention to these things in the future.

CiaranOMara · November 27, 2022, 11:41pm

Below is an example of how I would introduce a filtering step.
Adjust the filtering step as required – your overlap criteria weren’t clear to me.

using GenomicFeatures #v2
using DataFrames


col = [
	Interval("chr1", 10628, 10683, '?', "abc")
	Interval("chr1", 10643, 10779, '?', "abc")
	Interval("chr1", 10645, 10748, '?', "abc")
	Interval("chr1", 10648, 10786, '?', "abc")
	Interval("chr1", 10676, 10767, '?', "abc")
	Interval("chr1", 10690, 10731, '?', "abc")
	Interval("chr1", 10707, 10793, '?', "abc")
	Interval("chr1", 10742, 10814, '?', "abc")
] |> IntervalCollection

hhh = [
	Interval("chr1", 10631, 10638)
	Interval("chr1", 10633, 10635)
	Interval("chr1", 10636, 10650)
	Interval("chr1", 10638, 10649)
	Interval("chr1", 10641, 10651)
	Interval("chr1", 10644, 10689)
	Interval("chr1", 10645, 10660)
	Interval("chr1", 10650, 10650)
] |> IntervalCollection

function number_interval(tp::Tuple)
	# Unpack Tuple.
	(i, interval) = tp

	# Setup numbered metadata.
	new_metadata = (
		i = i,
		original = GenomicFeatures.metadata(interval)
	)

	# Create new interval with numbered metadata.
	return Interval(
		seqname(interval),
		leftposition(interval),
		rightposition(interval),
		strand(interval),
		new_metadata
	)
end

"""
Given intervals a and b, that are assumed to overlap, the overlap is the span between the maximum left position and minimum right position.
interval: a    b    a   b
    side: l    l    r   r
  layout: |----|----|---|
"""
function overlap_span(interval_a, interval_b)
	left = max(leftposition(interval_a), leftposition(interval_b))
	right = min(rightposition(interval_a), rightposition(interval_b))
	return length(left:right)
end

# Convert intervals.
col_numbered = col |> enumerate .|> number_interval |> IntervalCollection
hhh_numbered = hhh |> enumerate .|> number_interval |> IntervalCollection

# Setup overlap iterator.
itr = eachoverlap(col_numbered, hhh_numbered); # Note: suppressing IntervalTree's output.

# Setup "on the fly" filtering.
itr = Iterators.filter(itr) do (c, h)
	# Select pairs with a minimum overlap of 5.
	if overlap_span(c, h) >= 5
		return true
	end
	
	# Reject overlapping pair.
	return false
end

# Setup result generator.
itr = Base.Generator(itr) do (c, h)
	result = (
		queryHits = GenomicFeatures.metadata(c).i,
		subjectHits = GenomicFeatures.metadata(h).i
	)
	return result
end

# Collect results in a `DataFrame`.
df = DataFrame(itr)

julia> df
19×2 DataFrame
 Row │ queryHits  subjectHits 
     │ Int64      Int64       
─────┼────────────────────────
   1 │         1            1
   2 │         1            3
   3 │         1            4
   4 │         1            5
   5 │         1            6
   6 │         1            7
   7 │         2            3
   8 │         2            4
   9 │         2            5
  10 │         2            6
  11 │         2            7
  12 │         3            3
  13 │         3            4
  14 │         3            5
  15 │         3            6
  16 │         3            7
  17 │         4            6
  18 │         4            7
  19 │         5            6

This question of zhangchunyong’s continues from the following.

Topic		Replies	Views
Eachoverlap's question in GenomicFeatures .jl New to Julia question	0	223	November 19, 2022
Some questions about eachoverlap in GenomicFeatures.jl Biology, Health, and Medicine	4	676	August 21, 2022
Some questions about eachoverlap in GenomicFeatures General Usage	1	228	August 30, 2022
Using DataFrames to split overlapping datetime intervals into non overlapping intervals General Usage question , dataframes , dataframesmeta	9	361	July 23, 2023
Function findall and overlapping General Usage	3	72	November 26, 2024

Another question on eachoverlap in GenomicFeatures.jl

Related topics