Sorting lines by ID seems slow

tblz · December 16, 2018, 1:41am

Am trying to sort a file that looks like this:

sc285168_1 # 660 # 1046 # -1 # ID=285168_1;partial=00;
sc29363_1 # 57 # 887 # 1 # ID=29363_1;partial=00;
sc316197_1 # 17 # 418 # 1 # ID=316197_1;partial=00;
sc273994_1 # 1 # 243 # 1 # ID=273994_1;partial=00;
sc113906_1 # 436 # 1314 # 1 # ID=113906_1;partial=00;

By the “ID” field, where I want to sort first by first number before underscore and then by second (so “285168_2” is less than “285169_1” but greater than “285168_1”).

I tried to define a function like…

function split_int_id(x)
	id_field = split(split(x, '=')[2], ';')[1]
	id_scaff, id_num = split(id_field, '_')
	return Pair(parse(Int, id_scaff), parse(Int, id_num))
end

But sorting this through sorted_lines = sort(in_read, by = x -> split_int_id(x)) seems quite slow, about 6x slower than a python equivalent. I also tried something that doesn’t use sorts (maybe memory allocation from all those lists) but this function doesn’t do much better.

function index_int_id(x)
	id_start = findfirst(y -> y == '=', x) + 1
	id_end = findfirst(y -> y == ';', x) - 1
	sub_string = x[id_start:id_end]
	find_under = findfirst(y -> y == '_', sub_string)
	id_scaff = parse(Int64, sub_string[1:(find_under - 1)])
	id_gene = parse(Int64, sub_string[(find_under + 1):end])
	return Pair(id_scaff, id_gene)
end

Could someone help me get a performance boost? This seems like a fairly conventional usage of sort-by.

bennedich · December 16, 2018, 2:08am

One problem is that the by function will be called many times more than the number of elements you’re sorting, since it’s called for every comparison. If you’re comparing 1 000 elements for example, the by function might get called 20 000 times. So in your case, consider first looping over all lines and extracting the sort key(s), and then sorting using those keys.

The split function itself also doesn’t look very efficient; there’s a lot of splitting going on there. There should be plenty of room for improvement if my first suggestion is not enough (your index_int_id method looks more promising!).

tblz · December 16, 2018, 3:18am

Hmm. I didn’t realize that by would be called for each comparison. I guess I expected that it would be applied to each element once before sorting. I think that’s how python’s sorted(my_lines, key = ...) works which would explain why it seemed faster originally.

Running this as:

the_ids = map(split_int_id, my_file_lines)
sort_by_id = sortperm(the_ids)
sorted_lines = my_file_lines[sort_by_id]

Does seem to work and gets better performance. Thanks for the clarification.

Topic		Replies	Views
Sort array based on another array? Data sort , arrays	14	6264	March 16, 2023
Anoymous Functions Slow in sort Performance sort	3	869	August 22, 2018
How to sort a vector by an expensive function in place efficiently? Performance sorting	14	1723	March 8, 2022
Is there a more elegant solution than `sortperm(sortperm(...))` Performance question	6	528	February 22, 2020
Progress towards faster `sortperm` for Strings Data performance , sortperm	16	4645	January 29, 2018

Sorting lines by ID seems slow

Related topics