Sorting lines by ID seems slow


#1

Am trying to sort a file that looks like this:

sc285168_1 # 660 # 1046 # -1 # ID=285168_1;partial=00;
sc29363_1 # 57 # 887 # 1 # ID=29363_1;partial=00;
sc316197_1 # 17 # 418 # 1 # ID=316197_1;partial=00;
sc273994_1 # 1 # 243 # 1 # ID=273994_1;partial=00;
sc113906_1 # 436 # 1314 # 1 # ID=113906_1;partial=00;

By the “ID” field, where I want to sort first by first number before underscore and then by second (so “285168_2” is less than “285169_1” but greater than “285168_1”).

I tried to define a function like…

function split_int_id(x)
	id_field = split(split(x, '=')[2], ';')[1]
	id_scaff, id_num = split(id_field, '_')
	return Pair(parse(Int, id_scaff), parse(Int, id_num))
end

But sorting this through sorted_lines = sort(in_read, by = x -> split_int_id(x)) seems quite slow, about 6x slower than a python equivalent. I also tried something that doesn’t use sorts (maybe memory allocation from all those lists) but this function doesn’t do much better.

function index_int_id(x)
	id_start = findfirst(y -> y == '=', x) + 1
	id_end = findfirst(y -> y == ';', x) - 1
	sub_string = x[id_start:id_end]
	find_under = findfirst(y -> y == '_', sub_string)
	id_scaff = parse(Int64, sub_string[1:(find_under - 1)])
	id_gene = parse(Int64, sub_string[(find_under + 1):end])
	return Pair(id_scaff, id_gene)
end

Could someone help me get a performance boost? This seems like a fairly conventional usage of sort-by.


#2

One problem is that the by function will be called many times more than the number of elements you’re sorting, since it’s called for every comparison. If you’re comparing 1 000 elements for example, the by function might get called 20 000 times. So in your case, consider first looping over all lines and extracting the sort key(s), and then sorting using those keys.

The split function itself also doesn’t look very efficient; there’s a lot of splitting going on there. There should be plenty of room for improvement if my first suggestion is not enough (your index_int_id method looks more promising!).


#3

Hmm. I didn’t realize that by would be called for each comparison. I guess I expected that it would be applied to each element once before sorting. I think that’s how python’s sorted(my_lines, key = ...) works which would explain why it seemed faster originally.

Running this as:

the_ids = map(split_int_id, my_file_lines)
sort_by_id = sortperm(the_ids)
sorted_lines = my_file_lines[sort_by_id]

Does seem to work and gets better performance. Thanks for the clarification.