I am trying to sort a file that looks like this:

```
sc285168_1 # 660 # 1046 # -1 # ID=285168_1;partial=00;
sc29363_1 # 57 # 887 # 1 # ID=29363_1;partial=00;
sc316197_1 # 17 # 418 # 1 # ID=316197_1;partial=00;
sc273994_1 # 1 # 243 # 1 # ID=273994_1;partial=00;
sc113906_1 # 436 # 1314 # 1 # ID=113906_1;partial=00;
```
I want to sort by the "ID" field: first by the number before the underscore, then by the number after it (so "285168_2" sorts after "285168_1" but before "285169_1").

I tried defining a key function like this:
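Since integer tuples in Julia compare lexicographically, a `(scaffold, gene)` key already gives exactly this ordering; a quick sanity check:

```julia
# Tuples of Ints compare element by element, left to right,
# which matches the desired (scaffold, gene) ordering:
@assert (285168, 2) > (285168, 1)
@assert (285168, 2) < (285169, 1)
```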
```julia
function split_int_id(x)
    # "…ID=285168_1;partial=00;" -> "285168_1" -> (285168 => 1)
    id_field = split(split(x, '=')[2], ';')[1]
    id_scaff, id_num = split(id_field, '_')
    return Pair(parse(Int, id_scaff), parse(Int, id_num))
end
```
But sorting via `sorted_lines = sort(in_read, by = x -> split_int_id(x))` is quite slow, about 6x slower than a Python equivalent. Suspecting memory allocation from all those intermediate arrays, I also tried a version that avoids `split`, but it doesn't do much better:
```julia
function index_int_id(x)
    # Locate the ID field between '=' and ';', then split at '_'.
    id_start = findfirst(y -> y == '=', x) + 1
    id_end = findfirst(y -> y == ';', x) - 1
    sub_string = x[id_start:id_end]
    find_under = findfirst(y -> y == '_', sub_string)
    id_scaff = parse(Int64, sub_string[1:(find_under - 1)])
    id_gene = parse(Int64, sub_string[(find_under + 1):end])
    return Pair(id_scaff, id_gene)
end
```
Could someone help me get a performance boost? This seems like a fairly conventional use of sort-by.