I am trying to sort a file that looks like this:

```
sc285168_1 # 660 # 1046 # -1 # ID=285168_1;partial=00;
sc29363_1 # 57 # 887 # 1 # ID=29363_1;partial=00;
sc316197_1 # 17 # 418 # 1 # ID=316197_1;partial=00;
sc273994_1 # 1 # 243 # 1 # ID=273994_1;partial=00;
sc113906_1 # 436 # 1314 # 1 # ID=113906_1;partial=00;
```
I want to sort by the “ID” field: first by the number before the underscore, then by the number after it (so “285168_2” sorts after “285168_1” but before “285169_1”).
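Put differently, comparing the IDs as pairs of integers should behave like this (Julia tuples compare lexicographically, which is exactly the order I want):

```julia
# Lexicographic comparison of (scaffold, gene) integer tuples:
@assert (285168, 1) < (285168, 2) < (285169, 1)
```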
I tried to define a key function like this:

```julia
function split_int_id(x)
    # "… # ID=285168_1;partial=00;" -> "285168_1"
    id_field = split(split(x, '=')[2], ';')[1]
    id_scaff, id_num = split(id_field, '_')
    return Pair(parse(Int, id_scaff), parse(Int, id_num))
end
```
But sorting with

```julia
sorted_lines = sort(in_read, by = split_int_id)
```

is quite slow, about 6x slower than an equivalent Python script. Suspecting memory allocation from all the intermediate arrays `split` creates, I also tried a version that avoids `split`, but it doesn't do much better:
```julia
function index_int_id(x)
    id_start = findfirst(y -> y == '=', x) + 1
    id_end = findfirst(y -> y == ';', x) - 1
    sub_string = x[id_start:id_end]
    find_under = findfirst(y -> y == '_', sub_string)
    id_scaff = parse(Int64, sub_string[1:(find_under - 1)])
    id_gene = parse(Int64, sub_string[(find_under + 1):end])
    return Pair(id_scaff, id_gene)
end
```
Could someone help me get a performance boost? This seems like a fairly conventional use of sort-by.
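For reference, here is a self-contained sketch of what I'm running; the sample lines are made up to match the file format above, and `split_int_id` is the key function from earlier:

```julia
# Made-up sample lines in the same "# ID=<scaff>_<gene>;" format as the file.
lines = [
    "sc285168_1 # 660 # 1046 # -1 # ID=285168_2;partial=00;",
    "sc285168_1 # 660 # 1046 # -1 # ID=285168_1;partial=00;",
    "sc285169_1 # 57 # 887 # 1 # ID=285169_1;partial=00;",
]

# Extract the (scaffold, gene) pair from the ID field,
# e.g. "… ID=285168_1;…" -> 285168 => 1.
function split_int_id(x)
    id_field = split(split(x, '=')[2], ';')[1]
    id_scaff, id_num = split(id_field, '_')
    return Pair(parse(Int, id_scaff), parse(Int, id_num))
end

# Sort the lines by their integer ID pair (Pairs compare lexicographically).
sorted_lines = sort(lines, by = split_int_id)
```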