Renaming IDs in a GFF3

Thank you @kevbonham, I agree! I’ll happily answer any questions about GenomicAnnotations, but it’s very difficult when the target keeps changing, or when the questions aren’t actually about the package.

@mictadlo, if you want to change the name of each chromosome you can iterate over the chromosomes:

for (i, chr) in enumerate(chrs)
    chr.name =  string(i)
end

Obviously, this assumes that chrs is an Array{Chromosome}, and that the chromosomes are stored in the order that they should be numbered, which you will have to verify yourself. Iterating over the chromosomes is one way to solve your problem with currentID, as well.
Regarding the documentation, I think adding an example section is a good idea so I’ll look into that.

If it increases performance it would certainly be welcome. I haven’t encountered any problems with the current implementation in some time now, though, so I don’t think it has high priority.

Thank you for your solution. I updated the code and it does what I wanted. However, I just wonder whether there would be better to do it:

using GenomicAnnotations
chrs = readgff("/Users/lorencm/chr1-2.augustus.hints_utr.gff3")
currentID = 0
currentChr = ""
for gene in @genes(chrs)
    if currentChr != parent(gene).name
        global currentChr = parent(gene).name
        global currentID = 0
    end
    
    if feature(gene) == :gene
        global currentID += 1
    end

    newID = lpad(currentID, 6, '0')
    chrID = SubString(parent(gene).name, 7)
    if !ismissing(gene.ID)
        geneIDextension = ""
        splitCurrentID = split(gene.ID,".",limit=2)
        if length(splitCurrentID) == 2
            println(splitCurrentID)
            geneIDextension="."*splitCurrentID[2]
            println(geneIDextension)
        end
        gene.ID = "$chrID$newID$geneIDextension"
        println(gene.ID)
    end
    if !ismissing(gene.Parent)
        println("HeLLO")
        println(gene.Parent)
        println(parent(gene).name)
        geneIDextension = ""
        splitCurrentID = split(gene.Parent,".",limit=2)
        if length(splitCurrentID) == 2
            println(splitCurrentID)
            geneIDextension="."*splitCurrentID[2]
            println(geneIDextension)
        end
        println("$(parent(gene).name)$newID")
        gene.Parent = "$chrID$newID$geneIDextension"
        println(gene.Parent)
        
        
        #println("!!Parent",parent(gene),parent(gene).name)
    end
end
open("updated.gbk", "w") do f
    printgff(f, chrs)
end

Thank you in advance

“Better” is quite subjective, and there are a lot of ways to think about it. Do you mean more performant, more clear, more idiomatic? Sometimes there are style things that just come down to personal preference.

For performance reasons, I think it would be better to wrap that all in a function and avoid all the globals.

Style-wise, I prefer using logging functionality (like @info) instead of print statements. I also like using short-circuit boolean operators instead of short if statements, eg

feature(gene) == :gene && (global currentID += 1)

instead of

if feature(gene) == :gene
    global currentID += 1
end

and use continue instead of a large indented if block, eg

ismissing(gene.Parent) && continue
# stuff...

instead of

if !ismissing(gene.Parent)
    # stuff....
end

though your way is arguably clearer. Keep trying different stuff, and see what you like. In the majority of code that I write, clarity and correctness is far more important than performance or some notion of “proper” code.

2 Likes