Renaming IDs in a GFF3

mictadlo · January 24, 2020, 1:46pm

Thank you. How is it possible to replace the current ID (g4510) with NbV1Ch08004510, NbV1Ch08004510.t1, NbV1Ch08004510t1.3UTR1, …?

NbV1Ch08_g4510_004510
NbV1Ch08_g4510.t1_004510
NbV1Ch08_g4510.t1.3UTR1_004510
NbV1Ch08_g4510.t1.exon1_004510
NbV1Ch08_g4510.t1.CDS1_004510
NbV1Ch08_g4510.t1.CDS2_004510
NbV1Ch08_g4510.t1.exon2_004510
NbV1Ch08_g4510.t1.5UTR1_004510

Thank you in advance,

kdyrhage · January 24, 2020, 1:48pm

Remove _$(gene.ID)_ from the println line.

mictadlo · January 24, 2020, 1:53pm

Sorry, I have forgotten to mention that I would like to write the new ID into a new Gff3 file. Could you please show me how to replace the old ID with the new one and write the changes to a new GFF3 file?
Thank you in advance,

kdyrhage · January 24, 2020, 1:58pm

using GenomicAnnotations
chrs = readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
currentID = 0
for gene in @genes(chrs, !ismissing(:ID))
    if feature(gene) == :gene
        global currentID += 1
    end
    newID = lpad(currentID, 6, '0')
    gene.ID = "$(parent(gene).name)$newID"
end
printgff("newIDs.gff", chrs)

mictadlo · January 24, 2020, 2:03pm

Thank you but newIDs.gff still contain the old IDs.

What could cause it?

Thank you in advance,

kdyrhage · January 24, 2020, 2:10pm

Not when I run it exactly as I wrote it in my last post. The important part is this line:

    gene.ID = "$(parent(gene).name)$newID"

Are you sure that you ran it?

mictadlo · January 24, 2020, 2:14pm

I copied your code from your previous post and it appears it did not change the IDs. I am not quite sure what could go wrong?

kdyrhage · January 24, 2020, 2:41pm

When I run the code I posted, newIDs.gff contains this (first 5 lines):

##gff-version 3
NbV1Ch08	AUGUSTUS	gene	7015	29794	0.01	-	.	ID=NbV1Ch08000001
NbV1Ch08	AUGUSTUS	mRNA	7015	29794	0.01	-	.	ID=NbV1Ch08000001;Parent=g1
NbV1Ch08	AUGUSTUS	transcription_end_site	7015	7015	.	-	.	Parent=g1.t1
NbV1Ch08	AUGUSTUS	three_prime_utr	7015	8531	0.2	-	.	ID=NbV1Ch08000001;Parent=g1.t1

If you don’t get this my best guess is that you aren’t running the previous version where it said println(...) instead of gene.ID = ....

mictadlo · January 24, 2020, 2:56pm

Thank you, I closed everything and reopen and new IDs appeared in the new GFF3 file. However, I noticed that e.g. Parent=g1 has not been updated to Parent=NbV1Ch08000001. How is it possible to change Parent ID?

Thank you in advance,

kdyrhage · January 24, 2020, 3:01pm

Same way as ID:

for ...
    ...
    gene.Parent = "$(parent(gene).name)$newID"
end

mictadlo · January 24, 2020, 3:09pm

Unfortunately, it adds a Parent ID to gene feature which did not contain a Parent before. The other problem is that it missed to change it for the below features:

transcription_end_site
transcription_start_site
stop_codon
start_codon
intron

How would it possible to fix it?

Thank you in advance,

kdyrhage · January 24, 2020, 4:16pm

The features you mention don’t have IDs, so they were skipped in the for loop. Try this instead:

using GenomicAnnotations
chrs = readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
currentID = 0
for gene in @genes(chrs)
    if feature(gene) == :gene
        global currentID += 1
    end
    newID = lpad(currentID, 6, '0')
    if !ismissing(gene.ID)
        gene.ID = "$(parent(gene).name)$newID"
    end
    if !ismissing(gene.Parent)
        gene.Parent = "$(parent(gene).name)$newID"
    end
end
printgff("newIDs.gff", chrs)

mictadlo · January 24, 2020, 10:27pm

Thank you. Unfortunately,println("$(parent(gene).ID)") did not work to change chromosome name to a single digit in the following way:

NbV1Ch01 change to 1
NbV1Ch02 change to 2
NbV1Ch03 change to 3
NbV1Ch04 change to 4
NbV1Ch05 change to 5
NbV1Ch06 change to 6
NbV1Ch07 change to 7
NbV1Ch08 change to 8
NbV1Ch09 change to 9
NbV1Ch10 change to 10

Does GenomicAnnotations keep an index count of all available chromosomes?

Update
How is it possible to reset currentID when for each new chromosome?

Thank you in advance,

kdyrhage · January 25, 2020, 3:22pm

Chromosome names are accessed with chr.name. You can change the names separately for each chromosome.

mictadlo · January 25, 2020, 8:34pm

Thank you, How is it possible to reset currentID when for each new chromosome?

NbV1Ch01        AUGUSTUS        gene    126794454       126798741       0.63    -       .       ID=NbV1Ch01000001
...
NbV1Ch02        AUGUSTUS        gene    20516   31934   0.15    +       .       ID=NbV1Ch02000001
...
NbV1Ch02        AUGUSTUS        gene    39413   43534   0.02    -       .       ID=NbV1Ch02000002
...

Thank you in advance,

kevbonham · January 25, 2020, 9:37pm

@mictadlo while I’m sure your inquiries are will meant, this thread now has 15 or so separate questions - from the outside, it looks a lot like you’re asking @kdyrhage to do your project for you. They’ve gamely responded to all of your questions, but I’m not seeing any evidence that you’re attempting to solve your problems in your own before asking for the next thing.

Perhaps it’s none of my business, but open source communities depend on people volunteering their time, and the longevity of a community being willing to help users with problems depends on those users making a good faith effort to solve their own problems first.

I’m all for asking for help - I’m here and on slack all the time asking for help, and I encourage you to continue to use this forum when you run into a problem you can’t figure out, but this thread is getting a bit out of hand. At the very least, please mark your original question as answered and start a new thread if you have an additional problem, show what you’ve done to try to solve it on your own, and where else you’ve looked. If you don’t know where to look or how to approach solving problems on your own, that’s work asking too - that’s a skill that we’ve all had to learn at some point!

mictadlo · January 25, 2020, 11:12pm

To receiving so quick and awesome help I am very thankful. Furthermore, I understand @kevbonham’s view, and I’m sorry not to post my fail attempts to get a particular code running. For example, I tried to reset currentID variable in the following way:

currentID = 0
for gene in @genes(chrs)
    if feature(gene) == :gene
        global currentID += 1
    else
        global currentID = 0
    end

Unfortunately, the above code reset currentID variable for each gene rather than when a new chromosome starts:

NbV1Ch01        AUGUSTUS        gene    126794454       126798741       0.63    -       .       ID=NbV1Ch01000001
NbV1Ch01        AUGUSTUS        mRNA    126794454       126798741       0.33    -       .       ID=NbV1Ch01000000;Parent=NbV1Ch01000000
NbV1Ch01        AUGUSTUS        transcription_end_site  126794454       126794454       .       -       .       Parent=NbV1Ch01000000
NbV1Ch01        AUGUSTUS        three_prime_utr 126794454       126795005       0.37    -       .       ID=NbV1Ch01000000;Parent=NbV1Ch01000000
NbV1Ch01        AUGUSTUS        exon    126794454       126795005       .       -       .       ID=NbV1Ch01000000;Parent=NbV1Ch01000000
...
NbV1Ch02        AUGUSTUS        gene    20516   31934   0.15    +       .       ID=NbV1Ch02000001
NbV1Ch02        AUGUSTUS        mRNA    20516   31934   0.15    +       .       ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02        AUGUSTUS        transcription_start_site        20516   20516   .       +       .       Parent=NbV1Ch02000000
NbV1Ch02        AUGUSTUS        five_prime_utr  20516   20532   0.39    +       .       ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02        AUGUSTUS        exon    20516   20574   .       +       .       ID=NbV1Ch02000000;Parent=NbV1Ch02000000
...
NbV1Ch02        AUGUSTUS        gene    39413   43534   0.02    -       .       ID=NbV1Ch02000001
NbV1Ch02        AUGUSTUS        mRNA    39413   43534   0.02    -       .       ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02        AUGUSTUS        transcription_end_site  39413   39413   .       -       .       Parent=NbV1Ch02000000
NbV1Ch02        AUGUSTUS        three_prime_utr 39413   39477   0.57    -       .       ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02        AUGUSTUS        exon    39413   40064   .       -       .       ID=NbV1Ch02000000;Parent=NbV1Ch02000000
...

Maybe, the awesome solutions which I recieved could be used to add to GenomicAnnotations.jl’s documentation?

kevbonham · January 26, 2020, 1:59pm

I’m guessing that that if you’re up to making a PR to add them, it would be welcome.

kevbonham · January 26, 2020, 5:10pm

What have you tried to solve or get more info about the problem? Have you used print statements to check what the values of gene, feature(gene), and currentID are on each loop? Often when you have these kinds of problems, stepping through a couple runs of your loop and checking the values will reveal the problem.

Ward9250 · January 27, 2020, 1:32am

I see you didn’t use the FSM approach that other BioJulia parsers use.

To be fair it is difficult at first. I haven’t bothered framing my GFA parsing and stuff using Automa yet since it’s not a priority for me. But I know how to do this with Automa if a FSM adaptation of GenomicAnnotations parser is desirable.

Topic		Replies	Views
Current state of PopGen.jl Biology, Health, and Medicine	18	1458	September 23, 2020
Bioinitio Project? Biology, Health, and Medicine proposal	1	610	October 8, 2019
Bioinformatics.jl - Helper functions for bioinformatics and rosalind.info problems Biology, Health, and Medicine	1	882	August 13, 2019
ANN - Towards a (Bio)Julia powered Genome Graphs framework Biology, Health, and Medicine announcement , graphs	25	4265	September 24, 2022
Interest in RNA-seq specific convenience package based on BioJulia? Biology, Health, and Medicine	16	1395	July 17, 2023

Renaming IDs in a GFF3

Related topics