Thank you. How is it possible to replace the current ID (g4510) with NbV1Ch08004510, NbV1Ch08004510.t1, NbV1Ch08004510t1.3UTR1, …?
NbV1Ch08_g4510_004510
NbV1Ch08_g4510.t1_004510
NbV1Ch08_g4510.t1.3UTR1_004510
NbV1Ch08_g4510.t1.exon1_004510
NbV1Ch08_g4510.t1.CDS1_004510
NbV1Ch08_g4510.t1.CDS2_004510
NbV1Ch08_g4510.t1.exon2_004510
NbV1Ch08_g4510.t1.5UTR1_004510
Thank you in advance,
Remove _$(gene.ID)_
from the println
line.
1 Like
Sorry, I have forgotten to mention that I would like to write the new ID into a new Gff3 file. Could you please show me how to replace the old ID with the new one and write the changes to a new GFF3 file?
Thank you in advance,
using GenomicAnnotations
chrs = readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
currentID = 0
for gene in @genes(chrs, !ismissing(:ID))
if feature(gene) == :gene
global currentID += 1
end
newID = lpad(currentID, 6, '0')
gene.ID = "$(parent(gene).name)$newID"
end
printgff("newIDs.gff", chrs)
1 Like
Thank you but newIDs.gff
still contain the old IDs.
What could cause it?
Thank you in advance,
Not when I run it exactly as I wrote it in my last post. The important part is this line:
gene.ID = "$(parent(gene).name)$newID"
Are you sure that you ran it?
1 Like
I copied your code from your previous post and it appears it did not change the IDs. I am not quite sure what could go wrong?
When I run the code I posted, newIDs.gff contains this (first 5 lines):
##gff-version 3
NbV1Ch08 AUGUSTUS gene 7015 29794 0.01 - . ID=NbV1Ch08000001
NbV1Ch08 AUGUSTUS mRNA 7015 29794 0.01 - . ID=NbV1Ch08000001;Parent=g1
NbV1Ch08 AUGUSTUS transcription_end_site 7015 7015 . - . Parent=g1.t1
NbV1Ch08 AUGUSTUS three_prime_utr 7015 8531 0.2 - . ID=NbV1Ch08000001;Parent=g1.t1
If you don’t get this my best guess is that you aren’t running the previous version where it said println(...)
instead of gene.ID = ...
.
1 Like
Thank you, I closed everything and reopen and new IDs appeared in the new GFF3 file. However, I noticed that e.g. Parent=g1
has not been updated to Parent=NbV1Ch08000001
. How is it possible to change Parent ID?
Thank you in advance,
Same way as ID:
for ...
...
gene.Parent = "$(parent(gene).name)$newID"
end
1 Like
Unfortunately, it adds a Parent ID to gene feature which did not contain a Parent before. The other problem is that it missed to change it for the below features:
- transcription_end_site
- transcription_start_site
- stop_codon
- start_codon
- intron
How would it possible to fix it?
Thank you in advance,
The features you mention don’t have IDs, so they were skipped in the for loop. Try this instead:
using GenomicAnnotations
chrs = readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
currentID = 0
for gene in @genes(chrs)
if feature(gene) == :gene
global currentID += 1
end
newID = lpad(currentID, 6, '0')
if !ismissing(gene.ID)
gene.ID = "$(parent(gene).name)$newID"
end
if !ismissing(gene.Parent)
gene.Parent = "$(parent(gene).name)$newID"
end
end
printgff("newIDs.gff", chrs)
Thank you. Unfortunately,println("$(parent(gene).ID)")
did not work to change chromosome name to a single digit in the following way:
NbV1Ch01 change to 1
NbV1Ch02 change to 2
NbV1Ch03 change to 3
NbV1Ch04 change to 4
NbV1Ch05 change to 5
NbV1Ch06 change to 6
NbV1Ch07 change to 7
NbV1Ch08 change to 8
NbV1Ch09 change to 9
NbV1Ch10 change to 10
Does GenomicAnnotations
keep an index count of all available chromosomes?
Update
How is it possible to reset currentID
when for each new chromosome?
Thank you in advance,
Chromosome names are accessed with chr.name
. You can change the names separately for each chromosome.
Thank you, How is it possible to reset currentID
when for each new chromosome?
NbV1Ch01 AUGUSTUS gene 126794454 126798741 0.63 - . ID=NbV1Ch01000001
...
NbV1Ch02 AUGUSTUS gene 20516 31934 0.15 + . ID=NbV1Ch02000001
...
NbV1Ch02 AUGUSTUS gene 39413 43534 0.02 - . ID=NbV1Ch02000002
...
Thank you in advance,
@mictadlo while I’m sure your inquiries are will meant, this thread now has 15 or so separate questions - from the outside, it looks a lot like you’re asking @kdyrhage to do your project for you. They’ve gamely responded to all of your questions, but I’m not seeing any evidence that you’re attempting to solve your problems in your own before asking for the next thing.
Perhaps it’s none of my business, but open source communities depend on people volunteering their time, and the longevity of a community being willing to help users with problems depends on those users making a good faith effort to solve their own problems first.
I’m all for asking for help - I’m here and on slack all the time asking for help, and I encourage you to continue to use this forum when you run into a problem you can’t figure out, but this thread is getting a bit out of hand. At the very least, please mark your original question as answered and start a new thread if you have an additional problem, show what you’ve done to try to solve it on your own, and where else you’ve looked. If you don’t know where to look or how to approach solving problems on your own, that’s work asking too - that’s a skill that we’ve all had to learn at some point!
2 Likes
To receiving so quick and awesome help I am very thankful. Furthermore, I understand @kevbonham’s view, and I’m sorry not to post my fail attempts to get a particular code running. For example, I tried to reset currentID
variable in the following way:
currentID = 0
for gene in @genes(chrs)
if feature(gene) == :gene
global currentID += 1
else
global currentID = 0
end
Unfortunately, the above code reset currentID
variable for each gene rather than when a new chromosome starts:
NbV1Ch01 AUGUSTUS gene 126794454 126798741 0.63 - . ID=NbV1Ch01000001
NbV1Ch01 AUGUSTUS mRNA 126794454 126798741 0.33 - . ID=NbV1Ch01000000;Parent=NbV1Ch01000000
NbV1Ch01 AUGUSTUS transcription_end_site 126794454 126794454 . - . Parent=NbV1Ch01000000
NbV1Ch01 AUGUSTUS three_prime_utr 126794454 126795005 0.37 - . ID=NbV1Ch01000000;Parent=NbV1Ch01000000
NbV1Ch01 AUGUSTUS exon 126794454 126795005 . - . ID=NbV1Ch01000000;Parent=NbV1Ch01000000
...
NbV1Ch02 AUGUSTUS gene 20516 31934 0.15 + . ID=NbV1Ch02000001
NbV1Ch02 AUGUSTUS mRNA 20516 31934 0.15 + . ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02 AUGUSTUS transcription_start_site 20516 20516 . + . Parent=NbV1Ch02000000
NbV1Ch02 AUGUSTUS five_prime_utr 20516 20532 0.39 + . ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02 AUGUSTUS exon 20516 20574 . + . ID=NbV1Ch02000000;Parent=NbV1Ch02000000
...
NbV1Ch02 AUGUSTUS gene 39413 43534 0.02 - . ID=NbV1Ch02000001
NbV1Ch02 AUGUSTUS mRNA 39413 43534 0.02 - . ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02 AUGUSTUS transcription_end_site 39413 39413 . - . Parent=NbV1Ch02000000
NbV1Ch02 AUGUSTUS three_prime_utr 39413 39477 0.57 - . ID=NbV1Ch02000000;Parent=NbV1Ch02000000
NbV1Ch02 AUGUSTUS exon 39413 40064 . - . ID=NbV1Ch02000000;Parent=NbV1Ch02000000
...
Maybe, the awesome solutions which I recieved could be used to add to GenomicAnnotations.jl
’s documentation?
I’m guessing that that if you’re up to making a PR to add them, it would be welcome.
What have you tried to solve or get more info about the problem? Have you used print statements to check what the values of gene
, feature(gene)
, and currentID
are on each loop? Often when you have these kinds of problems, stepping through a couple runs of your loop and checking the values will reveal the problem.
I see you didn’t use the FSM approach that other BioJulia parsers use.
To be fair it is difficult at first. I haven’t bothered framing my GFA parsing and stuff using Automa yet since it’s not a priority for me. But I know how to do this with Automa if a FSM adaptation of GenomicAnnotations parser is desirable.