Renaming IDs in a GFF3

mictadlo · January 23, 2020, 8:15am

Hi,
are there any examples of renaming IDs in a GFF3 file?

Thank you in advance,

kdyrhage · January 23, 2020, 10:36am

If you can convert it to GenBank format you can do it easily with GenomicAnnotations.jl. Adding GFF support is on my todo-list, but as I have no use for it myself it has low priority.

kevbonham · January 23, 2020, 1:49pm

Woah, when did a gbk parser get added? I’ve been meaning to do this for like 3 years. Is it written in Julia or wrapping something else?

kdyrhage · January 23, 2020, 2:25pm

It’s pure Julia. I think I wrote the first version almost three years ago now! It’s only been public for about a year I think, though. I guess I haven’t advertised* it very well. The next version will be in BioJuliaRegistry instead of General, which should make it easier to find.

(*speaking of which, the whole reason I started GenomicAnnotations.jl was to make GenomicMaps.jl)

I wrote a quick parser for GFF files (which, surprise surprise, was much easier than for GenBank files), and it seems to work just fine. Unless I run into any problems I’ll add a way to write to GFF and upload it.

kevbonham · January 23, 2020, 2:29pm

Neat! I see you didn’t use the FSM approach that other BioJulia parsers use. That probably makes sense (that’s what I tinkered with and it was rough), but have you benchmarked it against parsers in other languages?

There’s a guy on a forum I’m part of who always brings up the need for a GenBank parser before he’ll even consider trying a language out. He suggested testing against the suite that BioPython uses. Let me know if you’re interested in a PR to add those tests, I’d love to be able to go to him with this in hand.

kdyrhage · January 23, 2020, 2:37pm

I haven’t compared it to other parsers. I try to optimise things when I notice that something is performing poorly, but I only work with bacterial genomes so my needs aren’t that great.

I tried rewriting the parser using Automa.jl at some point, but once it got too complex it just wouldn’t compile anymore so I gave up on that idea.

Absolutely, any form of contribution is welcome!

kdyrhage · January 23, 2020, 3:55pm

I have now added a GFF parser. It worked for the file I tested it on, at least. Currently it’s on the branch “parsegff”, so you can install it with:

(v1.3) pkg> add GenomicAnnotations#parsegff

Use readgff(filepath) to read the file, and printgff(filepath, annotations) to write them to a file after modifying the IDs. The documentation for GenomicAnnotations explains how to modify the data.

mictadlo · January 24, 2020, 3:24am

Thank you for such a quick implementation. Unfortunately, I ran into this problem:

julia> using GenomicAnnotations

julia> chr=readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
ERROR: BoundsError: attempt to access 1-element Array{SubString{String},1} at index [2]
Stacktrace:
 [1] indexed_iterate at ./array.jl:744 [inlined]
 [2] parsechromosome_gff(::Array{String,1}, ::Type) at /Users/lorencm/.julia/packages/GenomicAnnotations/Y7qTk/src/readgbk.jl:261
 [3] #readgff#33(::Bool, ::typeof(readgff), ::IOStream, ::Type) at /Users/lorencm/.julia/packages/GenomicAnnotations/Y7qTk/src/readgbk.jl:316
 [4] readgff at /Users/lorencm/.julia/packages/GenomicAnnotations/Y7qTk/src/readgbk.jl:300 [inlined]
 [5] #30 at /Users/lorencm/.julia/packages/GenomicAnnotations/Y7qTk/src/readgbk.jl:295 [inlined]
 [6] #open#271(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(open), ::GenomicAnnotations.var"#30#32"{DataType}, ::String) at ./io.jl:298
 [7] open(::Function, ::String) at ./io.jl:296
 [8] #readgff#28(::Bool, ::typeof(readgff), ::String, ::Type) at /Users/lorencm/.julia/packages/GenomicAnnotations/Y7qTk/src/readgbk.jl:295
 [9] readgff at /Users/lorencm/.julia/packages/GenomicAnnotations/Y7qTk/src/readgbk.jl:291 [inlined] (repeats 2 times)
 [10] top-level scope at none:0

I used Braker2’s GFF3 file:

NbV1Ch08    AUGUSTUS    gene    7015    29794   0.01    -   .   ID=g1;
NbV1Ch08    AUGUSTUS    mRNA    7015    29794   0.01    -   .   ID=g1.t1;Parent=g1
NbV1Ch08    AUGUSTUS    transcription_end_site  7015    7015    .   -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    three_prime_utr 7015    8531    0.2 -   .   ID=g1.t1.3UTR1;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    7015    8747    .   -   .   ID=g1.t1.exon1;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    stop_codon  8532    8534    .   -   0   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 8532    8747    0.31    -   0   ID=g1.t1.CDS1;Parent=g1.t1
NbV1Ch08    AUGUSTUS    intron  8748    9191    0.49    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 9192    9342    0.66    -   1   ID=g1.t1.CDS2;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    9192    9342    .   -   .   ID=g1.t1.exon2;Parent=g1.t1;
bash-3.2$ head -n 100 /Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3
NbV1Ch08    AUGUSTUS    gene    7015    29794   0.01    -   .   ID=g1;
NbV1Ch08    AUGUSTUS    mRNA    7015    29794   0.01    -   .   ID=g1.t1;Parent=g1
NbV1Ch08    AUGUSTUS    transcription_end_site  7015    7015    .   -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    three_prime_utr 7015    8531    0.2 -   .   ID=g1.t1.3UTR1;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    7015    8747    .   -   .   ID=g1.t1.exon1;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    stop_codon  8532    8534    .   -   0   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 8532    8747    0.31    -   0   ID=g1.t1.CDS1;Parent=g1.t1
NbV1Ch08    AUGUSTUS    intron  8748    9191    0.49    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 9192    9342    0.66    -   1   ID=g1.t1.CDS2;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    9192    9342    .   -   .   ID=g1.t1.exon2;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  9343    9915    0.58    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 9916    10006   0.71    -   2   ID=g1.t1.CDS3;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    9916    10006   .   -   .   ID=g1.t1.exon3;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  10007   10101   0.74    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 10102   10201   0.78    -   0   ID=g1.t1.CDS4;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    10102   10201   .   -   .   ID=g1.t1.exon4;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  10202   10712   0.8 -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 10713   11107   0.11    -   2   ID=g1.t1.CDS5;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    10713   11107   .   -   .   ID=g1.t1.exon5;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  11108   11569   0.07    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 11570   12151   0.09    -   2   ID=g1.t1.CDS6;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    11570   12151   .   -   .   ID=g1.t1.exon6;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  12152   12588   0.34    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 12589   12717   0.39    -   2   ID=g1.t1.CDS7;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    12589   12717   .   -   .   ID=g1.t1.exon7;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  12718   12789   0.42    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 12790   13075   0.39    -   0   ID=g1.t1.CDS8;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    12790   13075   .   -   .   ID=g1.t1.exon8;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  13076   14832   0.51    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 14833   15009   0.39    -   0   ID=g1.t1.CDS9;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    14833   15009   .   -   .   ID=g1.t1.exon9;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  15010   15278   0.59    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 15279   15415   0.56    -   2   ID=g1.t1.CDS10;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    15279   15415   .   -   .   ID=g1.t1.exon10;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  15416   15487   0.58    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 15488   15612   0.96    -   1   ID=g1.t1.CDS11;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    15488   15612   .   -   .   ID=g1.t1.exon11;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    intron  15613   15706   0.96    -   .   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    CDS 15707   15957   0.98    -   0   ID=g1.t1.CDS12;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    15707   15958   .   -   .   ID=g1.t1.exon12;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    start_codon 15955   15957   .   -   0   Parent=g1.t1;
NbV1Ch08    AUGUSTUS    five_prime_utr  15958   15958   0.99    -   .   ID=g1.t1.5UTR1;Parent=g1.t1
NbV1Ch08    AUGUSTUS    five_prime_utr  27458   28250   0.37    -   .   ID=g1.t1.5UTR2;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    27458   28250   .   -   .   ID=g1.t1.exon13;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    five_prime_utr  29272   29794   0.08    -   .   ID=g1.t1.5UTR3;Parent=g1.t1
NbV1Ch08    AUGUSTUS    exon    29272   29794   .   -   .   ID=g1.t1.exon14;Parent=g1.t1;
NbV1Ch08    AUGUSTUS    transcription_start_site    29794   29794   .   -   .   Parent=g1.t1;

What did I miss?

Thank you in advance,

kdyrhage · January 24, 2020, 8:28am

I don’t know whether it’s against the GFF3 specifications or not, but the problem stems from the trailing semicolons. Either way I added a fix. After updating, you should be able to change the IDs with something like:

using GenomicAnnotations
chrs = readgff(filepath)
for (i, gene) in enumerate(@genes(chrs, !ismissing(:ID)))
    gene.ID = "newid_$(string(i, pad=4))"
end
printgff(newfilepath, chrs)

It doesn’t matter for the parser, but a GFF3 file is supposed to start with a header line specifying the GFF version (##gff-version 3), so strictly speaking this file does not follow the specification.
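
For anyone curious why a trailing semicolon can trip up a simple parser, here is a minimal sketch in plain Julia. It is not the code from GenomicAnnotations.jl; parse_attributes is a hypothetical helper made up for illustration. Splitting "ID=g1;" on ';' leaves an empty last field, and splitting that on '=' gives a one-element array, which would be consistent with the BoundsError in the stack trace above. The sketch simply drops empty fields before splitting on '='.

```julia
# Hypothetical helper, NOT part of GenomicAnnotations.jl: parse the ninth
# GFF3 column (e.g. "ID=g1.t1.exon1;Parent=g1.t1;") into key => value pairs.
function parse_attributes(column::AbstractString)
    attributes = Pair{String,String}[]
    # keepempty=false drops the empty field produced by a trailing ';'
    for field in split(column, ';'; keepempty=false)
        parts = split(field, '='; limit=2)
        # Skip fields without '=' rather than indexing parts[2] blindly
        length(parts) == 2 || continue
        push!(attributes, String(parts[1]) => String(parts[2]))
    end
    return attributes
end

# Lines with and without the trailing ';' give the same result
with_semicolon = parse_attributes("ID=g1.t1.exon1;Parent=g1.t1;")
without_semicolon = parse_attributes("ID=g1.t1.exon1;Parent=g1.t1")
println(with_semicolon == without_semicolon)
```

Using limit=2 keeps any further '=' characters inside the value instead of splitting on them.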

mictadlo · January 24, 2020, 8:48am

Thank you, but while updating I got an error:

(v1.3) pkg> update GenomicAnnotations#parsegff
ERROR: invalid token

What did I miss?
Thank you in advance,

kdyrhage · January 24, 2020, 8:52am

I think just

(v1.3) pkg> up

should work. If not, try:

(v1.3) pkg> rm GenomicAnnotations
(v1.3) pkg> add GenomicAnnotations#parsegff

mictadlo · January 24, 2020, 9:27am

Thanks, up worked, but now I get a new error:

julia> using GenomicAnnotations
[ Info: Precompiling GenomicAnnotations [4f8a0a0a-376d-5ac0-ab14-e88793df67f0]

julia> chr=readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
1-element Array{Chromosome{Gene},1}:
 Chromosome 'NbV1Ch08' (0 bp) with 136090 annotations

julia> print(chr)
Chromosome{Gene}[Chromosome 'NbV1Ch08' (0 bp) with 136090 annotations
]
julia> for gene in chr.genes
           print(gene)
       end
ERROR: type Array has no field genes
Stacktrace:
[1] getproperty(::Array{Chromosome{Gene},1}, ::Symbol) at ./Base.jl:20
[2] top-level scope at ./REPL[13]:1

What did I miss?

kdyrhage · January 24, 2020, 9:45am

GBK/GFF files can contain multiple chromosomes, so readgff returns an array of Chromosomes. There are multiple ways you can deal with this, so pick the one that suits you best. You can:

1. use the macro @genes(chrs) to iterate over the genes from all chromosomes:

for gene in @genes(chrs)
    ...
end

2. iterate over the chromosomes separately:

for chr in chrs
    for gene in chr.genes
        ...
    end
end

3. store only one chromosome in chr:

chr = readgff(filepath)[1]
for gene in chr.genes
    ...
end

I recommend option 1, or option 3 if your file only contains one chromosome (@genes works on individual Chromosomes or arrays of Chromosomes, so you can combine the two).

mictadlo · January 24, 2020, 10:46am

Thank you. I tried that, but I get unexpected output:

NbV1Ch08_missing
NbV1Ch08_missing
NbV1Ch08_missing
NbV1Ch08_missing
NbV1Ch08_missing
NbV1Ch08_missing

with the below code:

using GenomicAnnotations

chrs=readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")

for chr in chrs
    for gene in chr.genes
        println("$(chr.name)_$(gene.id)")
    end
end

How can I access the gene’s ids?

Thank you in advance,

kdyrhage · January 24, 2020, 10:55am

Attribute names are case-sensitive, so use gene.ID, not gene.id. Judging by the partial file you posted, not all entries have an ID, so I recommend using the version with @genes that I posted earlier:

for gene in @genes(chr, !ismissing(:ID))
    println("$(chr.name)_$(gene.ID)")
end

This will iterate over only the entries that do have an ID. Without the filter, gene.ID will again return missing for entries that lack an ID.
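
To show where the "_missing" in the output comes from, here is a small sketch in plain Julia. The entries are stand-in named tuples made up for illustration, not GenomicAnnotations objects, so it only demonstrates how missing behaves in string interpolation and how filtering with ismissing avoids it.

```julia
# Stand-in records (hypothetical); the feature types mirror the posted file,
# where stop_codon lines carry only a Parent attribute and no ID.
entries = [(feature = :gene, ID = "g1"),
           (feature = :stop_codon, ID = missing),
           (feature = :exon, ID = "g1.t1.exon1")]

# Interpolating `missing` into a string prints the word "missing"
labels = ["NbV1Ch08_$(e.ID)" for e in entries]

# Filtering first keeps only the entries that actually carry an ID
with_id = [e for e in entries if !ismissing(e.ID)]

foreach(println, labels)
println(length(with_id), " of ", length(entries), " entries have an ID")
```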

mictadlo · January 24, 2020, 12:09pm

Thank you but I ran into a new error:

julia> chr=readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
1-element Array{Chromosome{Gene},1}:
 Chromosome 'NbV1Ch08' (0 bp) with 136090 annotations


julia> for gene in @genes(chr, !ismissing(:ID))
           println("$(chr.name)_$(gene.ID)")
       end
ERROR: type Array has no field name
Stacktrace:
 [1] getproperty(::Array{Chromosome{Gene},1}, ::Symbol) at ./Base.jl:20
 [2] top-level scope at /Users/lorencm/.julia/packages/GenomicAnnotations/4kJOh/src/macro.jl:2

Please find here a GFF3 for one chromosome.

Thank you in advance

kdyrhage · January 24, 2020, 12:22pm

Your chr is still an Array{Chromosome}. In this case there is only one chromosome, so you can access the name with chr[1].name. For a more general solution, use parent(gene).name. parent(gene::Gene) returns the Chromosome that contains gene.
The following example will work for a GFF file with any number of chromosomes:

using GenomicAnnotations
chrs = readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
for gene in @genes(chrs, !ismissing(:ID))
    println("$(parent(gene).name)_$(gene.ID)")
end

mictadlo · January 24, 2020, 1:23pm

Thank you. I added a counter, but it gives every feature a new ID rather than every gene.

NbV1Ch08_g5742.t1_078390
NbV1Ch08_g5742.t1.3UTR1_078391
NbV1Ch08_g5742.t1.exon1_078392
NbV1Ch08_g5742.t1.CDS1_078393
NbV1Ch08_g5742.t1.CDS2_078394
NbV1Ch08_g5742.t1.exon2_078395
NbV1Ch08_g5742.t1.CDS3_078396
NbV1Ch08_g5742.t1.exon3_078397
NbV1Ch08_g5742.t1.CDS4_078398
NbV1Ch08_g5742.t1.exon4_078399
NbV1Ch08_g5742.t1.CDS5_078400
NbV1Ch08_g5742.t1.exon5_078401
NbV1Ch08_g5742.t1.CDS6_078402
NbV1Ch08_g5742.t1.exon6_078403
NbV1Ch08_g5742.t1.CDS7_078404
NbV1Ch08_g5742.t1.exon7_078405
NbV1Ch08_g5742.t1.CDS8_078406
NbV1Ch08_g5742.t1.exon8_078407
NbV1Ch08_g5742.t1.CDS9_078408
NbV1Ch08_g5742.t1.exon9_078409
NbV1Ch08_g5742.t1.CDS10_078410
NbV1Ch08_g5742.t1.exon10_078411
NbV1Ch08_g5742.t1.CDS11_078412
NbV1Ch08_g5742.t1.exon11_078413
NbV1Ch08_g5742.t1.CDS12_078414
NbV1Ch08_g5742.t1.exon12_078415
NbV1Ch08_g5742.t1.CDS13_078416
NbV1Ch08_g5742.t1.exon13_078417
NbV1Ch08_g5742.t1.CDS14_078418
NbV1Ch08_g5742.t1.exon14_078419
NbV1Ch08_g5742.t1.CDS15_078420
NbV1Ch08_g5742.t1.exon15_078421
NbV1Ch08_g5742.t1.5UTR1_078422
NbV1Ch08_g5742.t2_078423
NbV1Ch08_g5742.t2.3UTR1_078424
NbV1Ch08_g5742.t2.exon1_078425
NbV1Ch08_g5742.t2.CDS1_078426
NbV1Ch08_g5742.t2.CDS2_078427
NbV1Ch08_g5742.t2.exon2_078428
NbV1Ch08_g5742.t2.CDS3_078429
NbV1Ch08_g5742.t2.exon3_078430
NbV1Ch08_g5742.t2.CDS4_078431
NbV1Ch08_g5742.t2.exon4_078432
NbV1Ch08_g5742.t2.CDS5_078433
NbV1Ch08_g5742.t2.exon5_078434
NbV1Ch08_g5742.t2.CDS6_078435
NbV1Ch08_g5742.t2.exon6_078436
NbV1Ch08_g5742.t2.5UTR1_078437

I would have expected to see:

NbV1Ch08_g5742.t1_078390
NbV1Ch08_g5742.t1.3UTR1_078390
NbV1Ch08_g5742.t1.exon1_078390
NbV1Ch08_g5742.t1.CDS1_078390
...

Here is the updated code:

using GenomicAnnotations
chrs = readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
for (count, gene) in enumerate(@genes(chrs, !ismissing(:ID)))
    newID = lpad(count, 6, '0')
    println("$(parent(gene).name)_$(gene.ID)_$(newID)")
    
end

Is there a better way to implement a counter?

Thank you in advance,

kdyrhage · January 24, 2020, 1:32pm

Something like this?

using GenomicAnnotations
chrs = readgff("/Users/lorencm/projects/bioinf-scripts/data/NbV1Ch08-augustus.hints_utr.gff3")
currentID = 0
for gene in @genes(chrs, !ismissing(:ID))
    if feature(gene) == :gene
        global currentID += 1
    end
    newID = lpad(currentID, 6, '0')
    println("$(parent(gene).name)_$(gene.ID)_$newID")
end
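
The loop above relies on each gene row coming before its sub-features. As an alternative, here is a rough sketch in plain Julia that derives the gene from the ID itself. It assumes AUGUSTUS-style IDs where everything before the first '.' names the gene (g1, g1.t1, g1.t1.exon1); label_by_gene and the g2 IDs are made up for illustration and are not part of GenomicAnnotations.jl.

```julia
# Hypothetical helper: number genes in order of first appearance and reuse
# that number for every feature whose ID starts with the same gene name.
function label_by_gene(ids; prefix = "NbV1Ch08")
    gene_numbers = Dict{String,Int}()
    labels = String[]
    for id in ids
        # Assumption: the part before the first '.' is the gene ID
        gene = String(first(split(id, '.')))
        # get! stores the next number only the first time a gene is seen
        number = get!(gene_numbers, gene, length(gene_numbers) + 1)
        push!(labels, "$(prefix)_$(id)_$(lpad(number, 6, '0'))")
    end
    return labels
end

foreach(println, label_by_gene(["g1", "g1.t1", "g1.t1.exon1", "g2", "g2.t1"]))
```

Because the number is looked up by gene name, the result does not depend on the order of the rows, but it does break if the IDs do not follow the gene.transcript.feature pattern.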
