Does Bio support the EMBL and/or IMGT/HLA formats?

zimoun · March 30, 2019, 1:23pm

Dear,

I cannot find in the docs whether the Bio metapackage supports the EMBL format and/or the specific variant used by the HLA database [1] (hla.dat).

The wiki [2] mentions EMBL, but is it accurate?

Thank you in advance for any pointers or comments.

All the best,
simon

[1] IPD-IMGT/HLA Database
[2] core features · BioJulia/Bio.jl Wiki · GitHub

jgreener64 · April 1, 2019, 10:30am

If it did, it would probably be via the BioSequences.jl package, but I don’t think it’s implemented.

The wiki is pretty out of date now - thankfully, many of the proposed features are now available (though perhaps not the one you want).

jonathanBieler · April 1, 2019, 12:23pm

What data are you trying to open, exactly? It seems they have FASTA files available, which you can open without problems with BioSequences.

ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/

zimoun · April 1, 2019, 1:40pm

Yes, there are FASTA files, but they do not cover my needs: they hold only some of the sequences, not the whole database, and I am interested in the metadata (which exons or introns are sequenced, their coordinates, etc.).

In other words, some alleles are missing from the FASTA files because, for example, their full genomic sequence is not known yet. Consider the incomplete A*01:239, for which only exons 2 and 3 are sequenced: it appears in hla.dat but not in hla_gen.fasta, and from hla_nuc.fasta you cannot deduce the length of exon 2 or exon 3.

Basically, all this information is in this file:
https://media.githubusercontent.com/media/ANHIG/IMGTHLA/Latest/hla.dat
which is a variant of the EMBL format.
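
For readers unfamiliar with the layout: EMBL-style flat files tag every line with a two-letter code (ID, FT, SQ, ...) and close each record with a line starting with `//`. Below is a minimal sketch in plain Julia that groups the lines of each record by code. The helper name `read_records` is hypothetical (it is not part of any BioJulia package), the text is assumed to be ASCII, and the sample record is invented and heavily abbreviated, not copied from hla.dat.

```julia
# Group the lines of each record by their two-letter line code.
# `read_records` is a hypothetical helper name, not a BioJulia API.
function read_records(io::IO)
    records = Vector{Dict{String, Vector{String}}}()
    current = Dict{String, Vector{String}}()
    for line in eachline(io)
        if startswith(line, "//")
            # "//" terminates a record; skip empty records
            isempty(current) || push!(records, current)
            current = Dict{String, Vector{String}}()
        elseif length(line) >= 2
            # Assumes ASCII text, so byte indexing is safe here
            code = line[1:2]
            # The content follows the code and its padding
            push!(get!(current, code, String[]), String(strip(line[3:end])))
        end
    end
    return records
end

# Invented, heavily abbreviated record for illustration (not copied from hla.dat)
sample = """
ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
FT   exon            301..373
FT                   /number="1"
//
"""

records = read_records(IOBuffer(sample))
```

Here `records[1]["ID"]` holds the single ID line and `records[1]["FT"]` the two feature-table lines, which is the part of the file that carries the exon and intron metadata.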

jgreener64 · April 1, 2019, 2:11pm

If you want specific pieces of data then iterating over the lines with eachline and extracting what you want seems like the best way to go.

For example, you could extract exon coordinate data with something like:

exons = Dict{String, Vector{Tuple{Int, Int}}}()

open("hla.dat") do f
    current_id = ""
    for line in eachline(f)
        fields = split(line)
        # ID line: the accession is the first field after "ID", minus its trailing ';'
        if startswith(line, "ID")
            current_id = String(chop(fields[2]))
        # Feature-table lines that declare an exon, e.g. "FT   exon   1..73"
        elseif startswith(line, "FT") && length(fields) >= 3 && fields[2] == "exon"
            # Split the "start..end" location into two integers
            exon_start, exon_end = parse.(Int, split(fields[3], ".."))
            # Create the vector for this ID on first use, then append
            push!(get!(exons, current_id, Tuple{Int, Int}[]), (exon_start, exon_end))
        end
    end
end

Warning: not tested and may not be what you want.
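
A variation on the loop above, again only a sketch: reading from any IO makes the logic easy to check on a small in-memory sample, and a regex lets exon and intron features be collected together. The function name `feature_coords`, the regex, and the sample record are assumptions of mine and not a BioJulia API. The real hla.dat may contain location forms (for example partial locations marked with `<` or `>`) that this pattern silently skips.

```julia
# Collect (feature, start, stop) tuples per record ID from an EMBL-style stream.
# Hypothetical helper: the name and the regex are assumptions, not a BioJulia API.
function feature_coords(io::IO; features = ("exon", "intron"))
    coords = Dict{String, Vector{Tuple{String, Int, Int}}}()
    current_id = ""
    # Matches e.g. "FT   exon            1..73"; other location forms are skipped
    pattern = r"^FT\s+(\S+)\s+(\d+)\.\.(\d+)\s*$"
    for line in eachline(io)
        if startswith(line, "ID")
            # The accession is the first field after "ID", minus its trailing ';'
            current_id = String(rstrip(split(line)[2], ';'))
        else
            m = match(pattern, line)
            if m !== nothing && m.captures[1] in features
                entry = (String(m.captures[1]),
                         parse(Int, m.captures[2]),
                         parse(Int, m.captures[3]))
                push!(get!(coords, current_id, Tuple{String, Int, Int}[]), entry)
            end
        end
    end
    return coords
end

# Invented, abbreviated record for illustration only (not copied from hla.dat)
sample = """
ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
FT   exon            301..373
FT                   /number="1"
FT   intron          374..503
FT                   /number="1"
//
"""

coords = feature_coords(IOBuffer(sample))
```

For the real file, the same function can be called as `open(feature_coords, "hla.dat")`. Since EMBL locations are 1-based and inclusive, the length of a feature is `stop - start + 1`.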

zimoun · April 1, 2019, 3:20pm

Thank you for the snippet.

If I can find some time, I will try to extend the readers to support IMGT/HLA and/or EMBL.
In the meantime, I am using BioPython.

All the best

jgreener64 · April 1, 2019, 3:23pm

Cool. If you do make any progress it would be a useful contribution to BioJulia.

kevbonham · April 1, 2019, 5:03pm

That would be great! Contributions like this are definitely welcome.
