Does Bio support the EMBL and/or IMGT/HLA formats?

Dear all,

I cannot find in the docs whether the metapackage Bio supports the EMBL format and/or the specific variant used by the HLA database [1] (hla.dat).

The wiki [2] mentions EMBL, but is it accurate?

Thank you in advance for any pointers and/or comments.

All the best,
simon

[1] IPD-IMGT/HLA Database
[2] core features · BioJulia/Bio.jl Wiki · GitHub

If it did, it would probably be via the BioSequences.jl package, but I don’t think it’s implemented.

The wiki is pretty out of date now - thankfully, many of the proposed features are now available (though perhaps not the one you want).

What data are you trying to open exactly? It seems they have FASTA files available, which you can open without problems with BioSequences.

ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/

Yes, there are FASTA files, but they are not what I need: they contain only some of the sequences, not the whole known database. What I am interested in is the metadata (which exon or intron is sequenced, the coordinates, etc.).

Put another way, some alleles are missing from the FASTA files because, for example, their full genomic sequence is not known yet. Consider the incomplete A*01:239, for which only exons 2 and 3 are sequenced and nothing more: it is in hla.dat but not in hla_gen.fasta, and from hla_nuc.fasta you cannot deduce the length of exon 2 or exon 3.

Basically, all this information is in this file:
https://media.githubusercontent.com/media/ANHIG/IMGTHLA/Latest/hla.dat
which is a variant of EMBL format.

If you want specific pieces of data then iterating over the lines with eachline and extracting what you want seems like the best way to go.

For example, you could extract exon coordinate data with something like:

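exons = Dict{String, Vector{Tuple{Int, Int}}}()

open("hla.dat") do f
    current_id = ""
    for line in eachline(f)
        fields = split(line)
        # Skip lines too short to be an ID or exon line (avoids a BoundsError)
        length(fields) < 3 && continue
        if fields[1] == "ID"
            # Entry ID, e.g. "HLA00001;" -> "HLA00001"
            current_id = String(rstrip(fields[2], ';'))
        elseif fields[1] == "FT" && fields[2] == "exon"
            # Location looks like "1..73"; strip "<"/">" in case partial
            # exons are marked the usual EMBL way (assumption, not checked)
            location = replace(fields[3], r"[<>]" => "")
            exon_start, exon_end = parse.(Int, split(location, ".."))
            # get! creates the empty vector the first time an ID is seen
            push!(get!(exons, current_id, Tuple{Int, Int}[]), (exon_start, exon_end))
        end
    end
end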

Warning: not tested and may not be what you want.
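
To try the idea without downloading the full file, here is a minimal self-contained sketch that wraps the same kind of loop in a function and runs it on a small in-memory record. The record text is made up for illustration (the ID and coordinates are not real hla.dat data), the function name exon_coordinates is hypothetical, and stripping "<" and ">" assumes partial exons use the usual EMBL location markers, which I have not checked against hla.dat.

```julia
# Hypothetical helper: collect exon (start, stop) pairs per entry ID
# from any IO stream holding EMBL-style text.
function exon_coordinates(io::IO)
    exons = Dict{String, Vector{Tuple{Int, Int}}}()
    current_id = ""
    for line in eachline(io)
        fields = split(line)
        length(fields) < 3 && continue
        if fields[1] == "ID"
            # "ID   TEST00001; SV 1; ..." -> "TEST00001"
            current_id = String(rstrip(fields[2], ';'))
        elseif fields[1] == "FT" && fields[2] == "exon"
            # Drop "<" / ">" partial-location markers (assumed EMBL convention)
            location = replace(fields[3], r"[<>]" => "")
            bounds = split(location, "..")
            length(bounds) == 2 || continue  # skip locations this sketch does not handle
            exon_start, exon_stop = parse.(Int, bounds)
            push!(get!(exons, current_id, Tuple{Int, Int}[]), (exon_start, exon_stop))
        end
    end
    return exons
end

# Made-up record for illustration only; not real hla.dat content.
sample = """
ID   TEST00001; SV 1; standard; DNA; HUM; 600 BP.
FT   exon            1..73
FT                   /number="1"
FT   intron          74..203
FT   exon            <204..473
FT                   /number="2"
//
"""

exons = exon_coordinates(IOBuffer(sample))

# Exon lengths follow from the inclusive coordinates
exon_lengths = [stop - start + 1 for (start, stop) in exons["TEST00001"]]
```

Taking an IO rather than a file name means the same function should also work on the real file via open(exon_coordinates, "hla.dat"). Exon lengths, which cannot be recovered from hla_nuc.fasta, then come directly from the coordinates.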

Thank you for the snippet.

If I can find some time, I will try to extend the readers to support IMGT/HLA and/or EMBL.
In the meantime, I am using BioPython. :wink:

All the best

Cool. If you do make any progress it would be a useful contribution to BioJulia.

That would be great! Contributions like this are definitely welcome.