Yes, there are FASTA files, but I am not interested in them because they only cover some of the sequences, not the whole known database. What I am after is the metadata (which exon or intron is sequenced, the coordinates, etc.).
Put another way, some alleles are missing from the FASTA files because, for example, their full genomic sequence is not known yet. Consider the incomplete allele A*01:239, for which only exons 2 and 3 are sequenced and nothing more: it is in hla.dat but not in hla_gen.fasta, and from hla_nuc.fasta you cannot deduce the length of exon 2 or exon 3.
If you want specific pieces of data, then iterating over the lines with eachline and extracting what you want seems like the best way to go.
For example, you could extract exon coordinate data with something like:
exons = Dict{String, Vector{Tuple{Int, Int}}}()
open("hla.dat") do f
    current_id = ""
    for line in eachline(f)
        # Read ID
        if startswith(line, "ID")
            current_id = split(line[3:end])[1][1:(end-1)]
        # Find exon lines
        elseif startswith(line, "FT") && split(line)[2] == "exon"
            # Find exon start/end
            exon_start, exon_end = parse.(Int, split(split(line)[3], ".."))
            if haskey(exons, current_id)
                push!(exons[current_id], (exon_start, exon_end))
            else
                exons[current_id] = [(exon_start, exon_end)]
            end
        end
    end
end
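Since the question was about exon lengths, here is a minimal sketch of how lengths could be derived from a dictionary shaped like the exons one built above. The helper name exon_lengths and the sample entry are hypothetical (the ID and coordinates are made up for illustration, not taken from hla.dat), and the sketch assumes the start and end coordinates are inclusive.

```julia
# Hypothetical helper: turn inclusive (start, stop) pairs into lengths.
exon_lengths(coords::Vector{Tuple{Int, Int}}) =
    [stop - start + 1 for (start, stop) in coords]

# Made-up entry standing in for the `exons` dictionary built above;
# the ID and coordinates are illustrative only.
example = Dict("EXAMPLE_ID" => [(1, 73), (204, 473)])

# Map every ID to the lengths of its exons, in file order.
lengths = Dict(id => exon_lengths(coords) for (id, coords) in example)
```

Note that the position in each vector is just the order in which the exon lines appear, not necessarily the exon number. For a partial allele such as A*01:239, the first entry would presumably be exon 2, so if the exon number matters you would probably need to capture it from the file as well.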