Hi all, I need some help parsing a file and can’t figure out how to do it. My data looks like this (it’s a 4 GB file):
101m A 1 MET 0 P02185 M 1
101m A 2 VAL 1 P02185 V 2
101m A 3 LEU 2 P02185 L 3
101m A 4 SER 3 P02185 S 4
102l A 21 THR 21 P00720 T 26
102l A 22 GLU 22 P00720 E 27
102l A 23 GLY 23 P00720 G 28
1c1c A 189 SER 191
1c1c A 190 ASP 192
I need to tidy it somehow and convert it to a DataFrame. I have previously used push! for this, and I have a kind-of working example, but the problem is that I need to create a less redundant DataFrame that should look like this:
|C1 |C2|C3 |C4 | C5|
|----|-|------|---------|--------|
|101m|A|P02185|[1,2,3,4]|[1,2,3,4]|
|102l|A|P00720|[21,22,23]|[26,27,28]|
|1c1c|A|missing|missing|missing|
So, what I want is to collect into arrays the values of all rows that share the same identifier (the first column), but I am struggling with two things: first, how to do that, and second, how to handle the missing data.
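To make the target concrete, here is a minimal sketch of the grouping step on a tiny hand-built table, assuming DataFrames.jl and assuming the lines have already been parsed into one row each. The column names and the `collapse` helper are hypothetical. Note one difference from the table above: for 1c1c this sketch keeps the PDB indices and only sets the Uniprot columns to `missing`.

```julia
using DataFrames

# Hand-built stand-in for the parsed file: one row per input line.
df = DataFrame(
    PDB = ["101m", "101m", "102l", "1c1c"],
    Chain = ["A", "A", "A", "A"],
    PDB_index = [1, 2, 21, 189],
    Uniprot = ["P02185", "P02185", "P00720", missing],
    Uniprot_index = [1, 2, 26, missing],
)

# Hypothetical helper: a group's column becomes `missing` when nothing
# was mapped, otherwise a plain vector without missing entries.
collapse(v) = all(ismissing, v) ? missing : collect(skipmissing(v))

# One output row per (PDB, Chain). Wrapping the result in `Ref` makes
# `combine` store the whole vector in a single cell instead of
# spreading it over several rows.
grouped = combine(
    groupby(df, [:PDB, :Chain]),
    :Uniprot => first => :Uniprot,
    :PDB_index => (v -> Ref(collapse(v))) => :PDB_index,
    :Uniprot_index => (v -> Ref(collapse(v))) => :Uniprot_index,
)
```

Taking `first` of the Uniprot column assumes each PDB/chain pair maps to a single Uniprot accession; if that does not hold, `:Uniprot` would have to become a grouping key as well.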
I was trying to build a simple function like this:
function parsefile(filename)
    l = readlines(filename)
    sl = split.(l)
    if length.(sl) == 8
        return (
            PDB=getindex.(sl,1),
            Chain=join.(getindex.(sl,2)),
            PDB_index=parse.(Float64,getindex.(sl, 3)),
            PDB_Aa=join(getindex.(sl, 4)),
            Uniprot=sl[i][6],
            Uniprot_Aa=join.(getindex.(sl, 7)),
            Uniprot_index=parse.(Float64, getindex.(sl, 8)),
        )
    else
        return (
            PDB=getindex.(sl,1),
            Chain=join.(getindex.(sl,2)),
            PDB_index=parse.(Float64,getindex.(sl, 3)),
            PDB_Aa=join(getindex.(sl, 4)),
            Uniprot=join.("missing"),
            Uniprot_Aa=join.("missing"),
            Uniprot_index=join.("missing")
        )
    end
end
push!(df, parsefile(myfile))
But this generates a 1-row DataFrame, and the last 3 columns are “missing”. All my attempts to split the DataFrame into rows without repeating the first column failed (I thought about doing something like if getindex.(sl,1) == getindex.(sl,i) inside the function, and only saving the name once, but it did not work).
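One likely reason for the single row is that length.(sl) == 8 compares a whole vector with the number 8, which is false, so the else branch always runs and the named tuple of whole columns is pushed as one row. A minimal sketch of a per-line alternative follows, assuming DataFrames.jl, plain integer indices, and that short lines simply lack the Uniprot fields. The names `parseline` and `parserows` are hypothetical, and the sample lines are copied from the data above.

```julia
using DataFrames

# Turn one whitespace-separated line into a NamedTuple (hypothetical helper).
# Lines with 8 fields carry the Uniprot mapping; shorter lines get `missing`.
function parseline(line::AbstractString)
    f = split(line)
    mapped = length(f) == 8
    return (
        PDB = String(f[1]),
        Chain = String(f[2]),
        PDB_index = parse(Int, f[3]),   # assumes plain integer indices
        PDB_Aa = String(f[4]),
        # field 5 is skipped, as in the function above
        Uniprot = mapped ? String(f[6]) : missing,
        Uniprot_Aa = mapped ? String(f[7]) : missing,
        Uniprot_index = mapped ? parse(Int, f[8]) : missing,
    )
end

# Read a stream line by line and push one row per non-empty line.
function parserows(io::IO)
    df = DataFrame(
        PDB = String[], Chain = String[], PDB_index = Int[], PDB_Aa = String[],
        Uniprot = Union{Missing,String}[],
        Uniprot_Aa = Union{Missing,String}[],
        Uniprot_index = Union{Missing,Int}[],
    )
    for line in eachline(io)
        isempty(strip(line)) && continue
        push!(df, parseline(line))
    end
    return df
end

# Small in-memory stand-in for the real file.
sample = IOBuffer("""
101m A 1 MET 0 P02185 M 1
101m A 2 VAL 1 P02185 V 2
1c1c A 189 SER 191
""")
rows = parserows(sample)
```

For a real file, `open(parserows, myfile)` would stream it with `eachline` instead of loading every line at once with `readlines`, which may matter at 4 GB. The resulting one-row-per-line table could then be grouped by the first column.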
As usual, any advice is helpful.
Thanks a lot!