Help needed for parsing a database

Hi all, I need some help parsing a file; I can’t figure out how to do it. My data looks like this (it’s a 4 GB file):

101m A        1 MET 0      P02185    M         1
101m A        2 VAL 1      P02185    V         2
101m A        3 LEU 2      P02185    L         3
101m A        4 SER 3      P02185    S         4
102l A       21 THR 21     P00720    T        26
102l A       22 GLU 22     P00720    E        27
102l A       23 GLY 23     P00720    G        28
1c1c A      189 SER 191    
1c1c A      190 ASP 192    

And I need to tidy it somehow and convert it to a DataFrame. I have previously used push! for this, and I have a kind-of working example, but the problem is that I need to create a less redundant DataFrame, which should look like this:

|C1  |C2|C3   |C4       |      C5|

So, what I want is to join into arrays all the columns that have the same identifier (the first column), but I am struggling with two things: first, how to do that, and second, how to handle the missing data.
I was trying to build a simple function like this:

function parsefile(filename)
    l = readlines(filename)
    sl = split.(l)

    if length.(sl) == 8
        return (
            PDB_index = parse.(Float64, getindex.(sl, 3)),
            PDB_Aa = join(getindex.(sl, 4)),
            Uniprot_Aa = join.(getindex.(sl, 7)),
            Uniprot_index = parse.(Float64, getindex.(sl, 8)),
        )
    end
end

push!(df, parsefile(myfile))

But this generates a 1-row DataFrame, and the last 3 columns are “missing”. All my attempts to separate the DataFrame into rows without repeating the first column were useless (I thought about doing something like if getindex.(sl, 1) == getindex.(sl, i) inside the function, and only saving the name once, but it did not work).

As usual, any advice is helpful.

Thanks a lot!

Do you need this “not so redundant” data frame immediately (because you are memory constrained), or do you just want it eventually?

If it is the latter, then you can groupby the original data frame by the columns you want to aggregate on and then combine the rows.
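A minimal sketch of that groupby/combine step, assuming DataFrames.jl is installed; the column names here are made up for illustration, so adjust them to your actual schema:

```julia
using DataFrames  # assumes DataFrames.jl is available

# Toy frame mimicking the sample data; column names are hypothetical.
df = DataFrame(pdb     = ["101m", "101m", "102l"],
               uniprot = ["P02185", "P02185", "P00720"],
               index   = [1, 2, 21],
               aa      = ["M", "V", "T"])

# Group on the identifier columns, then collapse each remaining column
# into one vector per group (wrapping in Ref keeps the whole vector
# as a single cell instead of expanding it back into rows).
g = combine(groupby(df, [:pdb, :uniprot]),
            :index => Ref => :index,
            :aa    => Ref => :aa)
```

After this, g has one row per (pdb, uniprot) pair, with the index and aa columns holding vectors.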

However, I am not clear on how you came to the conclusion that 1c1c in C1 should be associated with P00720, as it is missing there.

I need to DataFrame it later, once it is grouped; if I try to do it all at once, I end up crashing my machine because of the size (it’s something like 88,000,000 rows or more).

As for the latter, it was a typo! That last row can go, because it has missing data. I will edit the question ASAP. Thanks!

In this case, what I would do is start with a dictionary whose keys would be (C1, C2, C3) tuples and whose values would be tuples of vectors (C4, C5).
Also, instead of missing in C4 and C5, I would rather use just an empty vector.
Next I would populate this dictionary row by row (if a new row key is hit, add it; if an old row key is found, append the data to its vectors).
This should be simple and efficient to populate. Later you can decide: either such a data structure is OK for you or, if not, you can relatively easily convert it to a DataFrame.
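A minimal sketch of that dictionary approach in plain Julia, assuming the 8-column layout of the sample data; which fields form the key and what each column means are guesses on my part, so adjust them to your data:

```julia
# Group rows under a (PDB id, chain, Uniprot id) key, accumulating the
# PDB index and Uniprot amino acid into per-group vectors.
function accumulate_rows(lines)
    groups = Dict{NTuple{3,String},Tuple{Vector{Int},Vector{String}}}()
    for line in lines
        f = split(line)
        length(f) == 8 || continue               # skip incomplete rows
        key = (String(f[1]), String(f[2]), String(f[6]))
        vecs = get!(groups, key) do
            (Int[], String[])                    # empty vectors, not missing
        end
        push!(vecs[1], parse(Int, f[3]))         # PDB index
        push!(vecs[2], String(f[7]))             # Uniprot amino acid
    end
    return groups
end

demo = accumulate_rows(["101m A 1 MET 0 P02185 M 1",
                        "101m A 2 VAL 1 P02185 V 2",
                        "1c1c A 189 SER 191"])
```

Incomplete rows (fewer than 8 fields) are simply skipped here, which matches dropping the rows with missing data; get! inserts the empty-vector pair only the first time a key is seen.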

As a side note, this most likely means that whatever you do will cause memory issues. What could help is, e.g., narrowing the type of C4 and C5. For example, if you know that the values are not larger than 255, you can use the UInt8 type to store them.
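For illustration, parsing straight into a narrower element type looks like this (assuming the values really do fit in 0–255):

```julia
# Each UInt8 takes 1 byte instead of the 8 bytes of a default Int64,
# so a Vector{UInt8} uses an eighth of the memory for the same counts.
v = parse.(UInt8, ["1", "2", "255"])
```

sizeof(v) is then 3 bytes for three values, versus 24 with Int64, and parse throws an error if a value overflows the type, which is a useful safety check.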

Thanks! I actually ended up splitting the file into chunks (using bash), loading them with CSV, using combine and groupby, and vcat to get the full DataFrame; in the end, that is what worked for me. But for a bigger dataset, I will go with this approach.

