Help needed for parsing a database

Hi all, I need some help parsing a file; I can’t figure out how to do it. My data looks like this (it’s a 4 GB file):

101m A        1 MET 0      P02185    M         1
101m A        2 VAL 1      P02185    V         2
101m A        3 LEU 2      P02185    L         3
101m A        4 SER 3      P02185    S         4
102l A       21 THR 21     P00720    T        26
102l A       22 GLU 22     P00720    E        27
102l A       23 GLY 23     P00720    G        28
1c1c A      189 SER 191    
1c1c A      190 ASP 192    

And I need to tidy it somehow and convert it to a DataFrame. I have previously used push! for this, and I have a kind-of working example, but the problem is that I need to create a less redundant DataFrame, which should look like this:

|C1  |C2|C3   |C4       |      C5|

So, what I want is to join into arrays all the columns that have the same identifier (the first column), but I am struggling with two things: first, how to do that, and second, how to handle the missing data.
I was trying to build a simple function like this:

function parsefile(filename)
    l = readlines(filename)
    sl = split.(l)

    if length.(sl) == 8
        return (
            PDB_index = parse.(Float64, getindex.(sl, 3)),
            PDB_Aa = join(getindex.(sl, 4)),
            Uniprot_Aa = join.(getindex.(sl, 7)),
            Uniprot_index = parse.(Float64, getindex.(sl, 8)),
        )
    end
end

push!(df, parsefile(myfile))

But this generates a 1-row DataFrame, and the last 3 columns are “missing”. All my attempts to separate the DataFrame into rows without repeating the first column were useless (I thought about doing something like if getindex.(sl, 1) == getindex.(sl, i) inside the function, and only saving the name once, but it did not work).

As usual, any advice is helpful.

Thanks a lot!

Do you need this “not so redundant” data frame immediately (because you are memory constrained), or do you just want it eventually?

If it is the latter, then you can groupby the original data frame by the columns you want to aggregate on and then combine the rows.
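A minimal sketch of that groupby/combine step, assuming DataFrames.jl is installed; the column names here are made up for illustration, so adjust them to your actual schema:

```julia
using DataFrames  # assumes DataFrames.jl is available

# Toy frame mimicking the sample data; column names are hypothetical.
df = DataFrame(pdb     = ["101m", "101m", "102l"],
               uniprot = ["P02185", "P02185", "P00720"],
               index   = [1, 2, 21],
               aa      = ["M", "V", "T"])

# Group on the identifier columns, then collapse each remaining column
# into one vector per group (wrapping in Ref keeps the whole vector
# as a single cell instead of expanding it back into rows).
g = combine(groupby(df, [:pdb, :uniprot]),
            :index => Ref => :index,
            :aa    => Ref => :aa)
```

After this, g has one row per (pdb, uniprot) pair, with the index and aa columns holding vectors.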

However, I am not clear on how you came to the conclusion that 1c1c in C1 should be associated with P00720, as it is missing there.

I need to DataFrame it later, once it is grouped; if I try to do it all at once, I end up crashing my machine because of the size (it’s something like 88,000,000 rows or more).

As for the latter, it was a typo! That last row can go, because it has missing data. I will edit the question ASAP. Thanks!

In this case, what I would do is start with a dictionary whose keys would be (C1, C2, C3) tuples and whose values would be tuples of vectors (C4, C5).
Also, instead of missing in C4 and C5, I would rather use just an empty vector.
Next I would populate this dictionary row by row (if a new row key is hit, add it; if an old row key is found, append the data to its vectors).
This should be simple and efficient to populate. Later you can decide: either such a data structure is OK for you or, if not, you can relatively easily convert it to a DataFrame.
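A minimal sketch of that dictionary approach in plain Julia, assuming the 8-column layout of the sample data; which fields form the key and what each column means are guesses on my part, so adjust them to your data:

```julia
# Group rows under a (PDB id, chain, Uniprot id) key, accumulating the
# PDB index and Uniprot amino acid into per-group vectors.
function accumulate_rows(lines)
    groups = Dict{NTuple{3,String},Tuple{Vector{Int},Vector{String}}}()
    for line in lines
        f = split(line)
        length(f) == 8 || continue               # skip incomplete rows
        key = (String(f[1]), String(f[2]), String(f[6]))
        vecs = get!(groups, key) do
            (Int[], String[])                    # empty vectors, not missing
        end
        push!(vecs[1], parse(Int, f[3]))         # PDB index
        push!(vecs[2], String(f[7]))             # Uniprot amino acid
    end
    return groups
end

demo = accumulate_rows(["101m A 1 MET 0 P02185 M 1",
                        "101m A 2 VAL 1 P02185 V 2",
                        "1c1c A 189 SER 191"])
```

Incomplete rows (fewer than 8 fields) are simply skipped here, which matches dropping the rows with missing data; get! inserts the empty-vector pair only the first time a key is seen.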

As a side note, this most likely means that whatever you do will cause memory issues. What could help is, e.g., narrowing the type of C4 and C5. For example, if you know that the values are not larger than 255, you can use the UInt8 type to store them.
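For illustration, parsing straight into a narrower element type looks like this (assuming the values really do fit in 0–255):

```julia
# Each UInt8 takes 1 byte instead of the 8 bytes of a default Int64,
# so a Vector{UInt8} uses an eighth of the memory for the same counts.
v = parse.(UInt8, ["1", "2", "255"])
```

sizeof(v) is then 3 bytes for three values, versus 24 with Int64, and parse throws an error if a value overflows the type, which is a useful safety check.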

Thanks! I actually ended up splitting the file into chunks (using bash), loading them with CSV, using combine and groupby, and vcat to get the full DataFrame; in the end, that is what worked for me. But for a bigger dataset, I will go with this approach.

