Help needed with creating lists, and modifying them, inside a DataFrame

Hi all. I need help figuring out how to do this:

I have a DataFrame that looks like this:

	df = DataFrame()
	df.pdb_names = ["C1_1", "C1_1", "C1_2", "C1_3", "C2_1", "C2_2"]
	df.missings_range = [[2,5],[9,10], [9,10], [1,4], [1,4],[1,4]]
	df.seqs_length = [10,10,10,7,7,7]

And I have been struggling with two things:
First, I need to create a new column made up of lists of 0s, where the length of each list is given by seqs_length, so the first rows should each hold a list of ten 0s, and so on.

For this, I tried something like this:
insertcols!(df2, :x => repeat([0],inner =df.seq_length, outer= 5))
but I guess that is not the correct way to do it, because I get an OutOfMemoryError(). Also, here I pass a fixed number as outer, but I want to do this for the whole DataFrame, which stores a lot of data.

Secondly, I was trying to figure out a way to replace the 0s in these lists within a certain range. I know that I can do something with replace.(), but the issue is that I need to replace the 0s at the positions delimited by the df.missings_range column, so the first row should have a list that looks like this: [0,1,1,1,1,0,0,0,0,0]. Here I’m at a total loss, so any help is welcome.

The expected result (with the last column) should look like this:

|pdb_names|ranges_missing|
|---|---|
|"C1_1"|[0,1,1,1,1,0,0,0,0,0]|
|"C1_1"|[0,0,0,0,0,0,0,0,1,1]|

Thanks a lot!

Let’s first define a helper function that creates the output array you want for each row:

julia> function indicate_missings(missings_range, seqs_length)  
         v = zeros(Bool, seqs_length)
         v[first(missings_range):last(missings_range)] .= true
         v
       end
indicate_missings (generic function with 1 method)

This takes a single missings_range element and a single seqs_length value and returns the desired vector for that row. You can test it with, for example:

julia> indicate_missings([2, 5], 10) |> println
Bool[0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
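
One possible extension, as a minimal sketch: the helper above throws a BoundsError when the range extends past seqs_length. The variant below validates the range first. The name indicate_missings_checked, the checks, and the error message are assumptions made for illustration and are not part of the answer above.

```julia
# Hypothetical variant of `indicate_missings`; name and checks are illustrative assumptions.
function indicate_missings_checked(missings_range, seqs_length)
    lo, hi = first(missings_range), last(missings_range)
    # Fail early with a readable message instead of a BoundsError.
    1 <= lo <= hi <= seqs_length ||
        throw(ArgumentError("range $lo:$hi does not fit in a sequence of length $seqs_length"))
    v = zeros(Bool, seqs_length)   # every position starts as `false`
    v[lo:hi] .= true               # mark the missing positions
    return v
end

indicate_missings_checked([2, 5], 10)
```

It can be passed to ByRow in the same way as the original helper.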

Then, you can do a transform like this:

julia> transform(df, 
         [:missings_range, :seqs_length] => ByRow(indicate_missings) => :ranges_missing)
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  ranges_missing
     │ String     Vector{Int64}   Int64        Vector{Bool}
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  Bool[0, 1, 1, 1, 1, 0, 0, 0, 0, …
   2 │ C1_1       [9, 10]                  10  Bool[0, 0, 0, 0, 0, 0, 0, 0, 1, …
   3 │ C1_2       [9, 10]                  10  Bool[0, 0, 0, 0, 0, 0, 0, 0, 1, …
   4 │ C1_3       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]

If you have questions about any part of this, please feel free to ask!


Perhaps what you were trying to do could be done in the following way, on the understanding that @digital_carver’s solution is the “good” one.

julia> df = DataFrame(
        pdb_names = ["C1_1", "C1_1", "C1_2", "C1_3", "C2_1", "C2_2"],
        missings_range = [[2,5],[9,10], [9,10], [1,4], [1,4],[1,4]],
        seqs_length = [10,10,10,7,7,7])
6×3 DataFrame
 Row │ pdb_names  missings_range  seqs_length
     │ String     Vector{Int64}   Int64
─────┼────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10
   2 │ C1_1       [9, 10]                  10
   3 │ C1_2       [9, 10]                  10
   4 │ C1_3       [1, 4]                    7
   5 │ C2_1       [1, 4]                    7
   6 │ C2_2       [1, 4]                    7

julia> insertcols!(df, :x => repeat.(Ref([0]),df.seqs_length))
# or, equivalently (run only one of the two calls, since :x can be added only once)
julia> insertcols!(df, :x => fill.(0,df.seqs_length))
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  x
     │ String     Vector{Int64}   Int64        Vector{Int64}
─────┼────────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   4 │ C1_3       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]

julia> combine(df, [:x, :missings_range]=>ByRow((x,y)->x[range(y...)].=1))
6×1 DataFrame
 Row │ x_missings_range_function
     │ SubArray…
─────┼───────────────────────────
   1 │ [1, 1, 1, 1]
   2 │ [1, 1]
   3 │ [1, 1]
   4 │ [1, 1, 1, 1]
   5 │ [1, 1, 1, 1]
   6 │ [1, 1, 1, 1]

julia> df
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  x
     │ String     Vector{Int64}   Int64        Vector{Int64}
─────┼────────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
   4 │ C1_3       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]

Regarding the result of these expressions: can anyone explain why the contents of column :x have been mutated, even though combine returns a new DataFrame?
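
A likely explanation, sketched without DataFrames: the function passed to ByRow receives the vectors stored in :x themselves, not copies, so the in-place assignment `.=` writes into the column's own elements. The value returned by a `.=` on an indexed slice is only that slice, which would account for the SubArray column in the combine output. The names mark!, marked, x, x2 and ranges below are made up for this illustration, and map is used as a rough stand-in for the row-wise call.

```julia
# Stand-in for the :x column: a vector whose elements are themselves vectors.
x = [fill(0, 3), fill(0, 2)]
ranges = [[1, 2], [2, 2]]

# Same body as the anonymous function passed to ByRow above.
mark!(v, r) = (v[range(r...)] .= 1)

# Calling it element by element hands over the stored vectors, not copies.
result = map(mark!, x, ranges)
# `x` has been modified in place; `result` holds only the assigned slices.

# Non-mutating alternative: copy first, then return the full vector.
marked(v, r) = (w = copy(v); w[range(r...)] .= 1; w)
x2 = [fill(0, 3), fill(0, 2)]
result2 = map(marked, x2, ranges)
# `x2` is left untouched; `result2` holds the full marked vectors.
```

Under this reading, copying inside the function (or building a fresh vector, as the helper-function solution does) avoids touching the original column.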

A different solution … without insertcols!()


transform(df, [:missings_range, :seqs_length]=>ByRow(((y,x)-> [e in range(y...) ? 1 : 0 for e in 1:x] )))

transform(df, [:missings_range, :seqs_length]=>ByRow(((y,x)-> [e in range(y...) ? true : false for e in 1:x] )))

transform(df, [:missings_range, :seqs_length]=>ByRow((x,y)->reverse(bitstring(sum((^).(2,range(x...).-1))))[1:y])=>:rngs)
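
The last one-liner is dense, so here is a step-by-step sketch of what it appears to do for a single row, using the first row's values (range [2, 5], length 10). The variable names r, n, bits, mask and flags are illustrative assumptions. Note that the result is a String of '0' and '1' characters rather than a vector, and that the trick relies on the 64-bit width of Int64, so it presumably only suits sequences no longer than that.

```julia
r = range(2, 5)                 # the positions to mark, i.e. 2:5
n = sum((^).(2, r .- 1))        # 2^1 + 2^2 + 2^3 + 2^4: one bit set per position
bits = bitstring(n)             # 64-character string, most significant bit first
mask = reverse(bits)[1:10]      # least significant bit first, cut to the sequence length

# Optional: turn the string into a vector of Bool flags.
flags = [c == '1' for c in mask]
```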


How about this?

julia> using DataFramesMeta;

julia> @rtransform df :missing_ranges = begin
           x = zeros(Int, :seqs_length)
           mi = :missings_range[1]
           mx = :missings_range[2]
           x[mi:mx] .= 1
           x
       end
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  missing_ranges             ⋯
     │ String     Vector{Int64}   Int64        Vector{Int64}              ⋯
─────┼─────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  [0, 1, 1, 1, 1, 0, 0, 0, 0 ⋯
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1
   4 │ C1_3       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]      ⋯
   6 │ C2_2       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
                                                           1 column omitted


Hi all! Thanks a lot. All the solutions worked great. Next time I should try writing a function right away, instead of trying to solve it using only the package and Base. I marked the first one as the solution, only because it was the first one!

Thanks again, cheers

Sometimes using only the package and Base might be simpler:

df.ranges_missing = zeros.(Int, df.seqs_length)
for r in eachrow(df)
    mi, mx = r.missings_range
    r.ranges_missing[mi:mx] .= 1
end