Help needed with creating lists, and modifying them, inside a DataFrame

Hi all. I need help figuring out how to do this:

I have a DataFrame that looks like this:

	df = DataFrame()
	df.pdb_names = ["C1_1", "C1_1", "C1_2", "C1_3", "C2_1", "C2_2"]
	df.missings_range = [[2,5],[9,10], [9,10], [1,4], [1,4],[1,4]]
	df.seqs_length = [10,10,10,7,7,7]

And I have been struggling with two things:
First, I need to create a new column holding a list of 0’s, where the length of each list is determined by seqs_length, so the first rows should get a list of ten 0’s, and so on.

For this, I tried doing something like this:
insertcols!(df2, :x => repeat([0],inner =df.seq_length, outer= 5)), but I guess this is not the correct way of doing it, because I get an OutOfMemoryError(). Also, here I am passing outer as a fixed number, but I want to do this for the whole DataFrame, which stores a lot of data.

Secondly, I was trying to figure out a way to replace the 0’s in this list within a certain range. I know that I can do something with replace.(), but the issue is that I need to replace the 0’s at the positions delimited by the df.missings_range column, so the first row should have a list that looks like this: [0,1,1,1,1,0,0,0,0,0]. Here I’m at a total loss, so any help is welcome.

The expected result (with the last column) should look like this:

|pdb_names |ranges_missing|

Thanks a lot!

Let’s define a helper function first, that creates the output array you want for each row:

julia> function indicate_missings(missings_range, seqs_length)
         v = zeros(Bool, seqs_length)
         v[first(missings_range):last(missings_range)] .= true
         return v
       end
indicate_missings (generic function with 1 method)

This takes a single missings_range element and a single seqs_length value and returns the desired vector for that row. You can test it with, for example:

julia> indicate_missings([2, 5], 10) |> println
Bool[0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

Then, you can do a transform like this:

julia> transform(df, 
         [:missings_range, :seqs_length] => ByRow(indicate_missings) => :ranges_missing)
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  ranges_missing
     │ String     Vector{Int64}   Int64        Vector{Bool}
   1 │ C1_1       [2, 5]                   10  Bool[0, 1, 1, 1, 1, 0, 0, 0, 0, …
   2 │ C1_1       [9, 10]                  10  Bool[0, 0, 0, 0, 0, 0, 0, 0, 1, …
   3 │ C1_2       [9, 10]                  10  Bool[0, 0, 0, 0, 0, 0, 0, 0, 1, …
   4 │ C1_3       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
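
For anyone curious what ByRow is doing here, this is a minimal sketch using only Base Julia: it applies the helper to the two columns element by element with map. The plain vectors below stand in for df.missings_range and df.seqs_length and are made-up illustration data, not the real DataFrame.

```julia
# The helper from this post, written out in full: a Bool vector of length
# `seqs_length` with `true` inside the inclusive missing range.
function indicate_missings(missings_range, seqs_length)
    v = zeros(Bool, seqs_length)
    v[first(missings_range):last(missings_range)] .= true
    return v
end

# Plain vectors standing in for the DataFrame columns (illustrative data).
missings_range = [[2, 5], [9, 10], [1, 4]]
seqs_length = [10, 10, 7]

# map over two collections calls the helper once per "row".
ranges_missing = map(indicate_missings, missings_range, seqs_length)
```

Assigning such a result to a new column (for example df.ranges_missing = ...) should give the same values as the transform call.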

If you have questions about any part of this, please feel free to ask!


Perhaps what you were trying to do could be done in the following way, with the understanding that @digital_carver's solution is the “Good” one.

julia> df = DataFrame(
        pdb_names = ["C1_1", "C1_1", "C1_2", "C1_3", "C2_1", "C2_2"],
        missings_range = [[2,5],[9,10], [9,10], [1,4], [1,4],[1,4]],
        seqs_length = [10,10,10,7,7,7])
6×3 DataFrame
 Row │ pdb_names  missings_range  seqs_length
     │ String     Vector{Int64}   Int64
   1 │ C1_1       [2, 5]                   10
   2 │ C1_1       [9, 10]                  10
   3 │ C1_2       [9, 10]                  10
   4 │ C1_3       [1, 4]                    7
   5 │ C2_1       [1, 4]                    7
   6 │ C2_2       [1, 4]                    7

julia> insertcols!(df, :x => repeat.(Ref([0]),df.seqs_length))
# or 
julia> insertcols!(df, :x => fill.(0,df.seqs_length))
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  x
     │ String     Vector{Int64}   Int64        Vector{Int64}
   1 │ C1_1       [2, 5]                   10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   4 │ C1_3       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]
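
A possible explanation for the OutOfMemoryError in the original attempt, as a rough sketch: when inner is given several numbers, repeat appears to treat them as repetition counts per dimension rather than per element, so the result is one multi-dimensional array whose element count is the product of all the counts. With one count per row of a large DataFrame, that product explodes. The small tuple below is made up for illustration.

```julia
# inner = (2, 3) means "repeat 2 times along dim 1 and 3 times along dim 2",
# so the result is a single 2×3 array, not two vectors of lengths 2 and 3.
a = repeat([0], inner = (2, 3))
size(a)

# With one count per row, the element count is the product of all counts.
lens = [10, 10, 10, 7, 7, 7]
prod(lens)   # 343000 elements for just six rows

# Broadcasting fill over the lengths builds one vector per row instead.
x = fill.(0, lens)
length.(x)
```

This is why the broadcast forms with repeat. and fill. behave so differently from the keyword form.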

julia> combine(df, [:x, :missings_range]=>ByRow((x,y)->x[range(y...)].=1))
6×1 DataFrame
 Row │ x_missings_range_function
     │ SubArray…
   1 │ [1, 1, 1, 1]
   2 │ [1, 1]
   3 │ [1, 1]
   4 │ [1, 1, 1, 1]
   5 │ [1, 1, 1, 1]
   6 │ [1, 1, 1, 1]

julia> df
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  x
     │ String     Vector{Int64}   Int64        Vector{Int64}
   1 │ C1_1       [2, 5]                   10  [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
   4 │ C1_3       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]

About the result of these expressions: can anyone explain why the contents of column :x were mutated, even though combine returned a new DataFrame?
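
One likely explanation, sketched with plain Base Julia: each cell of :x holds a reference to a vector, and the anonymous function receives that same vector rather than a copy. The broadcast assignment then writes into it in place, and the value of the expression is the view that was assigned to, which would account for the short SubArray results in the combine output. The variable names below are illustrative only.

```julia
# A vector of vectors, like the :x column.
col = [zeros(Int, 5), zeros(Int, 3)]
ranges = [[2, 4], [1, 2]]

# The anonymous function gets the stored vectors themselves and mutates
# them in place; map collects whatever each call returns (the views).
returned = map((x, y) -> (x[y[1]:y[2]] .= 1), col, ranges)

col   # the original vectors now contain the ones

# Copying first leaves the stored vectors untouched.
col2 = [zeros(Int, 5), zeros(Int, 3)]
safe = map((x, y) -> (z = copy(x); z[y[1]:y[2]] .= 1; z), col2, ranges)
```

So if the original column should stay as it is, the function needs to copy the vector (or build a fresh one) and return that.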

A different solution … without insertcols!()

transform(df, [:missings_range, :seqs_length]=>ByRow(((y,x)-> [e in range(y...) ? 1 : 0 for e in 1:x] )))

transform(df, [:missings_range, :seqs_length]=>ByRow(((y,x)-> [e in range(y...) ? true : false for e in 1:x] )))

transform(df, [:missings_range, :seqs_length]=>ByRow((x,y)->reverse(bitstring(sum((^).(2,range(x...).-1))))[1:y])=>:rngs)
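
The last one-liner is fairly dense, so here it is unpacked step by step as a sketch (the helper name bitstring_mask is made up for illustration). It sets one bit per missing position in an integer, prints the bits, and reverses them so that position 1 comes first. Two caveats: the result is a String of '0' and '1' characters rather than a vector, and since bitstring of an Int64 has 64 characters, this only works for sequences of at most 64 positions.

```julia
# Hypothetical helper that spells out the bitstring one-liner.
function bitstring_mask(missings_range, seqs_length)
    positions = missings_range[1]:missings_range[2]
    n = sum(2 .^ (positions .- 1))       # set bit (p - 1) for each position p
    bits = bitstring(n)                  # 64 characters, most significant bit first
    return reverse(bits)[1:seqs_length]  # flip so position 1 is first, then trim
end

bitstring_mask([2, 5], 10)   # "0111100000"
```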


How about this?

julia> using DataFramesMeta;

julia> @rtransform df :missing_ranges = begin
           x = zeros(Int, :seqs_length)
           mi = :missings_range[1]
           mx = :missings_range[2]
           x[mi:mx] .= 1
           x
       end
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  missing_ranges             ⋯
     │ String     Vector{Int64}   Int64        Vector{Int64}              ⋯
   1 │ C1_1       [2, 5]                   10  [0, 1, 1, 1, 1, 0, 0, 0, 0 ⋯
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1
   4 │ C1_3       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]      ⋯
   6 │ C2_2       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
                                                           1 column omitted


Hi all! Thanks a lot. All the solutions worked great. Next time I should directly try to create a function, instead of trying to solve it only using the package and base. I marked the first one as the solution, only because it was the first one!

Thanks again, cheers

Sometimes using only the package and base might be simpler:

df.ranges_missing = zeros.(Int, df.seqs_length)
for r in eachrow(df)
    mi, mx = r.missings_range
    r.ranges_missing[mi:mx] .= 1
end