Help needed regarding making list, and modifying them, inside a DataFrame

Juan_Mac_Donagh · September 13, 2022, 7:27pm

Hi all. I need help figuring out how to do this:

I have a DataFrame that looks like this:

	using DataFrames
	df = DataFrame()
	df.pdb_names = ["C1_1", "C1_1", "C1_2", "C1_3", "C2_1", "C2_2"]
	df.missings_range = [[2,5],[9,10], [9,10], [1,4], [1,4],[1,4]]
	df.seqs_length = [10,10,10,7,7,7]

And I have been struggling with two things:
First, I need to create a new column made up of lists of 0’s, where the length of each list is given by seqs_length, so the first rows should hold a list of ten 0’s, and so on.

For this, I tried something like this:
insertcols!(df2, :x => repeat([0],inner =df.seq_length, outer= 5)), but I guess that is not the correct way to do it, because I get an OutOfMemoryError(). Also, here I am passing outer as a fixed number, but I want to do this for the whole DataFrame, which stores a lot of data.

Secondly, I was trying to figure out a way to replace the 0’s in these lists within a certain range. I know that I can do something with replace.(), but the issue is that I need to replace the 0’s at the positions delimited by the df.missings_range column, so the first row should have a list that looks like this: [0,1,1,1,1,0,0,0,0,0]. Here I’m at a total loss, so any help is welcome.

The expected result (with the last column) should look like this:

|pdb_names|ranges_missing|
|---|---|
|"C1_1"|[0,1,1,1,1,0,0,0,0,0]|
|"C1_1"|[0,0,0,0,0,0,0,0,1,1]|

Thanks a lot!

digital_carver · September 13, 2022, 7:59pm

Let’s define a helper function first, that creates the output array you want for each row:

julia> function indicate_missings(missings_range, seqs_length)  
         v = zeros(Bool, seqs_length)
         v[first(missings_range):last(missings_range)] .= true
         v
       end
indicate_missings (generic function with 1 method)

This takes a single missings_range element and a single seqs_length value and returns the desired vector for it. You can test it with, for example:

julia> indicate_missings([2, 5], 10) |> println
Bool[0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

Then, you can do a transform like this:

julia> transform(df, 
         [:missings_range, :seqs_length] => ByRow(indicate_missings) => :ranges_missing)
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  ranges_missing                    
     │ String     Vector{Int64}   Int64        Vector{Bool}                      
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  Bool[0, 1, 1, 1, 1, 0, 0, 0, 0, …
   2 │ C1_1       [9, 10]                  10  Bool[0, 0, 0, 0, 0, 0, 0, 0, 1, …
   3 │ C1_2       [9, 10]                  10  Bool[0, 0, 0, 0, 0, 0, 0, 0, 1, …
   4 │ C1_3       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  Bool[1, 1, 1, 1, 0, 0, 0]
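
For intuition, `ByRow(indicate_missings)` applies the function to the two columns element by element, which is roughly what a plain broadcast over the column vectors does. Here is a minimal Base-only sketch of that idea. The variables below are stand-ins for the DataFrame columns, used only for illustration, and the helper is repeated so the snippet runs on its own:

```julia
function indicate_missings(missings_range, seqs_length)
    v = zeros(Bool, seqs_length)                            # all-false vector of the row's length
    v[first(missings_range):last(missings_range)] .= true   # flag the positions in the range
    v
end

# Stand-ins for df.missings_range and df.seqs_length (illustrative only).
missings_range = [[2, 5], [9, 10], [1, 4]]
seqs_length = [10, 10, 7]

# One call per "row": the i-th range is paired with the i-th length.
ranges_missing = indicate_missings.(missings_range, seqs_length)

foreach(println, ranges_missing)
```

With a real DataFrame, the same broadcast assigned to `df.ranges_missing` should add the column in place instead of returning a new DataFrame as `transform` does.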

If you have questions about any part of this, please feel free to ask!

rocco_sprmnt21 · September 14, 2022, 10:50am

Perhaps what you were trying to do could be done as follows, although the solution from @digital_carver is the “good” one.

julia> df = DataFrame(
        pdb_names = ["C1_1", "C1_1", "C1_2", "C1_3", "C2_1", "C2_2"],
        missings_range = [[2,5],[9,10], [9,10], [1,4], [1,4],[1,4]],
        seqs_length = [10,10,10,7,7,7])
6×3 DataFrame
 Row │ pdb_names  missings_range  seqs_length 
     │ String     Vector{Int64}   Int64
─────┼────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10
   2 │ C1_1       [9, 10]                  10
   3 │ C1_2       [9, 10]                  10
   4 │ C1_3       [1, 4]                    7
   5 │ C2_1       [1, 4]                    7
   6 │ C2_2       [1, 4]                    7

julia> insertcols!(df, :x => repeat.(Ref([0]),df.seqs_length))
# or 
julia> insertcols!(df, :x => fill.(0,df.seqs_length))
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  x
     │ String     Vector{Int64}   Int64        Vector{Int64}
─────┼────────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   4 │ C1_3       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  [0, 0, 0, 0, 0, 0, 0]

julia> combine(df, [:x, :missings_range]=>ByRow((x,y)->x[range(y...)].=1))
6×1 DataFrame
 Row │ x_missings_range_function 
     │ SubArray…
─────┼───────────────────────────
   1 │ [1, 1, 1, 1]
   2 │ [1, 1]
   3 │ [1, 1]
   4 │ [1, 1, 1, 1]
   5 │ [1, 1, 1, 1]
   6 │ [1, 1, 1, 1]

julia> df
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  x
     │ String     Vector{Int64}   Int64        Vector{Int64}
─────┼────────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
   4 │ C1_3       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   6 │ C2_2       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]

Regarding the result of these expressions: can anyone explain why the contents of column :x were mutated?
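
A likely explanation: `ByRow` hands the anonymous function the very vectors stored in column `:x`, not copies, so the broadcast assignment `x[range(y...)] .= 1` writes into them in place, while `combine` only collects the values the function returns (the views printed above). The Base-only sketch below reproduces the effect without DataFrames. The names `xs`, `ranges` and `mark!` are made up for illustration, and the two-argument `range(start, stop)` form assumes Julia 1.7 or later:

```julia
# Hypothetical stand-ins for the :x and :missings_range columns.
xs = [fill(0, 10), fill(0, 7)]
ranges = [[2, 5], [1, 4]]

# Same body as the anonymous function passed to ByRow above.
mark!(x, y) = x[range(y...)] .= 1

# Broadcasting plays the role of ByRow here: each call receives a
# reference to the stored vector, so the assignment mutates it in place.
returned = mark!.(xs, ranges)

println(returned[1])  # the slice that was assigned
println(xs[1])        # the original vector, now modified
```

If the original column should stay untouched, one option is to work on `copy(x)` inside the function and return that copy.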

A different solution … without insertcols!()


transform(df, [:missings_range, :seqs_length] => ByRow((y, x) -> [e in range(y...) ? 1 : 0 for e in 1:x]))

transform(df, [:missings_range, :seqs_length] => ByRow((y, x) -> [e in range(y...) for e in 1:x]))

transform(df, [:missings_range, :seqs_length]=>ByRow((x,y)->reverse(bitstring(sum((^).(2,range(x...).-1))))[1:y])=>:rngs)
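
The last one-liner is terse, so here is a Base-only sketch of what it seems to compute, step by step. Note that it returns a `String` of '0'/'1' characters rather than a vector, and, since it relies on the bits of an `Int64`, it cannot represent sequences longer than 64 positions. The function name `bitmask_string` is made up for this illustration:

```julia
function bitmask_string(missings_range, seqs_length)
    positions = range(missings_range...)   # e.g. 2:5 (needs Julia 1.7+)
    n = sum(2 .^ (positions .- 1))         # integer with bit k-1 set for each position k
    bits = bitstring(n)                    # 64 characters, most significant bit first
    reverse(bits)[1:seqs_length]           # least significant bit first, cut to length
end

println(bitmask_string([2, 5], 10))  # 0111100000
```

For the first row, the positions 2:5 give 2 + 4 + 8 + 16 = 30, whose reversed bit pattern starts with `0111100000`.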

pdeffebach · September 14, 2022, 1:45pm

How about this?

julia> using DataFramesMeta;

julia> @rtransform df :missing_ranges = begin
           x = zeros(Int, :seqs_length)
           mi = :missings_range[1]
           mx = :missings_range[2]
           x[mi:mx] .= 1
           x
       end
6×4 DataFrame
 Row │ pdb_names  missings_range  seqs_length  missing_ranges             ⋯
     │ String     Vector{Int64}   Int64        Vector{Int64}              ⋯
─────┼─────────────────────────────────────────────────────────────────────
   1 │ C1_1       [2, 5]                   10  [0, 1, 1, 1, 1, 0, 0, 0, 0 ⋯
   2 │ C1_1       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1
   3 │ C1_2       [9, 10]                  10  [0, 0, 0, 0, 0, 0, 0, 0, 1
   4 │ C1_3       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
   5 │ C2_1       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]      ⋯
   6 │ C2_2       [1, 4]                    7  [1, 1, 1, 1, 0, 0, 0]
                                                           1 column omitted

Juan_Mac_Donagh · September 14, 2022, 2:45pm

Hi all! Thanks a lot. All the solutions worked great. Next time I should directly try to create a function, instead of trying to solve it only using the package and base. I marked the first one as the solution, only because it was the first one!

Thanks again, cheers

rafael.guerra · September 14, 2022, 3:03pm

Sometimes using only the package and base might be simpler:

df.ranges_missing = zeros.(Int, df.seqs_length)
for r in eachrow(df)
    mi, mx = r.missings_range
    r.ranges_missing[mi:mx] .= 1
end
