How to reduce redundancy in a list inside a DataFrame

Juan_Mac_Donagh · October 11, 2022, 5:03pm

Hi all. I have two questions, one very basic and one not so basic.

I have a DataFrame that looks like this:

df = DataFrame([]) 
df.list = [[1,2,3,6,12,13,14],[20,21,23,24,50],[1,3,5,7,10,11,12,13]]

And I need to generate a new column, that only stores the “points” of each array. This means, only the first and last number in a sequence of consecutive numbers.
It would look like this:

df.new = [[1,3,12,14],[20,21,23,24],[10,13]] #it is important that they are made up of pairs!

My solution, so far, looks something like this:

	df.list2 = [[],[],[]]

	for r in eachrow(df) #loop the df
		j=0 #this is for "walking" down the array
		for i in r.list
			j+=1
			if j == 1 #append the fist number. This needs to go, maybe the fist numbers doesn't have a consecutive pair. 
				append!(r.list2, r[1][j])

			elseif j != 1 
				if (r[1][j]-1) != r[1][j-1]  #check if the number is not consecutive to the previous one 
							append!(r.list2, r[1][j])

			elseif j == length(r[1]) #This also should go. It works to append the last value, but it might be an isolated one!
					append!(r.list2,r[1][j])


				
				end
			end
		end
	end
	


	df

That kinda works. With the first and last part of the loop, I can append the “points” of each string, and with the middle I can identify some of the numbers that I need. Here is the output:

df.list2 = [[1,6,12,14],[20,23,50], [1,3,5,7,10,13]]

The problem is, first, I’m not sure how to create an empty column automatically, so I can append all my values (if I do something like df.new = missings(Array, nrow(df)), that is what I usually do, I get an error, because I am trying to append values, as it were a list).

Secondly, I could use some help with the loop. I thought about using diff, so I could save me some trouble, but I am not sure if it will improve the loop.
Right now I am still getting the “lone” numbers (i e 1,3,5,7 in the last list), and I don’t really know how to get rid of them, and I should find a way to append the last number in the list (in the second list, I can append the 20, 23, but I can’t append the 21, 24, or the 3, 12 on the first one).

I think I can do it with a little more wresting with the for loop, and the different conditions, but any help is welcome.

Thanks a lot! cheers

pdeffebach · October 11, 2022, 5:23pm

Writing a loop is fine. Your logic is sufficiently complicated to merit a loop.

But you should put things in functions to organize your code and make it readable

julia> function get_consec(x)
           t = [x[1]]
           for i in 2:length(x)
               # Consecutive with last one
               if x[i] == x[i-1] + 1
                   # If its the last one, add it
                   if i == length(x)
                       push!(t, x[i])
                   # If its the end of the chain, push it
                   elseif x[i] != x[i+1] - 1
                       push!(t, x[i])
                   end
               # Not consecutive, start a new chain
               else
                   if i != length(x)
                       if x[i] == x[i+1] -1
                           push!(t, x[i])
                       end
                   end
               end
           end
           return t
       end;

julia> get_consec(df[1,1])
4-element Vector{Int64}:
  1
  3
 12
 14

using DataFramesMeta.jl you can do this as

julia> using DataFramesMeta

julia> @rtransform df :list2 = get_consec(:list)
3×2 DataFrame
 Row │ list                          list2            
     │ Array…                        Array…           
─────┼────────────────────────────────────────────────
   1 │ [1, 2, 3, 6, 12, 13, 14]      [1, 3, 12, 14]
   2 │ [20, 21, 23, 24, 50]          [20, 21, 23, 24]
   3 │ [1, 3, 5, 7, 10, 11, 12, 13]  [1, 10, 13]

I guess I didn’t get the logic fully right in the function. But the point stands, write a function and use @rtransform.

bkamins · October 11, 2022, 5:25pm

The logic is:

julia> function identify_pairs(v::AbstractVector)
           local start, stop
           initialized = false
           res = eltype(v)[]
           for x in v
               if initialized
                   if x == stop + 1
                       stop = x
                   else
                       if stop > start
                           push!(res, start, stop)
                           initialized = false
                       end
                       start = stop = x
                       initialized = true
                   end
               else
                   start = stop = x
                   initialized = true
               end
           end
           if stop >  start
               push!(res, start, stop)
           end
           return res
       end
identify_pairs (generic function with 1 method)

julia> transform(df, :list => ByRow(identify_pairs) => :new)
3×2 DataFrame
 Row │ list                          new
     │ Array…                        Array…
─────┼────────────────────────────────────────────────
   1 │ [1, 2, 3, 6, 12, 13, 14]      [1, 3, 12, 14]
   2 │ [20, 21, 23, 24, 50]          [20, 21, 23, 24]
   3 │ [1, 3, 5, 7, 10, 11, 12, 13]  [10, 13]

Juan_Mac_Donagh · October 11, 2022, 5:41pm

Thanks a lot! I should try to use functions more often, especially in this cases. I understand the logic, cheers!

Juan_Mac_Donagh · October 11, 2022, 5:42pm

Thanks!! As I said in the previous comment, I should tap more often on functions. I am marking this one as the solution. Thanks again!

rafael.guerra · October 11, 2022, 8:09pm

Fyi, this Matlab post provides a compact alternative, probably less efficient than Bogumil’s solution.

A possible Julia translation here into (~10% slower for your input data) :

function sortedendpoints(v)
    d = diff(eltype(v)[0; diff(v).==1; 0])
    return transpose([v[d.>0] v[d.<0]])[:]
end

transform(df, :list => ByRow(sortedendpoints) => :new)

Dan · October 11, 2022, 11:48pm

This also works:

transform(df, :list => ByRow(f->filter(x->(x-1 ∈ f) ⊻ (x+1 ∈ f),f)) => :new)

(slower in run-time, faster in type-time)

rafael.guerra · October 12, 2022, 7:14am

@Dan, what a beautiful one-liner, which also shows the intention most clearly. You might consider adding it to this list.

Juan_Mac_Donagh · October 12, 2022, 12:48pm

That’s really neat! Thanks! It still amazeme that something that in my (newbie) head was a really complex problem can be solved in many ways, and sometimes like this, in one line. Got lots to learn

jules · October 12, 2022, 1:20pm

Just that this one has quadratic time complexity, so if your arrays get longer you might run into problems quickly.

Topic		Replies	Views
Counting in dataframes Data dataframes	7	1567	June 7, 2023
"Elongate" data frame based on column that contains arrays New to Julia dataframes	2	397	January 16, 2022
I have an array of 31 arrays and would like to make it a DataFrame. Need some help General Usage	4	282	April 6, 2020
How to correctly create ''slices'' of a column by using indexes of another one, in a DataFrame New to Julia dataframes	7	607	November 17, 2022
Help needed regarding making list, and modifying them, inside a DataFrame General Usage dataframes	5	334	September 14, 2022

How to reduce redundancy in a list inside a DataFrame

Related topics