How to reduce redundancy in a list inside a DataFrame

Hi all. I have two questions, one very basic and one not so basic.

I have a DataFrame that looks like this:

df = DataFrame([]) 
df.list = [[1,2,3,6,12,13,14],[20,21,23,24,50],[1,3,5,7,10,11,12,13]]

And I need to generate a new column, that only stores the β€œpoints” of each array. This means, only the first and last number in a sequence of consecutive numbers.
It would look like this:

df.new = [[1,3,12,14],[20,21,23,24],[10,13]] #it is important that they are made up of pairs!

My solution, so far, looks something like this:

	df.list2 = [[],[],[]]

	for r in eachrow(df) #loop the df
		j=0 #this is for "walking" down the array
		for i in r.list
			j+=1
			if j == 1 #append the fist number. This needs to go, maybe the fist numbers doesn't have a consecutive pair. 
				append!(r.list2, r[1][j])

			elseif j != 1 
				if (r[1][j]-1) != r[1][j-1]  #check if the number is not consecutive to the previous one 
							append!(r.list2, r[1][j])

			elseif j == length(r[1]) #This also should go. It works to append the last value, but it might be an isolated one!
					append!(r.list2,r[1][j])


				
				end
			end
		end
	end
	


	df

That kinda works. With the first and last part of the loop, I can append the β€œpoints” of each string, and with the middle I can identify some of the numbers that I need. Here is the output:

df.list2 = [[1,6,12,14],[20,23,50], [1,3,5,7,10,13]]

The problem is, first, I’m not sure how to create an empty column automatically, so I can append all my values (if I do something like df.new = missings(Array, nrow(df)), that is what I usually do, I get an error, because I am trying to append values, as it were a list).

Secondly, I could use some help with the loop. I thought about using diff, so I could save me some trouble, but I am not sure if it will improve the loop.
Right now I am still getting the β€œlone” numbers (i e 1,3,5,7 in the last list), and I don’t really know how to get rid of them, and I should find a way to append the last number in the list (in the second list, I can append the 20, 23, but I can’t append the 21, 24, or the 3, 12 on the first one).

I think I can do it with a little more wresting with the for loop, and the different conditions, but any help is welcome.

Thanks a lot! cheers

Writing a loop is fine. Your logic is sufficiently complicated to merit a loop.

But you should put things in functions to organize your code and make it readable

julia> function get_consec(x)
           t = [x[1]]
           for i in 2:length(x)
               # Consecutive with last one
               if x[i] == x[i-1] + 1
                   # If its the last one, add it
                   if i == length(x)
                       push!(t, x[i])
                   # If its the end of the chain, push it
                   elseif x[i] != x[i+1] - 1
                       push!(t, x[i])
                   end
               # Not consecutive, start a new chain
               else
                   if i != length(x)
                       if x[i] == x[i+1] -1
                           push!(t, x[i])
                       end
                   end
               end
           end
           return t
       end;

julia> get_consec(df[1,1])
4-element Vector{Int64}:
  1
  3
 12
 14

using DataFramesMeta.jl you can do this as

julia> using DataFramesMeta

julia> @rtransform df :list2 = get_consec(:list)
3Γ—2 DataFrame
 Row β”‚ list                          list2            
     β”‚ Array…                        Array…           
─────┼────────────────────────────────────────────────
   1 β”‚ [1, 2, 3, 6, 12, 13, 14]      [1, 3, 12, 14]
   2 β”‚ [20, 21, 23, 24, 50]          [20, 21, 23, 24]
   3 β”‚ [1, 3, 5, 7, 10, 11, 12, 13]  [1, 10, 13]

I guess I didn’t get the logic fully right in the function. But the point stands, write a function and use @rtransform.

1 Like

The logic is:

julia> function identify_pairs(v::AbstractVector)
           local start, stop
           initialized = false
           res = eltype(v)[]
           for x in v
               if initialized
                   if x == stop + 1
                       stop = x
                   else
                       if stop > start
                           push!(res, start, stop)
                           initialized = false
                       end
                       start = stop = x
                       initialized = true
                   end
               else
                   start = stop = x
                   initialized = true
               end
           end
           if stop >  start
               push!(res, start, stop)
           end
           return res
       end
identify_pairs (generic function with 1 method)

julia> transform(df, :list => ByRow(identify_pairs) => :new)
3Γ—2 DataFrame
 Row β”‚ list                          new
     β”‚ Array…                        Array…
─────┼────────────────────────────────────────────────
   1 β”‚ [1, 2, 3, 6, 12, 13, 14]      [1, 3, 12, 14]
   2 β”‚ [20, 21, 23, 24, 50]          [20, 21, 23, 24]
   3 β”‚ [1, 3, 5, 7, 10, 11, 12, 13]  [10, 13]
4 Likes

Thanks a lot! I should try to use functions more often, especially in this cases. I understand the logic, cheers!

Thanks!! As I said in the previous comment, I should tap more often on functions. I am marking this one as the solution. Thanks again!

1 Like

Fyi, this Matlab post provides a compact alternative, probably less efficient than Bogumil’s solution.

A possible Julia translation here into (~10% slower for your input data) :

function sortedendpoints(v)
    d = diff(eltype(v)[0; diff(v).==1; 0])
    return transpose([v[d.>0] v[d.<0]])[:]
end

transform(df, :list => ByRow(sortedendpoints) => :new) 
1 Like

This also works:

transform(df, :list => ByRow(f->filter(x->(x-1 ∈ f) ⊻ (x+1 ∈ f),f)) => :new)

(slower in run-time, faster in type-time)

3 Likes

@Dan, what a beautiful one-liner, which also shows the intention most clearly. You might consider adding it to this list.

1 Like

That’s really neat! Thanks! It still amazeme that something that in my (newbie) head was a really complex problem can be solved in many ways, and sometimes like this, in one line. Got lots to learn

Just that this one has quadratic time complexity, so if your arrays get longer you might run into problems quickly.

1 Like